Applying Multiple Instance Learning (MIL) to visual region proposals within the PE-AV contrastive loop can enable unsupervised pixel-level sound source localization, addressing a visual analogue of the 'Cocktail Party Problem'.
Motivation
Standard global contrastive learning assumes the whole image corresponds to the audio. In multi-source scenes (e.g., a street where a car honks and a dog barks), global pooling dilutes the correspondence signal. By treating image regions as a 'bag' of instances, at least one of which correlates with the dominant sound, the model can learn to localize sounds without bounding-box supervision.
Proposed Method
Modify the visual encoder to output a set of region embeddings (via an off-the-shelf region proposal network or grid features) rather than a single global vector. Apply a MIL-based contrastive loss in which the audio embedding is optimized to match the best-matching visual region in its paired image while being pushed away from regions of other images (see the sketch below). Evaluate source localization accuracy on the MUSIC dataset.
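A minimal PyTorch sketch of one way the MIL contrastive objective could look, assuming a batch of paired audio embeddings and per-image bags of region embeddings; the function name, the max-over-regions reduction, and the symmetric InfoNCE form are illustrative assumptions, not the exact formulation of the source paper:

```python
import torch
import torch.nn.functional as F

def mil_contrastive_loss(audio_emb, region_emb, temperature=0.07):
    """Hypothetical MIL-style contrastive loss sketch.

    audio_emb:  (B, D)    one audio embedding per clip
    region_emb: (B, R, D) a bag of R region embeddings per paired frame

    Each audio is matched to its *best* region in the paired image (the
    MIL reduction), and contrasted against region bags of other images.
    """
    B, R, D = region_emb.shape
    a = F.normalize(audio_emb, dim=-1)   # (B, D)
    v = F.normalize(region_emb, dim=-1)  # (B, R, D)

    # Similarity of every audio to every region in the batch: (B, B, R).
    sim = torch.einsum('bd,nrd->bnr', a, v) / temperature

    # MIL step: reduce each bag to its best-matching instance.
    # A softer alternative would be logsumexp over the region axis.
    bag_sim = sim.max(dim=-1).values     # (B, B)

    # Symmetric InfoNCE: matched audio/image pairs lie on the diagonal.
    targets = torch.arange(B, device=a.device)
    return 0.5 * (F.cross_entropy(bag_sim, targets) +
                  F.cross_entropy(bag_sim.t(), targets))
```

The max reduction implements the MIL assumption that at least one region in the bag explains the audio; at inference, the per-region similarity map itself (before the max) would serve as the localization heatmap.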
Expected Contribution
A method to spatially disentangle audio sources in complex visual scenes, extending PE-AV from global perception to dense, localized understanding.
Required Resources
PE-AV architecture, an object detection/segmentation backbone (e.g., SAM or Faster R-CNN) to generate region proposals, and standard audiovisual datasets.
Source Paper
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning