
Applying Multiple Instance Learning (MIL) to visual region proposals within the PE-AV contrastive loop could enable unsupervised pixel-level sound source localization, tackling a visual analogue of the 'Cocktail Party Problem'.

Feasibility: 7 Novelty: 8

Motivation

Standard global contrastive learning assumes the whole image corresponds to the audio. In multi-source scenes (e.g., a street with a car honking and a dog barking), global pooling dilutes the correspondence signal. By treating image regions as a 'bag' of instances in which at least one correlates with the dominant sound, the model can learn to localize sounds without bounding-box supervision.

Proposed Method

Modify the visual encoder to output a set of region embeddings (using an off-the-shelf region proposal network or grid features) rather than a single global vector. Apply a MIL-based contrastive loss in which the audio embedding is optimized to match the best-matching visual region within its paired image, while being pushed away from the regions of other images (see the sketch below). Evaluate on the MUSIC dataset for source localization accuracy.
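A minimal sketch of the MIL contrastive objective in PyTorch, assuming a batch where each image contributes R region embeddings and each clip one audio embedding; the function name, tensor shapes, and temperature value are illustrative assumptions, not details from the source paper.

```python
import torch
import torch.nn.functional as F

def mil_contrastive_loss(region_emb: torch.Tensor,
                         audio_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """MIL contrastive loss: each audio clip is matched to its single
    best-aligned region (the positive instance in the bag) and
    contrasted against all regions of every other image in the batch.

    region_emb: (B, R, D) -- bag of R region embeddings per image
    audio_emb:  (B, D)    -- one audio embedding per clip
    """
    region_emb = F.normalize(region_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Similarity of every clip to every region of every image: (B, B, R).
    sim = torch.einsum("ad,brd->abr", audio_emb, region_emb) / temperature

    # MIL pooling: score each (clip, image) pair by its best region.
    logits = sim.max(dim=-1).values  # (B, B)

    # Symmetric InfoNCE: matched (clip, image) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

At inference, the pre-pooling similarity map over regions serves directly as a localization heatmap; if the hard max proves unstable early in training, a softer pooling such as log-sum-exp over regions is a natural substitute.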

Expected Contribution

A method to spatially disentangle audio sources in complex visual scenes, extending PE-AV from global perception to dense, localized understanding.

Required Resources

PE-AV architecture, object detection/segmentation backbone (e.g., SAM or Faster R-CNN) to generate proposals, standard audiovisual datasets.

Source Paper

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
