Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

7.63 2512.19687 · 2025-12-22

Authors

Apoorv Vyas; Heng-Jui Chang; Cheng-Fu Yang; Po-Yao Huang; Luya Gao; Julius Richter; Sanyuan Chen; Matt Le; Piotr Dollár; Christoph Feichtenhofer; Ann Lee; Wei-Ning Hsu

Scores

Novelty: 7.0
Technical: 8.0
Transferability: 7.0
Momentum: 9.0
Evidence: 7.7
Breakthrough: 7.3

Rationale

The paper presents PE-AV, a new set of encoders that advance audiovisual perception through large-scale multimodal correspondence learning. The work extends cross-modal embeddings to new tasks such as speech retrieval, which broadens its applicability. Its main technical contributions are a large, diverse audiovisual training dataset and the use of multiple contrastive objectives, which together address limited data diversity and weak zero-shot performance. The approach fits current research trends in multimodal learning and achieves state-of-the-art results on standard benchmarks, which supports its claims. Its impact on multimodal AI tasks suggests strong potential for continued influence in the field.
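To make "multiple contrastive objectives" concrete, the following is a minimal sketch of one common way to combine them: a symmetric InfoNCE loss averaged over every pair of modalities, plus nearest-neighbor retrieval in the shared embedding space. This is a generic illustration, not PE-AV's actual objective. The function names, the temperature of 0.07, the equal weighting of pairs, and the modality keys are all assumptions made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def _cross_entropy_diag(logits):
    """Mean over rows i of -log softmax(logits[i])[i]; row i's positive is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches whose i-th rows are a matched pair."""
    logits = [[s / temperature for s in row] for row in similarity_matrix(a, b)]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


def multi_pair_loss(embeddings, temperature=0.07):
    """Average the symmetric loss over all modality pairs (assumed equal weights).

    `embeddings` maps a modality name to a batch of vectors and must contain
    at least two modalities with the same batch size.
    """
    names = sorted(embeddings)
    pairs = [(m, n) for i, m in enumerate(names) for n in names[i + 1:]]
    losses = [symmetric_info_nce(embeddings[m], embeddings[n], temperature)
              for m, n in pairs]
    return sum(losses) / len(pairs)


def retrieve(queries, gallery):
    """Zero-shot retrieval: index of the most similar gallery item per query."""
    sims = similarity_matrix(queries, gallery)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]
```

With toy embeddings keyed as `"audio"`, `"video"`, and `"text"`, `multi_pair_loss` is low when matching rows point in similar directions and rises when one modality's rows are shuffled. The same similarity matrix drives `retrieve`, which is why a model trained this way can be evaluated on retrieval without task-specific fine-tuning.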