
Incorporating a 'Temporal Jitter' auxiliary objective into the PE-AV framework will enable zero-shot fine-grained event synchronization, allowing the model to distinguish between synchronous and asynchronous audiovisual events (e.g., lip-syncing, ball bounces) without explicit temporal supervision.

Feasibility: 8 Novelty: 7

Motivation

Current large-scale correspondence learning (such as PE-AV) optimizes for global semantic alignment (does this video contain a dog?) but often ignores fine-grained temporal alignment (is the bark synchronized with the jaw movement?). Closing this gap would make the model useful for deepfake detection and automated dubbing verification.

Proposed Method

Extend the PE-AV training pipeline by introducing a self-supervised temporal pretext task. For a given positive video-audio pair, generate hard negatives by shifting the audio track by small offsets (±200-500 ms). Train the model to maximize similarity for the synchronized pair while minimizing it for the temporally jittered versions, forcing the encoders to attend to the simultaneity of visual motion and audio onsets rather than just semantic content. A sketch of this objective is given below.
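
The following is a minimal sketch of the jitter objective, not the PE-AV implementation: `video_encoder`, `audio_encoder`, and the helper names are hypothetical placeholders, and the InfoNCE-style formulation is one reasonable choice among several.

```python
# Sketch of the temporal-jitter auxiliary objective (assumptions: encoder
# interfaces and hyperparameters are illustrative, not taken from PE-AV).

import torch
import torch.nn.functional as F


def jitter_audio(audio: torch.Tensor, sample_rate: int,
                 min_shift_ms: int = 200, max_shift_ms: int = 500) -> torch.Tensor:
    """Roll the waveform by a random ±200-500 ms offset to build a hard negative."""
    shift_ms = torch.randint(min_shift_ms, max_shift_ms + 1, (1,)).item()
    sign = 1 if torch.rand(1).item() < 0.5 else -1
    shift = sign * int(sample_rate * shift_ms / 1000)
    return torch.roll(audio, shifts=shift, dims=-1)


def temporal_jitter_loss(video_encoder, audio_encoder,
                         video: torch.Tensor, audio: torch.Tensor,
                         sample_rate: int, num_negatives: int = 4,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: the synchronized clip must score higher than jittered copies."""
    v = F.normalize(video_encoder(video), dim=-1)        # (B, D) video embeddings
    a_pos = F.normalize(audio_encoder(audio), dim=-1)    # (B, D) synchronized audio

    # Encode several temporally jittered copies of the same audio as hard negatives.
    a_negs = [F.normalize(audio_encoder(jitter_audio(audio, sample_rate)), dim=-1)
              for _ in range(num_negatives)]

    pos_sim = (v * a_pos).sum(dim=-1, keepdim=True)                           # (B, 1)
    neg_sim = torch.stack([(v * a_n).sum(dim=-1) for a_n in a_negs], dim=1)   # (B, K)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature               # (B, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

Because the negatives are jittered copies of the same clip, their semantic content is identical to the positive; only the timing differs, so the contrast can only be resolved by attending to audiovisual onset alignment.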

Expected Contribution

A robust audiovisual model capable of zero-shot active speaker detection and synchronization verification, surpassing current semantic-only baselines.

Required Resources

Access to the original PE-AV training dataset (or a subset such as AudioSet/VGGSound), a GPU cluster for retraining/fine-tuning (e.g., 4-8 A100s), and evaluation benchmarks for AV sync (e.g., LRS3).

Source Paper

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
