The PE-AV latent space can serve as a semantic bridge for 'Foley-Driven Image Animation,' where the audio embedding directly controls the motion dynamics of a static image via a diffusion adapter.
Motivation
Text-to-video models struggle to capture the rhythmic and textural nuances of sound (e.g., the specific visual vibration of a cello string). Since PE-AV achieves state-of-the-art multimodal correspondence, its audio embeddings likely encode the dynamic information needed to steer generative video models more precisely than text prompts.
Proposed Method
Freeze the pre-trained PE-AV audio encoder. Train a lightweight adapter network (e.g., a Q-Former or MLP) that maps PE-AV audio embeddings into the conditioning space of a pre-trained video diffusion model (such as Stable Video Diffusion). Train on high-motion audiovisual clips (e.g., AIST++, VGGSound) with the diffusion model's denoising (reconstruction) objective, so that the generated video motion aligns with audio transients; a minimal adapter sketch follows below.
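To make the adapter design concrete, the PyTorch sketch below shows one possible mapping and training step. It is a minimal illustration under stated assumptions: the embedding dimensions, token count, and the audio_encoder / denoiser / scheduler interfaces are hypothetical placeholders, not the actual PE-AV or Stable Video Diffusion APIs. Only the adapter parameters receive gradients; the pre-trained components stay frozen.

```python
# Sketch of the audio-to-conditioning adapter and one training step.
# Assumptions (hypothetical, not the real PE-AV / SVD interfaces):
# dimensions, token count, and the audio_encoder/denoiser/scheduler objects.
import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_EMB_DIM = 1024   # assumed PE-AV audio embedding size
COND_DIM = 1024        # assumed cross-attention context size of the video diffusion model
NUM_COND_TOKENS = 8    # number of conditioning tokens the adapter emits


class AudioConditioningAdapter(nn.Module):
    """Maps a frozen PE-AV audio embedding to a sequence of conditioning
    tokens consumed by the video diffusion model's cross-attention layers."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(AUDIO_EMB_DIM, 2 * COND_DIM),
            nn.GELU(),
            nn.Linear(2 * COND_DIM, NUM_COND_TOKENS * COND_DIM),
        )

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (B, AUDIO_EMB_DIM) -> (B, NUM_COND_TOKENS, COND_DIM)
        return self.mlp(audio_emb).view(-1, NUM_COND_TOKENS, COND_DIM)


def training_step(adapter, audio_encoder, denoiser, scheduler, batch, optimizer):
    """One adapter-only update with a standard denoising objective.
    audio_encoder (frozen PE-AV) and denoiser (frozen video diffusion UNet)
    are stand-ins for the real pretrained components."""
    video_latents = batch["video_latents"]   # (B, T, C, H, W) VAE latents of the clip
    audio = batch["audio"]

    with torch.no_grad():
        audio_emb = audio_encoder(audio)     # (B, AUDIO_EMB_DIM), frozen
    cond_tokens = adapter(audio_emb)         # (B, NUM_COND_TOKENS, COND_DIM)

    # Sample a diffusion timestep and noise the clean latents.
    noise = torch.randn_like(video_latents)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (video_latents.size(0),), device=video_latents.device)
    noisy_latents = scheduler.add_noise(video_latents, noise, t)

    # Predict the noise conditioned on the audio-derived tokens.
    pred = denoiser(noisy_latents, t, encoder_hidden_states=cond_tokens)
    loss = F.mse_loss(pred, noise)           # noise-prediction ("reconstruction") loss

    optimizer.zero_grad()
    loss.backward()                          # gradients reach only the adapter
    optimizer.step()
    return loss.item()
```

In this setup the optimizer is built over adapter.parameters() alone, so the frozen PE-AV encoder and diffusion backbone are untouched and the adapter learns to translate audio dynamics into the conditioning tokens the denoiser already understands.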
Expected Contribution
A novel framework for audio-conditioned video generation that enables rhythm-aware image animation, demonstrating the generative utility of discriminative PE-AV embeddings.
Required Resources
Pre-trained PE-AV weights, a video diffusion model (SVD/AnimateDiff), high-end GPUs for diffusion training (e.g., 8x A100s), and video-audio datasets.
Source Paper
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning