Conditioning multi-view video generation on coarse, low-fidelity physics simulation states alongside visual prompts will significantly reduce physical hallucinations (e.g., object interpenetration) and improve sim-to-real policy transfer.
Motivation
While RoboVIP addresses visual consistency, generative video models often ignore physical constraints, producing 'dream physics' that can degrade policy robustness. Anchoring generation to a simplified physics engine could combine the visual fidelity of GenAI with the structural integrity of simulation.
Proposed Method
Develop a 'Physics-Adapter' (analogous to ControlNet) for the RoboVIP pipeline that accepts rendered depth or segmentation maps from a fast, low-fidelity simulator (e.g., MuJoCo) as structural conditioning. Generate a synthetic dataset in which visual appearance is controlled by Visual Identity Prompting while motion dynamics are constrained by the physics engine's output. Train a manipulation policy on this data and evaluate it against standard RoboVIP augmentation on contact-rich tasks; a sketch of the conditioning pathway follows below.
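As a minimal sketch of how the structural-conditioning pathway could be wired, the snippet below renders a depth map from a MuJoCo rollout and passes it through a zero-initialized adapter that injects it as a residual into frozen backbone features. The PhysicsAdapter class, its channel sizes and feature resolution, and the scene file name are illustrative assumptions, not part of RoboVIP; the zero-initialized output convolution follows the ControlNet recipe so the frozen video model's behavior is unchanged at the start of training.

```python
import mujoco
import torch
import torch.nn as nn


class PhysicsAdapter(nn.Module):
    """ControlNet-style adapter (hypothetical): encodes a rendered depth map
    and adds it as a residual to the frozen video model's hidden features.
    The output conv is zero-initialized, so at step 0 the adapter is a no-op
    and the pretrained model's behavior is preserved."""

    def __init__(self, cond_channels: int = 1, feat_channels: int = 320):
        super().__init__()
        # Small conv encoder; two stride-2 stages downsample 256x256 -> 64x64
        # to match an assumed backbone feature resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(cond_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_channels, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.zero_conv = nn.Conv2d(feat_channels, feat_channels, 1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, backbone_feats: torch.Tensor,
                cond_map: torch.Tensor) -> torch.Tensor:
        return backbone_feats + self.zero_conv(self.encoder(cond_map))


# --- Structural conditioning from a fast, low-fidelity MuJoCo rollout ---
model = mujoco.MjModel.from_xml_path("scene.xml")  # placeholder scene file
data = mujoco.MjData(model)
renderer = mujoco.Renderer(model, height=256, width=256)
renderer.enable_depth_rendering()

mujoco.mj_step(model, data)
renderer.update_scene(data)  # optionally pass a named camera
depth = renderer.render()    # float32 depth in meters, shape (256, 256)

# Normalize the depth map and feed it through the adapter alongside
# stand-in backbone features of the assumed shape.
cond = torch.from_numpy(depth / depth.max()).float()[None, None]  # (1,1,256,256)
feats = torch.zeros(1, 320, 64, 64)  # stand-in for frozen backbone activations
adapter = PhysicsAdapter()
out = adapter(feats, cond)  # equals feats at init, thanks to the zero conv
```

In the full pipeline, one such residual injection would presumably be applied per view and per denoising block, with the conditioning maps drawn from the same simulator rollout that defines the clip's ground-truth dynamics.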
Expected Contribution
A method to enforce physical consistency in generative video augmentation, reducing the domain gap caused by unrealistic dynamics in synthetic data.
Required Resources
Pretrained video diffusion model weights, a physics simulator (Isaac Gym or MuJoCo), and a GPU cluster (H100s or A100s) for adapter training.
Source Paper
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation