
Conditioning multi-view video generation on coarse, low-fidelity physics simulation states alongside visual prompts will significantly reduce physical hallucinations (e.g., object interpenetration) and improve sim-to-real policy transfer.

Feasibility: 8 Novelty: 7

Motivation

While RoboVIP addresses visual consistency across views, generative video models often ignore physical constraints, producing 'dream physics' that can degrade the robustness of policies trained on the generated data. Anchoring generation to a simplified physics engine could combine the visual fidelity of generative models with the physical plausibility of simulation.

Proposed Method

Develop a 'Physics-Adapter' (analogous to ControlNet) for the RoboVIP pipeline that accepts rendered depth or segmentation maps from a fast, low-fidelity simulator (e.g., MuJoCo) as structural conditioning; sketches of both the conditioning extraction and the adapter follow below. Generate a synthetic dataset in which visual appearance is handled by Visual Identity Prompting while motion dynamics are constrained by the physics engine's output. Train a manipulation policy on this data and evaluate it against standard RoboVIP augmentation on contact-rich tasks.
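A minimal sketch of the conditioning-signal side, assuming the modern `mujoco` Python bindings: step a low-fidelity scene forward and render a per-frame depth video to feed the adapter. The scene XML, camera, and step count are placeholder assumptions; the real pipeline would use the task's own assets.

```python
# Sketch: extract per-frame depth conditioning from a MuJoCo rollout.
# Scene XML and horizon are illustrative placeholders.
import mujoco
import numpy as np

XML = """
<mujoco>
  <worldbody>
    <light pos="0 0 3"/>
    <geom type="plane" size="1 1 0.1"/>
    <body pos="0 0 0.5">
      <freejoint/>
      <geom type="box" size="0.05 0.05 0.05"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

renderer = mujoco.Renderer(model, height=256, width=256)
renderer.enable_depth_rendering()

frames = []
for _ in range(60):  # 60 physics steps -> 60 conditioning frames
    mujoco.mj_step(model, data)
    renderer.update_scene(data)   # default free camera
    depth = renderer.render()     # (H, W) float32 depth in meters
    # Normalize to [0, 1] so it can be consumed like a ControlNet image.
    depth = np.clip(depth / depth.max(), 0.0, 1.0)
    frames.append(depth)

conditioning = np.stack(frames)   # (T, H, W) depth video for the adapter
```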
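And a minimal sketch of the adapter side in PyTorch, following the ControlNet recipe: a lightweight encoder maps each simulator render into the backbone's feature space, and the result is injected through a zero-initialized convolution so the frozen video diffusion backbone is unchanged at the start of training. All module names and channel sizes here are illustrative assumptions, not the RoboVIP architecture.

```python
# Sketch: ControlNet-style "Physics-Adapter" with zero-initialized output.
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 conv initialized to zero, as in ControlNet."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class PhysicsAdapter(nn.Module):
    def __init__(self, cond_channels: int = 1, feat_channels: int = 320):
        super().__init__()
        # Lightweight encoder from the simulator render (e.g., a depth map)
        # down to the backbone's latent resolution (3 stride-2 convs = /8).
        self.encoder = nn.Sequential(
            nn.Conv2d(cond_channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_channels, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.out = zero_conv(feat_channels)

    def forward(self, cond: torch.Tensor, backbone_feat: torch.Tensor):
        # cond: (B, 1, H, W) depth render; backbone_feat: (B, C, H/8, W/8).
        # self.out produces zeros at initialization, so the frozen backbone's
        # behavior is preserved until the adapter learns a useful residual.
        return backbone_feat + self.out(self.encoder(cond))

# Usage: apply per frame, with the diffusion backbone frozen.
feat = torch.randn(2, 320, 32, 32)
depth = torch.randn(2, 1, 256, 256)
adapter = PhysicsAdapter()
out = adapter(depth, feat)
assert torch.allclose(out, feat)  # zero-init -> identity at step 0
```

The zero-initialization is the key design choice: the adapter is a no-op at step 0, which is what allows physics conditioning to be fine-tuned onto a pretrained backbone without destabilizing it.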

Expected Contribution

A method to enforce physical consistency in generative video augmentation, reducing the domain gap caused by unrealistic dynamics in synthetic data.

Required Resources

Video diffusion model weights, physics simulator (Isaac Gym/MuJoCo), GPU cluster (H100s/A100s) for training adapters.

Source Paper

RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
