
Integrating a learned inverse dynamics model as a guidance term during the video diffusion sampling process will significantly reduce physical hallucinations (e.g., object interpenetration) and improve sim-to-real transfer compared to purely visual consistency constraints.

Feasibility: 6 · Novelty: 9

Motivation

Video generation models optimize for visual plausibility, not physical correctness. They often produce trajectories in which objects float or clip through grippers, injecting noise into any policy trained on the resulting data. Enforcing physical constraints during generation is the missing step toward high-fidelity synthetic data.

Proposed Method

1. Train a lightweight inverse dynamics model on real robot proprioception data.
2. Modify the RoboVIP diffusion sampling loop to include a 'physics guidance' term: at each denoising step, estimate the implied forces/actions between frames.
3. Penalize transitions that violate the dynamics model (e.g., high residuals) via classifier-free guidance or energy-based guidance (see the sketch after this list).
4. Compare policy performance trained on physics-guided vs. standard RoboVIP videos.
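A minimal PyTorch sketch of steps 1-3, under two assumptions not stated in the source: the generated rollout is conditioned on a reference action sequence (so the "residual" is measured against those actions), and a decoder mapping predicted frames to proprioceptive states exists. InverseDynamicsMLP, decode_states, eps_model, the scheduler methods, and lambda_phys are hypothetical placeholders, not RoboVIP's actual interfaces; if the videos are not action-conditioned, the energy could instead penalize implied actions that fall outside feasible ranges.

import torch
import torch.nn as nn

class InverseDynamicsMLP(nn.Module):
    """Lightweight inverse dynamics model: predicts the action a_t that takes
    proprioceptive state s_t to s_{t+1}. Trained separately on real robot logs."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s_t: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s_t, s_next], dim=-1))


def physics_guided_step(x_t, t, eps_model, scheduler, idm, decode_states,
                        actions, lambda_phys: float = 1.0):
    """One denoising step with an energy-based physics guidance term.

    x_t           : noisy video latent at timestep t
    eps_model     : video diffusion denoiser (placeholder for RoboVIP's model)
    scheduler     : noise scheduler; predict_x0/step are assumed helper methods
    idm           : frozen InverseDynamicsMLP
    decode_states : maps predicted clean frames -> proprioceptive states (assumed)
    actions       : reference action sequence the rollout should follow
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)

    # Predicted clean sample x0 from the current noisy latent (DDIM-style estimate).
    x0_hat = scheduler.predict_x0(x_t, eps, t)

    # Energy: squared residual between the actions implied by consecutive frames
    # (via the inverse dynamics model) and the reference actions.
    states = decode_states(x0_hat)                # (B, T, state_dim)
    a_hat = idm(states[:, :-1], states[:, 1:])    # (B, T-1, action_dim)
    energy = ((a_hat - actions) ** 2).mean()

    # Steer the noise prediction down the energy gradient, in the spirit of
    # classifier/energy-based guidance.
    grad = torch.autograd.grad(energy, x_t)[0]
    eps_guided = eps + lambda_phys * grad

    return scheduler.step(eps_guided, t, x_t).detach()

The design choice here is classifier-style guidance: the gradient of a physics-violation energy, computed through the predicted clean sample, is added to the noise estimate at each step, so no retraining of the video model is required; only the guidance weight lambda_phys needs tuning.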

Expected Contribution

A method to ground generative video models in physical reality, reducing the 'reality gap' for policies trained on synthetic video data.

Required Resources

High-end GPUs for diffusion inference with gradient-based guidance, a dataset of paired robot proprioception and actions for the inverse dynamics model, and expertise in diffusion model internals.

Source Paper

RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
