
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

7.03 2601.05241 · 2026-01-08

Authors

Boyang Wang; Haoran Zhang; Shujie Zhang; Jinkun Hao; Mingda Jia; Qi Lv; Yucheng Mao; Zhaoyang Lyu; Jia Zeng; Xudong Xu; Jiangmiao Pang

Scores

Novelty: 6.3
Technical: 7.3
Transferability: 6.7
Momentum: 8.8
Evidence: 7.2
Breakthrough: 5.8

Rationale

This paper addresses the critical bottleneck of data scarcity in robotics by leveraging generative video models for augmentation, specifically tackling the multi-view consistency and temporal coherence challenges that modern policies depend on. By introducing 'Visual Identity Prompting,' it moves beyond ambiguous text prompts to precise visual control, aligning with the high-momentum trend of using generative AI to scale embodied intelligence. The approach applies existing generative techniques (likely similar to IP-Adapter or video diffusion) to a new domain, but integrating them into a working pipeline that demonstrates real-world policy improvement makes the work practically significant. Long-term impact will depend on how well video generation competes with, or complements, physics-based simulation.
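To make the 'visual identity prompting' idea concrete, below is a minimal sketch of IP-Adapter-style conditioning: reference-image "identity" tokens are projected and concatenated with text tokens as the key/value stream of a cross-attention block inside a video denoiser. This is an assumption-laden illustration of the general technique, not the paper's actual architecture; all class names, dimensions, and token counts are hypothetical.

```python
import torch
import torch.nn as nn


class IdentityPromptedCrossAttention(nn.Module):
    """Hypothetical cross-attention block: the conditioning stream concatenates
    text-prompt tokens with 'visual identity' tokens extracted from reference
    crops of the target object/scene. Names and sizes are illustrative only."""

    def __init__(self, dim: int = 320, text_dim: int = 768,
                 id_dim: int = 512, heads: int = 8):
        super().__init__()
        # Project both conditioning streams into the denoiser's hidden size.
        self.text_proj = nn.Linear(text_dim, dim)
        self.id_proj = nn.Linear(id_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latent_tokens, text_tokens, identity_tokens):
        # latent_tokens:   (B, N_latent, dim)    flattened video latent features
        # text_tokens:     (B, N_text, text_dim) e.g. CLIP text embeddings
        # identity_tokens: (B, N_id, id_dim)     embeddings of reference images
        cond = torch.cat([self.text_proj(text_tokens),
                          self.id_proj(identity_tokens)], dim=1)
        attended, _ = self.attn(query=self.norm(latent_tokens),
                                key=cond, value=cond)
        return latent_tokens + attended  # residual update of the latents


if __name__ == "__main__":
    block = IdentityPromptedCrossAttention()
    latents = torch.randn(2, 16 * 32, 320)   # e.g. 16 frames x 32 spatial tokens
    text = torch.randn(2, 77, 768)           # text-prompt embeddings
    identity = torch.randn(2, 4, 512)        # 4 reference-image identity tokens
    out = block(latents, text, identity)
    print(out.shape)                          # torch.Size([2, 512, 320])
```

The point of the sketch is the contrast the rationale draws: identity tokens carry the exact appearance of the manipulated objects and scene, so the generated multi-view videos stay visually consistent with the robot's real workspace, whereas a text prompt alone leaves appearance underspecified.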