RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

7.59 · arXiv:2601.05241 · 2026-01-08

Authors

Boyang Wang; Haoran Zhang; Shujie Zhang; Jinkun Hao; Mingda Jia; Qi Lv; Yucheng Mao; Zhaoyang Lyu; Jia Zeng; Xudong Xu; Jiangmiao Pang

Scores

Novelty: 6.8
Technical: 8.0
Transferability: 7.0
Momentum: 9.0
Evidence: 8.0
Breakthrough: 6.8

Rationale

This work addresses data scarcity, a critical bottleneck in robotics, by using generative video models to create synthetic training data. It goes beyond standard text-to-image augmentation by addressing multi-view consistency and temporal coherence, both essential for modern robot policies, and uses visual identity prompting for precise scene control. Its alignment with the fast-growing trend of synthetic data for embodied AI, combined with validation on real-world hardware, makes it a relevant and practical contribution to scaling robot learning.
