Visual Identity Prompting can facilitate zero-shot cross-embodiment transfer by visually 're-skinning' interaction trajectories from a source robot to a target robot morphology without explicit kinematic retargeting.
Motivation
Collecting data for every new robot arm is expensive. If RoboVIP can effectively disentangle the 'motion/task' content from the 'actor appearance', large datasets collected on common robots (e.g., Franka) could be reused to train policies for rare or new robots simply by swapping the visual identity prompt.
Proposed Method
1. Collect a dataset of successful task executions performed by Robot A.
2. Define a visual identity prompt for Robot B from a few static images of Robot B.
3. Run RoboVIP in a video-to-video mode (using ControlNet or a similar structure-preserving conditioning) to regenerate Robot A's videos so they appear to be performed by Robot B (a re-skinning sketch follows this list).
4. Train a visual policy for Robot B using only this synthetic data and evaluate it on real hardware (a behavior-cloning sketch also follows).
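A minimal per-frame sketch of step 3, assuming a diffusers ControlNet pipeline with an IP-Adapter image prompt as a stand-in for RoboVIP's video-to-video identity prompting. The checkpoint IDs, the depth-proxy structure extractor, and the reference image path are illustrative assumptions, not part of the source paper; a real pipeline would also need temporal consistency across frames rather than independent per-frame generation.

```python
# Sketch: re-skin Robot A frames to look like Robot B.
# Assumptions: diffusers ControlNet + IP-Adapter approximate RoboVIP's
# video-to-video mode; model IDs and file paths are illustrative placeholders.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

def extract_structure(frame: Image.Image) -> Image.Image:
    # Structure-preserving condition from Robot A's frame.
    # Placeholder: grayscale proxy; a real pipeline would use a depth or edge estimator.
    return frame.convert("L").convert("RGB")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Visual identity prompt for Robot B: a static reference image injected via
# an IP-Adapter image prompt (stand-in for RoboVIP's visual identity prompt).
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
robot_b_reference = Image.open("robot_b_reference.png")  # hypothetical path

def reskin_video(frames_a: list[Image.Image]) -> list[Image.Image]:
    """Regenerate each Robot A frame so the scene appears to contain Robot B."""
    frames_b = []
    for frame in frames_a:
        structure = extract_structure(frame)
        out = pipe(
            prompt="a Robot B arm manipulating objects on a table",
            image=structure,                     # structure-preserving condition
            ip_adapter_image=robot_b_reference,  # appearance / identity condition
            num_inference_steps=25,
        ).images[0]
        frames_b.append(out)
    return frames_b
```

A behavior-cloning sketch for step 4, under the proposal's assumption that Robot A's logged actions remain valid supervision once the frames have been re-skinned. The policy architecture, action dimensionality, and placeholder tensors are illustrative only; in practice the loader would pair re-skinned frames with the recorded action stream.

```python
# Sketch: train a visual policy for Robot B on (re-skinned frame, action) pairs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

ACTION_DIM = 7  # e.g., 6-DoF end-effector delta + gripper; adjust per robot

class VisualPolicy(nn.Module):
    def __init__(self, action_dim: int = ACTION_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(obs))

# Placeholder tensors standing in for re-skinned frames and logged actions.
frames = torch.rand(256, 3, 128, 128)
actions = torch.rand(256, ACTION_DIM)
loader = DataLoader(TensorDataset(frames, actions), batch_size=32, shuffle=True)

policy = VisualPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
for epoch in range(10):
    for obs, act in loader:
        loss = nn.functional.mse_loss(policy(obs), act)  # behavior-cloning regression loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```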
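The trained policy is then deployed on the real Robot B (step 4's evaluation); the key empirical question is whether the visual gap between synthetic re-skinned frames and real Robot B observations is small enough for zero-shot transfer.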
Expected Contribution
Validation of generative video as a tool for visual (appearance-level) retargeting across embodiments, potentially addressing the data scarcity problem for new robot hardware.
Required Resources
Manipulation datasets from two different robot morphologies, the RoboVIP model, and significant GPU compute for video-to-video translation.
Source Paper
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation