WorldWarp's geometric propagation can enable 'text-driven 3D object injection' into video streams by inserting proxy geometries into the point cloud and manipulating the spatial noise mask.
Motivation
Video editing often requires complex tracking or layer separation. Since WorldWarp already handles 3D geometry and occlusions explicitly, it should be possible to insert a rough 3D shape (e.g., a cube) into the scene and have the diffusion model 'paint' it into a specific object based on a text prompt, while automatically handling occlusion and lighting interactions with the existing video.
Proposed Method
Modify the WorldWarp pipeline to accept an auxiliary 3D mesh input. Project this mesh into the current view alongside the propagated background geometry. Create a custom noise mask that assigns maximum noise levels to the projected region of the inserted mesh and low noise to the background. Condition the diffusion refiner on a text prompt specific to the inserted region (e.g., 'a red sports car') while maintaining the global context.
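As a concrete illustration, the following is a minimal NumPy sketch of the projection and mask-construction steps. Everything in it is an assumption rather than the WorldWarp API: the camera convention (3x3 intrinsics K, 4x4 world-to-camera matrix w2c), the mask semantics (per-pixel noise levels in [0, 1], high over the inserted mesh, low over the propagated background), and the refiner interface at the end are hypothetical placeholders.

    import numpy as np

    def project_mesh(vertices_world, K, w2c):
        """Project Nx3 world-space mesh vertices to pixel coordinates."""
        v_h = np.concatenate([vertices_world, np.ones((len(vertices_world), 1))], axis=1)
        v_cam = (w2c @ v_h.T).T[:, :3]
        v_cam = v_cam[v_cam[:, 2] > 1e-6]        # keep vertices in front of the camera
        uv = (K @ v_cam.T).T                     # perspective projection
        return uv[:, :2] / uv[:, 2:3], v_cam[:, 2]

    def build_noise_mask(uv, H, W, hi=1.0, lo=0.1, dilate=5):
        """Per-pixel noise levels: `hi` over the projected mesh, `lo` elsewhere."""
        mask = np.full((H, W), lo, dtype=np.float32)
        px = np.clip(uv.round().astype(int), [0, 0], [W - 1, H - 1])
        mask[px[:, 1], px[:, 0]] = hi            # splat vertices into the mask
        for _ in range(dilate):                  # grow splats into a solid region
            m = mask >= hi
            grown = m.copy()
            grown[1:] |= m[:-1]; grown[:-1] |= m[1:]
            grown[:, 1:] |= m[:, :-1]; grown[:, :-1] |= m[:, 1:]
            mask[grown] = hi
        return mask

    # Toy usage: a unit cube 4 m in front of an identity camera.
    H, W = 256, 448
    K = np.array([[300.0, 0.0, W / 2], [0.0, 300.0, H / 2], [0.0, 0.0, 1.0]])
    w2c = np.eye(4)
    cube = np.array([[x, y, 4.0 + z] for x in (-0.5, 0.5)
                                     for y in (-0.5, 0.5)
                                     for z in (-0.5, 0.5)])
    uv, depth = project_mesh(cube, K, w2c)
    noise_mask = build_noise_mask(uv, H, W)
    # Hypothetical refiner call: denoise the warped frame with per-pixel noise
    # levels and a region-specific prompt, e.g.
    #   frame = refiner(warped_frame, noise_mask, prompt="a red sports car")

Splatting vertices and dilating is a crude stand-in for triangle rasterization; a full implementation would rasterize the mesh with a z-buffer against the propagated background depth, so the mask itself respects occlusions before the refiner ever runs.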
Expected Contribution
A method for seamless, geometrically consistent insertion of new objects into generated videos, bridging the gap between video generation and 3D scene editing.
Required Resources
WorldWarp codebase, 3D mesh assets, text-conditioned video diffusion backbone.
Source Paper
WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion