Pre-training video generators on the abstract reasoning subsets of MMGR (e.g., 2D geometric transformations) before fine-tuning on photorealistic video induces 'reasoning priors' that generalize to real-world physical dynamics.
Motivation
Models struggle to learn physics solely from complex, noisy real-world video. A curriculum learning approach that starts with the distilled, abstract logic puzzles in MMGR may let models acquire fundamental causal structures (A causes B) more efficiently than disentangling them from texture and lighting in raw video.
Proposed Method
Train a small-scale Diffusion Transformer (DiT) from scratch with a three-phase curriculum: Phase 1 uses MMGR's abstract reasoning samples (synthetic shapes and logic), Phase 2 introduces low-complexity synthetic physics scenes, and Phase 3 fine-tunes on real-world video. Compare the final model's performance on the MMGR 'physical commonsense' benchmark against a baseline of identical architecture trained only on real-world video for the same total number of training steps.
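A minimal sketch of the proposed curriculum, assuming a standard DDPM-style denoising objective in PyTorch. TinyDiT, fake_clips, the noise schedule, and the step budgets are illustrative placeholders (not MMGR's actual data, a production DiT, or the real hyperparameters); the point is the phase structure and the matched step budget for the baseline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class TinyDiT(torch.nn.Module):
    """Stand-in for a small video Diffusion Transformer. A real DiT would
    patchify the clip and run attention blocks conditioned on timestep t."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):  # x: [B, C, T, H, W], t: [B] (ignored here)
        return self.net(x)

def diffusion_loss(model, clips):
    """Denoising objective: predict the noise injected at a random timestep."""
    noise = torch.randn_like(clips)
    t = torch.randint(0, 1000, (clips.shape[0],), device=clips.device)
    alpha = (1.0 - t.float() / 1000).view(-1, 1, 1, 1, 1)  # toy linear schedule
    noisy = alpha.sqrt() * clips + (1.0 - alpha).sqrt() * noise
    return torch.nn.functional.mse_loss(model(noisy, t), noise)

def train_phase(model, dataset, steps, lr=1e-4, batch_size=8):
    """Run one curriculum phase for a fixed optimizer-step budget."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            (clips,) = next(it)
        except StopIteration:  # restart the loader when an epoch ends
            it = iter(loader)
            (clips,) = next(it)
        opt.zero_grad()
        diffusion_loss(model, clips).backward()
        opt.step()
    return model

def fake_clips(n):
    """Random tensors standing in for the real phase datasets."""
    return TensorDataset(torch.randn(n, 3, 8, 32, 32))

mmgr_abstract, synth_physics, real_video = (fake_clips(64) for _ in range(3))

# Curriculum: abstract logic -> simple synthetic physics -> real video (lower LR).
model = train_phase(TinyDiT(), mmgr_abstract, steps=100)
model = train_phase(model, synth_physics, steps=100)
model = train_phase(model, real_video, steps=200, lr=3e-5)

# Baseline: identical architecture, real video only, matched total step budget.
baseline = train_phase(TinyDiT(), real_video, steps=400)
```

The baseline consumes the same total optimizer-step budget as the three curriculum phases combined, so any gap on the 'physical commonsense' benchmark is attributable to data ordering rather than extra compute.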
Expected Contribution
Evidence that 'reasoning' in generative models is a transferable skill that can be learned from simplified abstract domains, potentially reducing the data requirements for training physically accurate world simulators.
Required Resources
Significant compute for training and fine-tuning DiT models, plus curation of a synthetic curriculum dataset based on MMGR logic.