Pre-training video generators on the abstract reasoning subsets of MMGR (e.g., 2D geometric transformations) before fine-tuning on photorealistic video induces 'reasoning priors' that generalize to real-world physical dynamics.
Motivation
Models struggle to learn physics solely from complex, noisy real-world video. A curriculum learning approach that starts with the distilled, abstract logic puzzles in MMGR may let models acquire fundamental causal structures (A causes B) more efficiently than disentangling them from texture and lighting in raw video.
Proposed Method
Train a small-scale Diffusion Transformer (DiT) from scratch with a three-phase curriculum: Phase 1 uses MMGR's abstract reasoning samples (synthetic shapes and logic), Phase 2 introduces low-complexity synthetic physics scenes, and Phase 3 fine-tunes on real-world video. Compare the final model's performance on the MMGR 'physical commonsense' benchmark against a baseline of identical architecture trained only on real-world video for the same total number of training steps.
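A minimal sketch of the proposed curriculum, assuming a standard DDPM-style denoising objective in PyTorch. TinyDiT, fake_clips, the noise schedule, and the step budgets are illustrative placeholders (not MMGR's actual data, a production DiT, or the real hyperparameters); the point is the phase structure and the matched step budget for the baseline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class TinyDiT(torch.nn.Module):
    """Stand-in for a small video Diffusion Transformer. A real DiT would
    patchify the clip and run attention blocks conditioned on timestep t."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):  # x: [B, C, T, H, W], t: [B] (ignored here)
        return self.net(x)

def diffusion_loss(model, clips):
    """Denoising objective: predict the noise injected at a random timestep."""
    noise = torch.randn_like(clips)
    t = torch.randint(0, 1000, (clips.shape[0],), device=clips.device)
    alpha = (1.0 - t.float() / 1000).view(-1, 1, 1, 1, 1)  # toy linear schedule
    noisy = alpha.sqrt() * clips + (1.0 - alpha).sqrt() * noise
    return torch.nn.functional.mse_loss(model(noisy, t), noise)

def train_phase(model, dataset, steps, lr=1e-4, batch_size=8):
    """Run one curriculum phase for a fixed optimizer-step budget."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            (clips,) = next(it)
        except StopIteration:  # restart the loader when an epoch ends
            it = iter(loader)
            (clips,) = next(it)
        opt.zero_grad()
        diffusion_loss(model, clips).backward()
        opt.step()
    return model

def fake_clips(n):
    """Random tensors standing in for the real phase datasets."""
    return TensorDataset(torch.randn(n, 3, 8, 32, 32))

mmgr_abstract, synth_physics, real_video = (fake_clips(64) for _ in range(3))

# Curriculum: abstract logic -> simple synthetic physics -> real video (lower LR).
model = train_phase(TinyDiT(), mmgr_abstract, steps=100)
model = train_phase(model, synth_physics, steps=100)
model = train_phase(model, real_video, steps=200, lr=3e-5)

# Baseline: identical architecture, real video only, matched total step budget.
baseline = train_phase(TinyDiT(), real_video, steps=400)
```

The baseline consumes the same total optimizer-step budget as the three curriculum phases combined, so any gap on the 'physical commonsense' benchmark is attributable to data ordering rather than extra compute.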
Expected Contribution
Evidence that 'reasoning' in generative models is a transferable skill that can be learned from simplified abstract domains, potentially reducing the data requirements for training physically accurate world simulators.
Required Resources
Significant compute for training and fine-tuning DiT models, plus curation of a synthetic curriculum dataset based on MMGR logic.