Inference-time 'Reasoning Guidance' can be derived by converting MMGR's logical constraints into differentiable energy functions, significantly improving physical consistency in video generation without retraining.
Motivation
Current video generation models optimize for perceptual quality (e.g., as measured by FVD) rather than logical consistency, often hallucinating physically impossible transitions. While MMGR evaluates these failures, it does not solve them; using the benchmark's constraints as an active guidance signal during the diffusion reverse process could bridge the gap between visual fidelity and reasoning.
Proposed Method
Develop a set of differentiable logic functions corresponding to MMGR's categories (e.g., object permanence, gravity). During the sampling phase of a latent diffusion video model, compute the gradient of these logic functions with respect to the generated latent, in the spirit of classifier guidance. At each denoising step, update the latent to reduce the 'reasoning error' while preserving the original text conditioning.
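The update rule above can be sketched with a toy example. This is a minimal, hypothetical illustration, not the actual method: it uses a single scalar "height per frame" signal instead of real video latents, a hand-written gravity penalty (height should not increase for a falling object) with an analytic gradient, and an assumed `guided_step` helper standing in for one denoising step of a diffusion sampler. All function names are placeholders.

```python
import numpy as np

def gravity_energy(heights):
    # Toy gravity constraint: penalize frames where the tracked
    # object's height increases between consecutive frames.
    deltas = np.diff(heights)
    violations = np.maximum(deltas, 0.0)
    return float(np.sum(violations ** 2))

def gravity_energy_grad(heights):
    # Analytic gradient of gravity_energy w.r.t. each frame's height.
    # d/dh of sum(max(h[t+1] - h[t], 0)^2) distributes +2v to frame t+1
    # and -2v to frame t for each violating delta v.
    deltas = np.diff(heights)
    violations = np.maximum(deltas, 0.0)
    grad = np.zeros_like(heights)
    grad[1:] += 2.0 * violations
    grad[:-1] -= 2.0 * violations
    return grad

def guided_step(latent, denoised_estimate, guidance_scale=0.1):
    # One guidance update within a denoising step (hypothetical):
    # nudge the latent downhill on the reasoning-energy landscape,
    # evaluated on the current denoised estimate.
    grad = gravity_energy_grad(denoised_estimate)
    return latent - guidance_scale * grad

# A rising trajectory violates the constraint; one guided step lowers the energy.
heights = np.array([1.0, 2.0, 3.0])
corrected = guided_step(heights, heights, guidance_scale=0.1)
```

In a real implementation the energy would be evaluated on decoded (or partially decoded) latents and differentiated via autograd rather than by hand, and the guidance term would be added alongside the model's own score estimate, as in classifier guidance.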
Expected Contribution
A plug-and-play inference method that improves MMGR scores for existing models (e.g., Sora or Stable Video Diffusion), demonstrating that reasoning can be imposed as an inference-time constraint rather than only learned from training data.
Required Resources
Access to weights of a state-of-the-art open-source video model (e.g., SVD or AnimateDiff), high-end GPU cluster for inference experimentation, and the MMGR dataset definitions.