The 'reasoning gap' identified by MMGR correlates strongly with the failure of generative models to serve as World Models for reinforcement learning agents, implying that MMGR scores can predict downstream RL transferability.
Motivation
There is a growing push to use video generation models as simulators for training robots and agents. However, if a model fails MMGR checks (e.g., walls disappear between frames), an RL agent trained in its outputs learns invalid policies. Establishing a correlation between MMGR scores and RL agent success rates would validate MMGR as a standard metric for 'World Model' viability, beyond image-generation quality alone.
Proposed Method
Select three video generation models with varying MMGR scores. Use each model to generate synthetic rollouts ('dreaming') for an offline RL agent trained on navigation tasks (the Embodied Navigation subset of MMGR). Evaluate each agent's zero-shot performance in a ground-truth simulator, then analyze the correlation between the generator's MMGR score and the agent's success rate.
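The final analysis step can be sketched as follows. Since only three models are compared, a rank correlation (Spearman's rho) is more defensible than a linear fit. All model names, MMGR scores, and success rates below are hypothetical placeholders, not real measurements; the Spearman computation uses the no-ties closed form rather than a library call to keep the sketch self-contained.

```python
def rank(values):
    """Return 1-based ranks (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman rank correlation via the no-ties closed form:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-model results: MMGR score and zero-shot success
# rate of the agent trained on that model's dreamed rollouts.
mmgr_scores = {"model_a": 0.42, "model_b": 0.61, "model_c": 0.78}
success_rates = {"model_a": 0.15, "model_b": 0.33, "model_c": 0.52}

models = sorted(mmgr_scores)
rho = spearman([mmgr_scores[m] for m in models],
               [success_rates[m] for m in models])
print(f"Spearman rho (MMGR vs. success rate): {rho:.2f}")
```

With n = 3, any monotone relationship yields rho = 1.0, so in practice the study would need either more models or multiple seeds per model before the correlation claim carries statistical weight.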
Expected Contribution
Validation of MMGR as a proxy metric for Embodied AI utility, shifting the evaluation focus of video generation from 'watching' to 'acting'.
Required Resources
RL training pipeline, pre-trained video generation models, and standard RL environments (e.g., Habitat or discrete navigation grids).