Cross-modal alignment of emergent temporal abstractions allows video-trained autoregressive models to guide exploration in proprioception-based RL agents without shared training data.

Feasibility: 6 Novelty: 9

Motivation

If temporal abstractions such as 'open door' or 'pick object' capture fundamental causal structure, that structure should be invariant across modalities. Demonstrating that abstractions from a video-prediction autoregressive model can zero-shot guide the exploration of a robot trained only on joint states would be a major breakthrough for embodied AI.

Proposed Method

1. Pre-train a large autoregressive video-prediction model on human manipulation datasets (e.g., Ego4D).
2. Extract the temporal abstraction signals (attention shifts / segment boundaries) from the video model.
3. Train a separate RL agent on state-based inputs (proprioception) for a similar task.
4. Use the video model's abstraction timing to trigger 'sub-goal' changes or intrinsic rewards for the RL agent, synchronizing the RL agent's exploration rhythm with the human video demonstrations (see the sketch below).
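A minimal sketch of steps 2 and 4, assuming the video model already exposes a per-frame "boundary score" (e.g., attention-shift magnitude or next-frame prediction error). Every name here (detect_boundaries, RhythmReward, the synthetic scores) is a hypothetical illustration rather than an existing API, and the proprioceptive RL loop itself (e.g., PPO/SAC in MuJoCo) is omitted.

```python
# Sketch only: assumes a per-frame boundary score from the video model.
# All names are hypothetical, not an existing library API.
import numpy as np


def detect_boundaries(boundary_scores, threshold_std=3.0, min_gap=10):
    """Return frame indices that are local peaks at least `threshold_std`
    standard deviations above the mean, spaced >= `min_gap` frames apart."""
    scores = np.asarray(boundary_scores, dtype=np.float64)
    cutoff = scores.mean() + threshold_std * scores.std()
    boundaries, last = [], -min_gap
    for t in range(1, len(scores) - 1):
        is_peak = scores[t] > scores[t - 1] and scores[t] >= scores[t + 1]
        if is_peak and scores[t] > cutoff and t - last >= min_gap:
            boundaries.append(t)
            last = t
    return boundaries


class RhythmReward:
    """Intrinsic reward that pays the agent for switching sub-goals at the
    video-derived boundaries and mildly penalizes off-beat switches,
    synchronizing exploration rhythm with the demonstrations."""

    def __init__(self, boundaries, num_frames, horizon,
                 switch_bonus=1.0, off_beat_penalty=0.1):
        # Rescale demo-frame boundary indices onto the agent's episode timeline.
        self.switch_steps = {int(round(b * horizon / num_frames)) for b in boundaries}
        self.switch_bonus = switch_bonus
        self.off_beat_penalty = off_beat_penalty
        self.prev_subgoal = None

    def __call__(self, step, subgoal):
        switched = self.prev_subgoal is not None and subgoal != self.prev_subgoal
        self.prev_subgoal = subgoal
        if step in self.switch_steps:
            return self.switch_bonus if switched else 0.0
        return -self.off_beat_penalty if switched else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for the video model's boundary scores on one 300-frame demo clip,
    # with three clear "event" peaks.
    scores = rng.normal(size=300)
    scores[[60, 150, 240]] += 5.0
    bounds = detect_boundaries(scores)                 # ~[60, 150, 240]

    shaper = RhythmReward(bounds, num_frames=300, horizon=100)
    # Inside the RL loop: total_reward = env_reward + beta * shaper(t, subgoal)
    print("boundaries:", bounds)
    print("hold between boundaries:", shaper(step=5, subgoal=0))   # 0.0
    print("switch at a boundary:   ", shaper(step=20, subgoal=1))  # switch_bonus
```

Note that this shaping only constrains *when* the agent switches sub-goals, not which sub-goal it picks, so the mapping between video segments and proprioceptive skills is still learned rather than hand-aligned.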

Expected Contribution

Demonstration that the 'temporal abstractions' identified in the paper are semantic and transferable across modalities, not just statistical artifacts of a specific input modality.

Required Resources

Large-scale video datasets; significant compute for video model pre-training; robotics simulation environment (MuJoCo/Isaac Gym).

Source Paper

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
