
Temporal-GDPO: Decoupling normalization across temporal horizons for Process Reward Models (PRMs) in Chain-of-Thought reasoning.

Feasibility: 8 Novelty: 8

Motivation

In reasoning tasks (DeepSeek-R1 style), agents receive dense 'process' rewards for the step-by-step logic and a sparse 'outcome' reward for the final answer. Standard GDPO normalizes each reward across the 'group' of generated samples as a whole. However, process rewards have a fundamentally different variance profile from outcome rewards along the time dimension: rewards at early steps are distributed differently from rewards at late steps, yet all of them are pooled into the same group-level statistics.
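For contrast, here is a minimal sketch of that group-wide pooling, assuming a GRPO/GDPO-style z-score taken over all valid steps of all samples in the group. The array layout (padded `[group, step]` rewards plus a boolean mask), the function name, and `eps` are illustrative assumptions, not details from the GDPO paper.

```python
import numpy as np

def global_group_advantages(process_rewards: np.ndarray,
                            step_mask: np.ndarray,
                            eps: float = 1e-8) -> np.ndarray:
    """Baseline: one mean/std shared by every step of every sample in the group.

    process_rewards : [G, T] step-level rewards, zero-padded past each chain's end.
    step_mask       : [G, T] True where a sample actually has that step.
    """
    adv = np.zeros_like(process_rewards, dtype=np.float64)
    pooled = process_rewards[step_mask]                  # flatten every valid step
    adv[step_mask] = (pooled - pooled.mean()) / (pooled.std() + eps)
    return adv   # early-step signal is scaled by statistics dominated by late steps
```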

Proposed Method

Extend GDPO to decouple normalization not just by reward *type*, but also by *temporal stage*, i.e. the step index within the Chain-of-Thought. Each step-level reward is normalized against the distribution of rewards at that same step index across the group, rather than against a single global statistic. This prevents high-variance final-step rewards from washing out subtle logic improvements in early reasoning steps.
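A minimal sketch of the proposed temporal decoupling, under the same assumptions as above (padded `[group, step]` process rewards, a validity mask, z-score normalization). Names such as `temporal_group_advantages` are hypothetical and not taken from the source paper.

```python
import numpy as np

def temporal_group_advantages(process_rewards: np.ndarray,
                              step_mask: np.ndarray,
                              outcome_rewards: np.ndarray,
                              eps: float = 1e-8):
    """Decouple normalization by reward type AND by step index.

    process_rewards : [G, T] step-level PRM scores, zero-padded past each chain's end.
    step_mask       : [G, T] True where a sample actually has that step.
    outcome_rewards : [G]    sparse final-answer rewards.
    Returns per-step advantages [G, T] and outcome advantages [G].
    """
    G, T = process_rewards.shape
    proc_adv = np.zeros_like(process_rewards, dtype=np.float64)

    # Normalize step t only against the group's rewards at that same step,
    # so a noisy final step cannot wash out small differences in early steps.
    for t in range(T):
        valid = step_mask[:, t]
        if valid.sum() < 2:                      # too few samples to estimate a std
            continue
        r_t = process_rewards[valid, t]
        proc_adv[valid, t] = (r_t - r_t.mean()) / (r_t.std() + eps)

    # Outcome rewards keep the usual group-level normalization.
    out_adv = (outcome_rewards - outcome_rewards.mean()) / (outcome_rewards.std() + eps)
    return proc_adv, out_adv
```

In a full training loop these advantages would then be broadcast onto the tokens of the corresponding step and combined with the outcome advantage under the usual clipped policy-gradient objective; the only change relative to the baseline is that each step's mean and standard deviation come from that step's own column of the group.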

Expected Contribution

Significant improvements in training stability for long-horizon reasoning tasks (math, coding), with the agent learning robust opening strategies and intermediate logic steps more reliably instead of optimizing only for the final answer.

Required Resources

Datasets with step-level annotations (e.g., PRM800K), RL training framework supporting token-level rewards, and moderate compute resources.

Source Paper

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
