Signal-to-Noise Weighted GDPO: Dynamically scaling decoupled reward components based on their training stability to prevent 'solved' objectives from injecting noise.
Motivation
While GDPO fixes the magnitude imbalance between rewards, it treats all normalized rewards as equally informative throughout training. However, 'easy' objectives (e.g., output formatting) are often mastered early. Once such an objective saturates, its rewards become nearly constant, so normalization divides small residual fluctuations by a near-zero standard deviation; continuing to normalize and sum these components effectively injects high-variance noise into the gradient, potentially destabilizing the learning of harder objectives (e.g., logical reasoning).
Proposed Method
Implement a dynamic weighting mechanism in which the coefficient of each decoupled reward component reflects how informative that component still is: the weight decays as the component's reward signal becomes stationary, e.g., as its running variance collapses toward zero. Concretely, maintain a per-component stability signal, such as the KL divergence of the policy across successive updates attributable to that component, or the running variance of its raw rewards; as the policy stabilizes on the 'format' reward, decay its weight in the aggregate sum so the optimizer can focus on the 'reasoning' reward signal (see the sketch below).
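A minimal sketch of this weighting, assuming a GDPO/GRPO-style setup in which each reward component is normalized within a rollout group before aggregation. The class name SNRComponentWeighter, the EMA-based running-std tracking, and the clipping thresholds are illustrative assumptions rather than part of GDPO itself; a KL-based stationarity signal could be substituted for the variance proxy.

```python
import numpy as np

class SNRComponentWeighter:
    """Hypothetical sketch: per-component weights that decay as a reward
    component's group-level variance collapses (i.e., the objective saturates).

    Assumes GDPO-style decoupling: each reward component is normalized
    within its rollout group before the weighted sum.
    """

    def __init__(self, component_names, ema_decay=0.99, floor=1e-3, min_weight=0.05):
        self.names = list(component_names)
        self.ema_decay = ema_decay      # smoothing for the running std estimate
        self.floor = floor              # treat std below this as effectively zero
        self.min_weight = min_weight    # never fully silence a component
        self.running_std = {n: None for n in self.names}
        self.initial_std = {n: None for n in self.names}

    def update(self, raw_rewards):
        """raw_rewards: dict name -> 1-D array of per-sample raw rewards
        for one rollout group. Updates the running std per component."""
        for name in self.names:
            std = float(np.std(raw_rewards[name]))
            if self.running_std[name] is None:
                self.running_std[name] = std
                self.initial_std[name] = max(std, self.floor)
            else:
                self.running_std[name] = (
                    self.ema_decay * self.running_std[name]
                    + (1.0 - self.ema_decay) * std
                )

    def weights(self):
        """Weight = current running std relative to its early-training level,
        clipped to [min_weight, 1]. A saturated component (std -> 0) decays
        toward min_weight instead of injecting amplified noise."""
        w = {}
        for name in self.names:
            cur = self.running_std[name] or 0.0
            ratio = cur / self.initial_std[name] if self.initial_std[name] else 0.0
            w[name] = float(np.clip(ratio, self.min_weight, 1.0))
        return w

    def aggregate(self, raw_rewards):
        """Normalize each component within the group (GDPO-style),
        then sum with the SNR-derived weights."""
        w = self.weights()
        total = np.zeros_like(next(iter(raw_rewards.values())), dtype=np.float64)
        for name in self.names:
            r = np.asarray(raw_rewards[name], dtype=np.float64)
            normed = (r - r.mean()) / (r.std() + 1e-6)
            total += w[name] * normed
        return total


# Usage on one rollout group of 8 samples:
weighter = SNRComponentWeighter(["format", "reasoning"])
group = {
    "format": np.array([1, 1, 1, 1, 1, 1, 1, 0], dtype=np.float64),     # nearly solved
    "reasoning": np.array([0, 1, 0, 0, 1, 1, 0, 1], dtype=np.float64),  # still informative
}
weighter.update(group)
advantages = weighter.aggregate(group)
```

Scaling by the ratio to an early-training baseline (rather than the raw std) keeps components on comparable scales, and the min_weight floor prevents any component from being silenced permanently, so the policy cannot quietly regress on an objective it has already mastered.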
Expected Contribution
A more sample-efficient training pipeline for multi-reward LLM post-training that automatically shifts focus from easy objectives (e.g., syntax and safety) to complex reasoning without manual curriculum scheduling.
Required Resources
Standard RLHF/RL training setup (e.g., vLLM, DeepSpeed), multi-reward datasets (like GSM8K with added format constraints), and GPU compute for experimentation.
Source Paper
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization