
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

7.35 · 2601.05242 · 2026-01-08

Authors

Shih-Yang Liu; Xin Dong; Ximing Lu; Shizhe Diao; Peter Belcak; Mingjie Liu; Min-Hung Chen; Hongxu Yin; Yu-Chiang Frank Wang; Kwang-Ting Cheng; Yejin Choi; Jan Kautz; Pavlo Molchanov

Scores

Novelty: 6.3
Technical: 8.0
Transferability: 8.0
Momentum: 9.0
Evidence: 7.0
Breakthrough: 5.7

Rationale

This paper identifies and fixes a subtle but critical mathematical flaw in Group Relative Policy Optimization (GRPO): aggregating multiple rewards before group normalization collapses the individual reward signals into a single, muddled advantage. As the industry pivots aggressively toward RL-based reasoning (DeepSeek-R1 style) and agentic workflows that require multi-objective optimization (correctness, format, safety), GDPO offers an immediately applicable, high-impact refinement to standard training pipelines. While it represents an optimization fix rather than a fundamental paradigm shift, its relevance to the current bottleneck of stabilizing multi-reward RL ensures high utility across coding, math, and tool-use domains.
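A minimal NumPy sketch of the contrast the rationale describes: a GRPO-style baseline that sums reward components before group normalization, versus a decoupled variant in the spirit of GDPO that z-scores each reward component across the group before combining. The function names, the 1e-8 stabilizer, and the equal-weight combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def grpo_advantages(reward_components):
    """GRPO-style: sum the reward components per rollout, then z-score
    the aggregated totals within the group."""
    r = np.asarray(reward_components, dtype=float)
    total = r.sum(axis=1)                      # aggregate rewards first
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(reward_components, weights=None):
    """Decoupled variant: z-score each reward component across the group
    separately, then combine the normalized signals (equal weights here
    are an illustrative assumption)."""
    r = np.asarray(reward_components, dtype=float)
    k = r.shape[1]
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    per_reward = (r - r.mean(axis=0)) / (r.std(axis=0) + 1e-8)
    return per_reward @ w

# Toy group of 4 rollouts with 2 reward signals: correctness (0/1) and a
# near-constant format score. After aggregation, the group z-score is
# dominated by the high-variance correctness reward, so the format signal
# is largely washed out; per-reward normalization keeps both signals alive.
rewards = [
    [1.0, 0.9],   # correct, slightly worse format
    [0.0, 1.0],   # wrong, good format
    [1.0, 1.0],   # correct, good format
    [0.0, 0.9],   # wrong, slightly worse format
]
print("aggregated-then-normalized:", grpo_advantages(rewards))
print("normalized-then-combined:  ", gdpo_advantages(rewards))
```

In this toy group the decoupled version distinguishes rollouts that differ only in format, whereas the aggregated version ranks them almost entirely by correctness, which is the kind of per-reward signal loss the rationale attributes to normalizing after aggregation.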