GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Authors
Shih-Yang Liu; Xin Dong; Ximing Lu; Shizhe Diao; Peter Belcak; Mingjie Liu; Min-Hung Chen; Hongxu Yin; Yu-Chiang Frank Wang; Kwang-Ting Cheng; Yejin Choi; Jan Kautz; Pavlo Molchanov
Scores
Rationale
This paper identifies and fixes a subtle but critical mathematical flaw in Group Relative Policy Optimization (GRPO): aggregating multiple rewards before group normalization collapses the individual reward signals. As the industry pivots aggressively toward RL-based reasoning (in the style of DeepSeek-R1) and agentic workflows that require multi-objective optimization (correctness, format, safety), GDPO offers an immediately applicable, high-impact refinement to standard training pipelines. While it is an optimization fix rather than a fundamental paradigm shift, its relevance to the current bottleneck of stabilizing multi-reward RL makes it broadly useful across coding, math, and tool-use domains.
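To make the flaw concrete, below is a minimal numerical sketch, not the paper's exact formulation: `grpo_advantages` mirrors GRPO by summing reward components per rollout and normalizing the aggregate within the group, while `gdpo_advantages` illustrates decoupled per-reward normalization, as the method's name suggests. The function names, reward values, and epsilon are illustrative assumptions.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style: sum reward components per rollout, then normalize
    the aggregated reward across the group of size G. rewards: (G, K)."""
    total = rewards.sum(axis=1)                        # (G,)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Decoupled-normalization sketch: normalize each reward component
    across the group first, then aggregate the normalized signals."""
    mean = rewards.mean(axis=0, keepdims=True)         # per-component mean
    std = rewards.std(axis=0, keepdims=True) + 1e-8    # per-component std
    return ((rewards - mean) / std).sum(axis=1)        # (G,)

# Toy group of 4 rollouts scored on (correctness, format). The
# correctness reward has a much larger scale than the format reward.
rewards = np.array([
    [10.0, 0.0],
    [10.0, 1.0],
    [ 0.0, 0.0],
    [ 0.0, 1.0],
])
print(grpo_advantages(rewards))  # ~[ 0.90, 1.09, -1.09, -0.90]
print(gdpo_advantages(rewards))  # [ 0.0, 2.0, -2.0, 0.0]
```

In this toy group, the format reward shifts the GRPO advantage by only about 0.2 while correctness shifts it by about 2, so the smaller-scale signal is nearly washed out; per-component normalization weights both signals equally. The actual GDPO objective may weight, clip, or combine the normalized components differently than this sketch.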