
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

7.35 · 2601.05242 · 2026-01-08

Authors

Shih-Yang Liu; Xin Dong; Ximing Lu; Shizhe Diao; Peter Belcak; Mingjie Liu; Min-Hung Chen; Hongxu Yin; Yu-Chiang Frank Wang; Kwang-Ting Cheng; Yejin Choi; Jan Kautz; Pavlo Molchanov

Scores

Novelty: 6.3
Technical: 8.0
Transferability: 8.0
Momentum: 9.0
Evidence: 7.0
Breakthrough: 5.7

Rationale

This paper identifies and fixes a subtle but critical mathematical flaw in Group Relative Policy Optimization (GRPO): aggregating multiple rewards before group normalization collapses the individual reward signals into a single, muddled advantage. As the industry pivots aggressively toward RL-based reasoning (DeepSeek-R1 style) and agentic workflows that require multi-objective optimization (correctness, format, safety), GDPO offers an immediately applicable, high-impact refinement to standard training pipelines. While it represents an optimization fix rather than a fundamental paradigm shift, its relevance to the current bottleneck of stabilizing multi-reward RL ensures high utility across coding, math, and tool-use domains.
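A minimal NumPy sketch of the contrast the rationale describes: a GRPO-style baseline that sums reward components before group normalization, versus a decoupled variant in the spirit of GDPO that z-scores each reward component across the group before combining. The function names, the 1e-8 stabilizer, and the equal-weight combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def grpo_advantages(reward_components):
    """GRPO-style: sum the reward components per rollout, then z-score
    the aggregated totals within the group."""
    r = np.asarray(reward_components, dtype=float)
    total = r.sum(axis=1)                      # aggregate rewards first
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(reward_components, weights=None):
    """Decoupled variant: z-score each reward component across the group
    separately, then combine the normalized signals (equal weights here
    are an illustrative assumption)."""
    r = np.asarray(reward_components, dtype=float)
    k = r.shape[1]
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    per_reward = (r - r.mean(axis=0)) / (r.std(axis=0) + 1e-8)
    return per_reward @ w

# Toy group of 4 rollouts with 2 reward signals: correctness (0/1) and a
# near-constant format score. After aggregation, the group z-score is
# dominated by the high-variance correctness reward, so the format signal
# is largely washed out; per-reward normalization keeps both signals alive.
rewards = [
    [1.0, 0.9],   # correct, slightly worse format
    [0.0, 1.0],   # wrong, good format
    [1.0, 1.0],   # correct, good format
    [0.0, 0.9],   # wrong, slightly worse format
]
print("aggregated-then-normalized:", grpo_advantages(rewards))
print("normalized-then-combined:  ", gdpo_advantages(rewards))
```

In this toy group the decoupled version distinguishes rollouts that differ only in format, whereas the aggregated version ranks them almost entirely by correctness, which is the kind of per-reward signal loss the rationale attributes to normalizing after aggregation.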