
Pareto-GDPO: Using decoupled reward statistics to drive gradient projection (PCGrad) rather than scalar aggregation.

Feasibility: 6 Novelty: 9

Motivation

GDPO solves the normalization issue but still relies on a linear combination (summation) of the rewards for the final update. This implicitly assumes that the gradient directions of the different rewards (e.g., Safety vs. Helpfulness) do not conflict. In practice, maximizing one often hurts the other (the 'alignment tax'), and simple summation can produce zigzagging optimization or sub-optimal compromises.

Proposed Method

Instead of summing the normalized rewards into a single scalar advantage, treat the decoupled, normalized advantages as distinct objective vectors. Apply gradient projection (as in PCGrad) or Multi-Objective Gradient Descent (MOGD) at the advantage level: if the gradient of Reward A conflicts with that of Reward B (negative cosine similarity), project the gradient of A onto the normal plane of B's gradient. This reduces destructive interference between objectives, so an update for one reward is less likely to come at the direct expense of another. A sketch of the projection step follows below.
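
A minimal sketch of the PCGrad-style projection, assuming PyTorch and that each reward's policy gradient has already been flattened into a single vector; the function and variable names here are illustrative, not taken from GDPO or any existing PCGrad implementation:

```python
import torch

def pcgrad_combine(per_reward_grads):
    """Combine per-reward policy gradients with a PCGrad-style projection.

    per_reward_grads: list of flattened 1-D gradient tensors, one per reward,
    each computed from that reward's decoupled, normalized advantage.
    Returns a single combined gradient vector for the policy update.
    """
    projected = []
    for i, g_i in enumerate(per_reward_grads):
        g = g_i.clone()
        for j, g_j in enumerate(per_reward_grads):
            if i == j:
                continue
            dot = torch.dot(g, g_j)
            if dot < 0:
                # Conflicting directions (negative cosine similarity):
                # remove the component of g that points against g_j,
                # i.e. project g onto the normal plane of g_j.
                g = g - (dot / (g_j.norm() ** 2 + 1e-12)) * g_j
        projected.append(g)
    # Sum the de-conflicted per-reward gradients into one update direction.
    return torch.stack(projected).sum(dim=0)


# Hypothetical usage (policy, safety_loss, helpfulness_loss are placeholders,
# each surrogate loss built from its own decoupled normalized advantage):
# params = [p for p in policy.parameters() if p.requires_grad]
# grads = []
# for loss in (safety_loss, helpfulness_loss):
#     g = torch.autograd.grad(loss, params, retain_graph=True)
#     grads.append(torch.cat([x.reshape(-1) for x in g]))
# update_direction = pcgrad_combine(grads)
```

Note the cost implied by this sketch: it requires one backward pass per reward to obtain separate gradients, rather than a single backward pass on a scalarized advantage.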

Expected Contribution

A method that pushes policies toward the Pareto frontier in multi-reward RLHF, yielding models that are safer without sacrificing reasoning capability and addressing the 'alignment tax' more effectively than scalarization.

Required Resources

High-end GPU clusters (gradient projection adds computational overhead, since each reward requires its own backward pass), access to aligned LLM base models, and evaluation benchmarks covering both safety and capability.

Source Paper

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
