
Iterative distillation of the final BPO-aligned policy into lower internal policies allows for significant depth pruning (model compression) with minimal loss in reasoning accuracy.

Feasibility: 6 Novelty: 9

Motivation

BPO aligns lower layers to the task objective, but the final-layer policy typically remains the strongest. By treating the final BPO policy as a 'teacher' and the internal BPO policies as 'students' in a recursive distillation loop, we can compress reasoning capability into earlier layers, effectively allowing the top layers of the model to be removed for faster inference.
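To make the teacher-student framing concrete, here is a minimal PyTorch sketch of the distillation loss. It rests on two illustrative assumptions that go beyond the source: the internal policy is read off an intermediate hidden state with the shared LM head (a logit-lens-style readout, not necessarily the BPO paper's exact parameterization), and the model follows HuggingFace LLaMA-style conventions (`model.model.norm`, `get_output_embeddings`). The function name `internal_policy_kl_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F

def internal_policy_kl_loss(student_model, teacher_model, input_ids,
                            student_layer: int, temperature: float = 1.0):
    """KL(teacher || student) between the teacher's final-layer policy and
    the student's internal policy at `student_layer` (illustrative sketch)."""
    # Teacher: final-layer next-token distribution from a frozen reference model.
    with torch.no_grad():
        teacher_logits = teacher_model(input_ids).logits / temperature

    # Student: internal policy read off an intermediate hidden state with the
    # shared LM head (logit-lens-style readout; an assumption, not the exact
    # BPO parameterization).
    hidden = student_model(input_ids, output_hidden_states=True).hidden_states[student_layer]
    hidden = student_model.model.norm(hidden)  # LLaMA-style final norm before the head
    student_logits = student_model.get_output_embeddings()(hidden) / temperature

    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
```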

Proposed Method

1. Initialize with a BPO-trained model.
2. Freeze the final-layer policy.
3. Optimize the internal policy at layer L-k to minimize KL divergence from the final-layer policy on a reasoning corpus.
4. Once converged, remove layers L-k+1 through L.
5. Repeat for lower layers (see the loop sketch after this list).
6. Compare the accuracy-latency trade-off of this 'Internal Policy Distillation' against standard quantization and pruning methods.
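A rough sketch of the distill-then-prune loop in steps 2-5, reusing the `internal_policy_kl_loss` helper above. It assumes a LLaMA-style decoder whose blocks live in `model.model.layers`; "freeze the final-layer policy" is interpreted here as snapshotting a frozen copy of the current model as the teacher each round (one simple reading, since the internal and final policies share lower layers), and all hyperparameters (`k`, `rounds`, learning rate) are placeholders rather than validated settings.

```python
import copy
import torch

def distill_and_prune(model, dataloader, k: int = 4, rounds: int = 3,
                      steps_per_round: int = 1000, lr: float = 1e-5):
    for _ in range(rounds):
        num_layers = len(model.model.layers)
        student_layer = num_layers - k  # internal policy at layer L-k is the student

        # Steps 2-3: snapshot the current model as a frozen teacher, then train
        # the student to match the teacher's final-layer policy via KL on the
        # reasoning corpus.
        teacher = copy.deepcopy(model).eval()
        for p in teacher.parameters():
            p.requires_grad_(False)

        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        for step, batch in enumerate(dataloader):
            loss = internal_policy_kl_loss(model, teacher, batch["input_ids"], student_layer)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if step + 1 >= steps_per_round:
                break

        # Step 4: drop layers L-k+1 .. L; the distilled internal policy at L-k
        # becomes the new final-layer policy for the next round (step 5).
        model.model.layers = torch.nn.ModuleList(list(model.model.layers)[:student_layer])
    return model
```

One design note under these assumptions: because the student readout applies the same final norm and LM head that the pruned model will use, the truncated model's forward pass produces exactly the distribution that was distilled in the preceding round.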

Expected Contribution

A method for converting 'deep' reasoning models into 'shallow', fast-inference models by explicitly pushing reasoning logic down the stack via internal policies.

Required Resources

High-end compute for iterative training/distillation cycles, large reasoning datasets (math/code).

Source Paper

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
