Iteratively distilling the final BPO-aligned policy into the internal policies of lower layers enables substantial depth pruning (model compression) with minimal loss of reasoning accuracy.
Motivation
BPO aligns lower layers to the task objective, but the final layer typically remains the strongest. By treating the final BPO policy as a 'teacher' and the internal BPO policies as 'students' in a recursive distillation loop, reasoning capability can be compressed into earlier layers, allowing the top layers of the model to be removed for faster inference.
Proposed Method
1. Initialize with a BPO-trained model.
2. Freeze the final-layer policy.
3. Optimize the internal policy at layer L-k to minimize its KL divergence from the final-layer policy on a reasoning corpus (a code sketch follows this list).
4. Once converged, remove layers L-k+1 through L.
5. Repeat for lower layers.
6. Compare the accuracy-latency trade-off of this 'Internal Policy Distillation' against standard quantization and pruning baselines.
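The sketch below illustrates one distillation step covering steps 2-4. It assumes (not stated in the source paper) a LLaMA-style decoder in Hugging Face Transformers (module names `model.model.layers`, `model.model.norm`, `model.lm_head`), that the internal policy at layer L-k is read out logit-lens style through the final norm and the shared LM head, and a hypothetical checkpoint name `bpo-aligned-model`.

```python
# Minimal sketch of one "Internal Policy Distillation" step, under the
# assumptions stated above; not the source paper's implementation.
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bpo-aligned-model"  # hypothetical BPO-trained checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

# Step 2: freeze the final-layer (teacher) policy by keeping a frozen copy.
teacher = copy.deepcopy(model).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

num_layers = model.config.num_hidden_layers
k = 4                           # number of top layers to distil away (example value)
student_layer = num_layers - k  # layer L-k

# Step 3: train only the parameters that survive pruning (embeddings + lower layers).
trainable = list(model.model.embed_tokens.parameters())
for layer in model.model.layers[:student_layer]:
    trainable += list(layer.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

def distill_step(batch_texts):
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        teacher_logits = teacher(**inputs).logits  # frozen final-layer policy

    hidden = model(**inputs, output_hidden_states=True).hidden_states
    # hidden[0] is the embedding output; hidden[i] is the output of layer i.
    student_hidden = model.model.norm(hidden[student_layer])
    student_logits = model.lm_head(student_hidden)  # internal policy at layer L-k

    # Minimize KL(final-layer policy || internal policy).
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Step 4: after convergence, drop layers L-k+1 .. L so inference stops at the
# distilled internal policy.
model.model.layers = model.model.layers[:student_layer]
model.config.num_hidden_layers = student_layer
```

For the accuracy-latency comparison in step 6, the truncated model can be benchmarked on the same reasoning corpus against quantized or pruned baselines of the original depth.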
Expected Contribution
A method for converting 'deep' reasoning models into 'shallow', fast-inference models by explicitly pushing reasoning logic down the layer stack via internal policies.
Required Resources
High-end compute for iterative training/distillation cycles, large reasoning datasets (math/code).
Source Paper
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies