
Contrastive decoding between BPO-optimized internal layer policies and the final layer policy will significantly outperform standard decoding on reasoning tasks by suppressing superficial token correlations.

Feasibility: 9 Novelty: 8

Motivation

Contrastive Decoding (CD) typically requires a separate 'amateur' model to penalize generic text patterns. Since BPO explicitly trains internal layers to function as coherent policies (which are naturally weaker than the full model), these internal layers can serve as an intrinsic 'amateur' for CD, with no external model and no additional memory overhead.

Proposed Method

1. Fine-tune a standard LLM (e.g., Llama-3-8B) with the Bottom-up Policy Optimization (BPO) objective.
2. During inference, project the hidden states of an intermediate layer (e.g., layer N/2) to the vocabulary to obtain 'internal logits'.
3. Apply the Contrastive Decoding combination: Logits_new = (1 + alpha) * Logits_final - alpha * Logits_internal (see the sketch after this list).
4. Evaluate on reasoning benchmarks (GSM8K, HellaSwag) against standard decoding and against standard CD with a smaller external model.
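A minimal single-step decoding sketch of steps 2-3, assuming a Hugging Face Llama-style checkpoint (in practice the BPO-fine-tuned model), greedy selection, and an early exit that reuses the model's final RMSNorm and lm_head to project the intermediate hidden state (a DoLa-style choice, not something the source paper specifies). The model name, alpha value, amateur layer index, and prompt are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # stand-in; use the BPO-fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

alpha = 0.5                                          # contrastive strength (illustrative)
amateur_layer = model.config.num_hidden_layers // 2  # internal 'amateur' policy, e.g. layer N/2

@torch.no_grad()
def contrastive_step(input_ids):
    out = model(input_ids, output_hidden_states=True)
    # Expert logits: the final-layer prediction for the last position.
    logits_final = out.logits[:, -1, :]
    # Amateur logits: early-exit the intermediate hidden state through the final
    # RMSNorm and the unembedding (assumed to yield meaningful 'internal logits').
    h_internal = out.hidden_states[amateur_layer][:, -1, :]
    logits_internal = model.lm_head(model.model.norm(h_internal))
    # Contrastive Decoding combination from step 3:
    # Logits_new = (1 + alpha) * Logits_final - alpha * Logits_internal
    logits_new = (1 + alpha) * logits_final - alpha * logits_internal
    return logits_new.argmax(dim=-1, keepdim=True)

prompt = "Q: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total? A:"
ids = tok(prompt, return_tensors="pt").input_ids
for _ in range(64):
    next_id = contrastive_step(ids)
    ids = torch.cat([ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:
        break
print(tok.decode(ids[0], skip_special_tokens=True))
```

In practice, alpha and the amateur layer index would be tuned on a validation split, and an adaptive plausibility constraint (as in the original CD paper) could be applied on top of the combined logits.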

Expected Contribution

A parameter-efficient inference strategy that leverages the BPO training structure to boost reasoning performance, without the extra inference latency or VRAM of running a separate amateur model that standard CD requires.

Required Resources

Access to Llama-series models, GPU compute for BPO fine-tuning (e.g., 4x A100s), and standard reasoning evaluation datasets.

Source Paper

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
