High divergence between the token distributions of BPO-trained internal policies and the final policy acts as a reliable unsupervised predictor of hallucination and sycophancy.
Motivation
Prior work suggests that LLMs often encode factual knowledge in their lower and middle layers but can override it in later layers as a result of RLHF or instruction tuning (a driver of sycophancy). Because BPO trains internal layers to be directly predictive, the disagreement (measured as KL divergence) between a knowledge-bearing internal policy and a potentially biased final policy should signal when the model is drifting from its encoded knowledge toward hallucination.
Proposed Method
1. Train a model using BPO.
2. Run inference on datasets known to induce hallucinations (e.g., TruthfulQA) or sycophancy.
3. Measure the per-token KL divergence between the internal policy at a middle layer K and the final policy at layer N.
4. Train a lightweight classifier on these divergence scores to predict whether each generated token is factual or hallucinated (a sketch of steps 3-4 appears below).
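The following is a minimal sketch of steps 3-4 under stated assumptions, not the paper's implementation. It assumes the BPO checkpoint loads with Hugging Face `transformers` and that the internal policy at layer K can be read out by applying the model's LM head to that layer's hidden states (a logit-lens-style stand-in, since the exact BPO internal-policy parameterization is not specified here). `MODEL_NAME`, `LAYER_K`, `per_token_kl`, and `train_divergence_classifier` are hypothetical names introduced for illustration.

```python
# Sketch: per-token KL divergence between a middle-layer "internal policy"
# and the final-layer policy, used as a hallucination-risk feature.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "path/to/bpo-checkpoint"  # hypothetical BPO checkpoint path
LAYER_K = 16                           # middle layer treated as the internal policy

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


@torch.no_grad()
def per_token_kl(prompt: str) -> torch.Tensor:
    """Return KL(internal_policy_K || final_policy_N) at each token position."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs)

    # Final-layer policy: the model's ordinary next-token distribution.
    final_logits = outputs.logits                      # [1, seq_len, vocab]

    # Internal policy at layer K: project hidden states through the LM head.
    # (Some architectures also need the final layer norm applied first.)
    hidden_k = outputs.hidden_states[LAYER_K]          # [1, seq_len, d_model]
    internal_logits = model.lm_head(hidden_k)          # [1, seq_len, vocab]

    log_p_internal = F.log_softmax(internal_logits, dim=-1)
    log_p_final = F.log_softmax(final_logits, dim=-1)

    # KL(P_internal || P_final), summed over the vocabulary, per position.
    kl = (log_p_internal.exp() * (log_p_internal - log_p_final)).sum(dim=-1)
    return kl.squeeze(0)                               # [seq_len]


def train_divergence_classifier(prompts, labels):
    """Fit a lightweight classifier on summary statistics of the KL trace.

    `labels` are 1 for hallucinated/sycophantic responses and 0 for factual
    ones; in practice they would come from TruthfulQA-style annotations.
    """
    features = []
    for prompt in prompts:
        kl = per_token_kl(prompt)
        features.append([kl.mean().item(), kl.max().item(), kl[-1].item()])
    return LogisticRegression().fit(features, labels)
```

The design choice here is deliberately simple: the per-token KL trace is collapsed into a few summary features (mean, max, last-token divergence) so that the classifier stays lightweight and the divergence signal itself, rather than the classifier, carries the predictive burden.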
Expected Contribution
A novel, white-box uncertainty estimation method that exploits the layer-wise policies of BPO-trained models to detect hallucinations in real time.
Required Resources
Pre-trained BPO models, hallucination/sycophancy datasets, and analysis scripts for activation/logit comparison.
Source Paper
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies