High divergence between the token distributions of BPO-trained internal policies and the final policy acts as a reliable unsupervised predictor of hallucination and sycophancy.
Motivation
Prior work suggests that LLMs often encode factual knowledge in their lower and middle layers but can override it in later layers as a result of RLHF or instruction tuning (a driver of sycophancy). Because BPO trains internal layers to be directly predictive, the disagreement (measured as KL divergence) between a knowledge-bearing internal policy and a potentially biased final policy should signal when the model is drifting from its encoded knowledge toward hallucination.
Proposed Method
1. Train a model using BPO.
2. Run inference on datasets known to induce hallucinations (e.g., TruthfulQA) or sycophancy.
3. Measure the per-token KL divergence between the internal policy at a middle layer K and the final policy at layer N.
4. Train a lightweight classifier on these divergence scores to predict whether each generated token is factual or hallucinated (a sketch of steps 3-4 appears below).
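The following is a minimal sketch of steps 3-4 under stated assumptions, not the paper's implementation. It assumes the BPO checkpoint loads with Hugging Face `transformers` and that the internal policy at layer K can be read out by applying the model's LM head to that layer's hidden states (a logit-lens-style stand-in, since the exact BPO internal-policy parameterization is not specified here). `MODEL_NAME`, `LAYER_K`, `per_token_kl`, and `train_divergence_classifier` are hypothetical names introduced for illustration.

```python
# Sketch: per-token KL divergence between a middle-layer "internal policy"
# and the final-layer policy, used as a hallucination-risk feature.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "path/to/bpo-checkpoint"  # hypothetical BPO checkpoint path
LAYER_K = 16                           # middle layer treated as the internal policy

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


@torch.no_grad()
def per_token_kl(prompt: str) -> torch.Tensor:
    """Return KL(internal_policy_K || final_policy_N) at each token position."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs)

    # Final-layer policy: the model's ordinary next-token distribution.
    final_logits = outputs.logits                      # [1, seq_len, vocab]

    # Internal policy at layer K: project hidden states through the LM head.
    # (Some architectures also need the final layer norm applied first.)
    hidden_k = outputs.hidden_states[LAYER_K]          # [1, seq_len, d_model]
    internal_logits = model.lm_head(hidden_k)          # [1, seq_len, vocab]

    log_p_internal = F.log_softmax(internal_logits, dim=-1)
    log_p_final = F.log_softmax(final_logits, dim=-1)

    # KL(P_internal || P_final), summed over the vocabulary, per position.
    kl = (log_p_internal.exp() * (log_p_internal - log_p_final)).sum(dim=-1)
    return kl.squeeze(0)                               # [seq_len]


def train_divergence_classifier(prompts, labels):
    """Fit a lightweight classifier on summary statistics of the KL trace.

    `labels` are 1 for hallucinated/sycophantic responses and 0 for factual
    ones; in practice they would come from TruthfulQA-style annotations.
    """
    features = []
    for prompt in prompts:
        kl = per_token_kl(prompt)
        features.append([kl.mean().item(), kl.max().item(), kl[-1].item()])
    return LogisticRegression().fit(features, labels)
```

The design choice here is deliberately simple: the per-token KL trace is collapsed into a few summary features (mean, max, last-token divergence) so that the classifier stays lightweight and the divergence signal itself, rather than the classifier, carries the predictive burden.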
Expected Contribution
A novel, white-box uncertainty estimation method that exploits the layer-wise policies of BPO-trained models to detect hallucinations in real time.
Required Resources
Pre-trained BPO models, hallucination/sycophancy datasets, and analysis scripts for activation/logit comparison.
Source Paper
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies