Dynamic activation steering, triggered by an uncertainty-based classifier, can suppress prompt-induced hallucinations without the general performance degradation caused by permanent head ablation.
Motivation
Permanent ablation is a blunt instrument that removes model capacity outright. A more precise approach is to detect when the model is relying on the 'copying mechanism' rather than on legitimate visual reasoning, and to apply an opposing steering vector only at those specific inference steps.
Proposed Method
Train a lightweight linear probe on the residual-stream activations of the identified 'hallucination heads' to distinguish 'visual-reliance' from 'prompt-reliance' states. At inference time, add a steering vector (computed to oppose the prompt-reliance direction) to the activations only when the probe detects high prompt-reliance despite conflicting visual evidence; a sketch of this pipeline is given below.
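The following is a minimal sketch of the probe-plus-steering loop, assuming a PyTorch VLM whose target transformer block can be reached with a forward hook. The names RelianceProbe, make_steering_hook, TARGET_LAYER, the 0.8 threshold, the steering strength alpha, and the model.language_model.model.layers path are illustrative assumptions, not the source paper's implementation; the steering direction here is a simple difference of class means, one common choice among several.

```python
import torch
import torch.nn as nn


class RelianceProbe(nn.Module):
    """Linear probe mapping a residual-stream activation to P(prompt-reliance)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)


def train_probe(probe, acts, labels, epochs=50, lr=1e-3):
    """acts: [N, d_model] activations cached at the target layer;
    labels: [N] with 1 = prompt-reliance, 0 = visual-reliance."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(acts), labels.float()).backward()
        opt.step()
    return probe


def make_steering_hook(probe, steer_vec, threshold=0.8, alpha=4.0):
    """Forward hook for the block hosting the 'hallucination heads': when the
    probe flags high prompt-reliance on the current token, subtract the
    (normalized) prompt-reliance direction from its residual-stream state."""
    direction = steer_vec / steer_vec.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # [B, T, d_model]
        with torch.no_grad():
            # Probe the token currently being generated (probe kept in fp32).
            p_prompt = probe(hidden[:, -1, :].float())
            gate = (p_prompt > threshold).to(hidden.dtype)            # [B]; 1 where we intervene
            hidden[:, -1, :] -= alpha * gate.unsqueeze(-1) * direction.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook


# Illustrative usage (layer path and index depend on the specific VLM):
# acts, labels = cache_residual_activations(model, conflict_dataset, layer=TARGET_LAYER)
# probe = train_probe(RelianceProbe(acts.shape[-1]), acts, labels)
# steer_vec = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)  # prompt-reliance direction
# layer = model.language_model.model.layers[TARGET_LAYER]
# handle = layer.register_forward_hook(make_steering_hook(probe, steer_vec))
# ...generate...; handle.remove() restores the unmodified model.
```

The difference-of-means steering direction and the fixed threshold are placeholders; in practice both would be tuned on the held-out conflicting vs. consistent image-text pairs listed under Required Resources.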
Expected Contribution
A precise inference-time control method that reduces hallucination rates while preserving the model's ability to follow complex text instructions when no visual conflict exists.
Required Resources
High-end GPUs for inference and steering experiments, and a dataset of conflicting vs. consistent image-text pairs.
Source Paper
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models