The attention heads responsible for prompt-induced hallucination are polysemantic and share circuitry with the model's Optical Character Recognition (OCR) capability, so statically ablating them will degrade performance on text-rich images.
Motivation
The original paper suggests ablating 'copying heads' to reduce hallucination. However, mechanisms that attend strongly to text features, whether in the prompt or in the image, are likely reused across tasks. If these same heads are crucial for reading text within images (OCR), ablating them constitutes a harmful trade-off rather than a clean fix.
Proposed Method
First, replicate the identification of 'copying heads' using the object-counting task. Second, evaluate the model's performance on OCR benchmarks (e.g., TextVQA, OCRBench) before and after ablating those specific heads; a head-ablation sketch follows below. Third, analyze attention maps to test whether these heads shift attention from prompt tokens to tokens of text rendered in the image when presented with text-rich scenes.
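As a concrete starting point for the ablation step, the following is a minimal sketch that zeroes out individual heads by masking their slice of the input to the attention output projection, using plain PyTorch forward pre-hooks on a HuggingFace LLaVA checkpoint. The model ID, the module path (which varies across transformers versions), and the COPY_HEADS list are illustrative assumptions, not values from the source paper.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumption: any open LLaVA checkpoint works
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Hypothetical (layer, head) pairs standing in for the 'copying heads'
# identified in step 1 of the method.
COPY_HEADS = [(14, 3), (20, 11)]

def make_ablation_hook(head_idx, head_dim):
    # The input to o_proj is the concatenation of per-head context vectors,
    # so zeroing one head_dim-wide slice removes that head's contribution.
    def hook(module, args):
        hidden = args[0].clone()
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return hook

handles = []
for layer, head in COPY_HEADS:
    # Module path assumed for LLaVA-1.5 in recent transformers releases;
    # adjust if your version nests the language model differently.
    attn = model.language_model.model.layers[layer].self_attn
    handles.append(
        attn.o_proj.register_forward_pre_hook(make_ablation_hook(head, attn.head_dim))
    )

# Run an OCR-style query with the heads ablated (hypothetical image path).
image = Image.open("street_sign.jpg")
prompt = "USER: <image>\nWhat does the sign say? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unablated model before the baseline run

For the attention-map step, the same model can be run with output_attentions=True and per-head attention mass summed separately over prompt-token and image-token positions, giving a direct test of the hypothesized shift on text-rich scenes.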
Expected Contribution
This would establish a critical boundary condition for the 'ablation' safety technique by showing that hallucination and text-reading capabilities are mechanistically entangled.
Required Resources
Access to open-weights VLMs (e.g., LLaVA, InstructBLIP), OCR benchmark datasets, and interpretability libraries (e.g., TransformerLens).
Source Paper
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models