The attention heads responsible for prompt-induced hallucination are polysemantic and share circuitry with the model's Optical Character Recognition (OCR) capability, so statically ablating them will degrade performance on text-rich images.
Motivation
The original paper suggests ablating 'copying heads' to reduce hallucination. However, mechanisms that attend strongly to text features, whether in the prompt or in the image, are likely reused across tasks. If these same heads are crucial for reading text within images (OCR), ablating them constitutes a harmful trade-off rather than a clean fix.
Proposed Method
First, replicate the identification of 'copying heads' using the object-counting task. Second, evaluate the model's performance on OCR benchmarks (e.g., TextVQA, OCRBench) before and after ablating those specific heads; a head-ablation sketch follows below. Third, analyze attention maps to test whether these heads shift attention from prompt tokens to tokens of text rendered in the image when presented with text-rich scenes.
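As a concrete starting point for the ablation step, the following is a minimal sketch that zeroes out individual heads by masking their slice of the input to the attention output projection, using plain PyTorch forward pre-hooks on a HuggingFace LLaVA checkpoint. The model ID, the module path (which varies across transformers versions), and the COPY_HEADS list are illustrative assumptions, not values from the source paper.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumption: any open LLaVA checkpoint works
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Hypothetical (layer, head) pairs standing in for the 'copying heads'
# identified in step 1 of the method.
COPY_HEADS = [(14, 3), (20, 11)]

def make_ablation_hook(head_idx, head_dim):
    # The input to o_proj is the concatenation of per-head context vectors,
    # so zeroing one head_dim-wide slice removes that head's contribution.
    def hook(module, args):
        hidden = args[0].clone()
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return hook

handles = []
for layer, head in COPY_HEADS:
    # Module path assumed for LLaVA-1.5 in recent transformers releases;
    # adjust if your version nests the language model differently.
    attn = model.language_model.model.layers[layer].self_attn
    handles.append(
        attn.o_proj.register_forward_pre_hook(make_ablation_hook(head, attn.head_dim))
    )

# Run an OCR-style query with the heads ablated (hypothetical image path).
image = Image.open("street_sign.jpg")
prompt = "USER: <image>\nWhat does the sign say? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unablated model before the baseline run

For the attention-map step, the same model can be run with output_attentions=True and per-head attention mass summed separately over prompt-token and image-token positions, giving a direct test of the hypothesized shift on text-rich scenes.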
Expected Contribution
This would establish a critical boundary condition for the 'ablation' safety technique by showing that hallucination and text-reading capabilities are mechanistically entangled.
Required Resources
Access to open-weights VLMs (e.g., LLaVA, InstructBLIP), OCR benchmark datasets, and interpretability libraries (e.g., TransformerLens).
Source Paper
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models