Targeted Direct Preference Optimization (DPO) on visual-counterfactual examples could reprogram 'copying heads' to prioritize visual grounding over prompt context, offering a permanent, weight-based alternative to inference-time ablation.
Motivation
The original paper proposes a training-free intervention (head ablation). However, the existence of these heads points to a misalignment in the training objective. Fine-tuning specifically to penalize the copying behavior might repurpose these heads rather than destroying them outright.
Proposed Method
Construct a 'Visual Counterfactual' dataset in which prompts deliberately describe objects not present in the image. Fine-tune the model with DPO via LoRA, treating the correct visual description as 'chosen' and the prompt-compliant hallucination as 'rejected.' Crucially, restrict gradient updates to the specific attention heads identified in the original paper, to test whether their function can be inverted (see the sketch below).
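The head-restricted update is the only non-standard piece of the recipe. The sketch below illustrates one way to do it, assuming a LLaMA-style language backbone (e.g., LLaVA) where each layer's q_proj rows and o_proj columns are laid out contiguously per query head and no grouped-query attention is used; the (layer, head) indices, module paths, and record schema are placeholders for illustration, not values from the source paper.

```python
# Minimal sketch of head-restricted DPO data and gradient masking.
# Assumptions: LLaMA-style attention modules at model.model.layers[i].self_attn,
# the listed projections are trainable, and COPYING_HEADS are placeholder
# (layer, head) indices standing in for the heads identified in the paper.
import torch

COPYING_HEADS = {(14, 3), (14, 11), (20, 7)}  # placeholder indices

# One counterfactual preference record: the prompt asserts an absent object,
# 'chosen' is the grounded description, 'rejected' the prompt-compliant hallucination.
example_record = {
    "image": "images/000123.jpg",                        # actually shows a dog on a couch
    "prompt": "Describe the cat sitting on the couch.",  # object not in the image
    "chosen": "There is no cat; a brown dog is lying on the couch.",
    "rejected": "A gray cat is curled up on the couch cushions.",
}


def head_slice_mask(grad: torch.Tensor, head_idxs, head_dim: int, dim: int) -> torch.Tensor:
    """Zero the gradient everywhere except the slices belonging to the targeted
    heads along `dim` (dim=0 for q_proj rows, dim=1 for o_proj columns)."""
    mask = torch.zeros_like(grad)
    for h in head_idxs:
        index = [slice(None)] * grad.dim()
        index[dim] = slice(h * head_dim, (h + 1) * head_dim)
        mask[tuple(index)] = 1.0
    return grad * mask


def restrict_updates_to_copying_heads(model, head_dim: int) -> None:
    """Attach gradient hooks so only the identified heads' parameter slices
    receive updates during fine-tuning; all other gradients are zeroed."""
    per_layer: dict[int, list[int]] = {}
    for layer_idx, head_idx in COPYING_HEADS:
        per_layer.setdefault(layer_idx, []).append(head_idx)

    for layer_idx, heads in per_layer.items():
        attn = model.model.layers[layer_idx].self_attn  # assumed module path
        # q_proj output rows and o_proj input columns are grouped per query head;
        # the projections must require gradients for the hooks to register.
        attn.q_proj.weight.register_hook(
            lambda g, hs=tuple(heads): head_slice_mask(g, hs, head_dim, dim=0)
        )
        attn.o_proj.weight.register_hook(
            lambda g, hs=tuple(heads): head_slice_mask(g, hs, head_dim, dim=1)
        )
```

With a LoRA setup, analogous masks would go on the adapter matrices instead (rows of lora_B for q_proj, columns of lora_A for o_proj), and the preference records would feed a DPO trainer such as TRL's DPOTrainer using the chosen/rejected fields above.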
Expected Contribution
Demonstration that 'hallucination circuits' are plastic and can be retrained to become 'grounding circuits' through targeted data intervention.
Required Resources
Data generation pipeline for counterfactuals, compute for LoRA fine-tuning (e.g., 2-4 A100s).
Source Paper
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models