
Targeted Direct Preference Optimization (DPO) on visual-counterfactual examples can reprogram 'copying heads' to prioritize visual grounding over prompt context, offering a permanent weight-based solution.

Feasibility: 8 · Novelty: 7

Motivation

The original paper proposes a training-free intervention (head ablation). However, the existence of these heads suggests a misalignment in the training objective itself. Fine-tuning specifically to penalize the 'copying' behavior might repurpose these heads rather than destroying them.

Proposed Method

Construct a 'Visual Counterfactual' dataset in which prompts deliberately describe objects absent from the image. Fine-tune the model with DPO via LoRA, using the correct visual description as 'chosen' and the prompt-compliant hallucination as 'rejected.' Crucially, restrict gradient updates to the specific attention heads identified in the original paper, to test whether their function can be inverted.
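The core of the method can be sketched in plain Python: one hypothetical counterfactual preference pair (the filename, prompt, and captions below are invented for illustration) and the standard DPO loss, which penalizes the model when the prompt-compliant hallucination outscores the grounded description relative to a frozen reference model. In practice the log-probabilities would come from the policy and reference VLMs, and the LoRA/head-restriction machinery would live in the training loop.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    margin > 0 means fine-tuning has shifted probability mass toward
    the grounded 'chosen' answer relative to the reference model;
    the loss is -log sigmoid(beta * margin).
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical counterfactual pair: the prompt asserts an absent object.
pair = {
    "image": "kitchen_001.jpg",  # image contains no dog
    "prompt": "Describe the dog in the image.",
    "chosen": "There is no dog; the image shows an empty kitchen.",
    "rejected": "A small brown dog is sitting on the kitchen floor.",
}

# A positive margin (grounded answer up, hallucination down) yields a
# small loss; a zero margin recovers -log(0.5).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_logp_chosen=-7.0, ref_logp_rejected=-7.0)
```

The key design choice is that 'chosen' explicitly contradicts the prompt, so the preference signal directly opposes the copying behavior rather than merely rewarding fluent captions.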

Expected Contribution

Demonstration that 'hallucination circuits' are plastic and can be retrained to become 'grounding circuits' through targeted data intervention.

Required Resources

Data generation pipeline for counterfactuals, compute for LoRA fine-tuning (e.g., 2-4 A100s).

Source Paper

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
