
Self-Supervised Cross-Modal Imputation within the Diffusion Process can maintain policy performance during force sensor dropouts or visual occlusions.

Feasibility: 9 · Novelty: 8

Motivation

The paper assumes both modalities are always available, but in real deployments cameras get occluded by the arm, and force sensors drift or fail. Since diffusion models are generative, they should be capable of 'hallucinating' the missing modality (e.g., inferring contact forces solely from visual deformation or arm kinematics) if explicitly trained to do so.

Proposed Method

Modify the ImplicitRDP training loop to apply random 'modality dropout' (masking either the vision or the force input) while keeping the diffusion target (the action) unchanged. Additionally, add an auxiliary loss term that forces the model to reconstruct the masked modality's latent from the visible one. At evaluation time, measure the policy's success rate with one sensor effectively switched off during inference, as sketched below.
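
A minimal PyTorch sketch of the intended training step, under stated assumptions: the module and helper names (`ModalityDropoutPolicy`, `vision_enc`, `force_enc`, `diffusion_head`, the two cross-modal predictor heads, and the `add_noise` helper) are hypothetical stand-ins, since ImplicitRDP's actual interfaces are not reproduced here, and the noising step uses a generic DDPM linear-beta schedule rather than the paper's scheduler.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic DDPM linear-beta schedule (assumption; not ImplicitRDP's scheduler).
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)

def add_noise(x0, noise, t):
    """Forward-diffuse clean actions x0 to timestep t."""
    a = alphas_cumprod.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

class ModalityDropoutPolicy(nn.Module):
    """Hypothetical wrapper around ImplicitRDP-style encoders and a diffusion head."""
    def __init__(self, vision_enc, force_enc, diffusion_head, latent_dim=256):
        super().__init__()
        self.vision_enc = vision_enc          # images  -> (B, latent_dim)
        self.force_enc = force_enc            # wrenches -> (B, latent_dim)
        self.diffusion_head = diffusion_head  # denoises actions given fused latents
        # Small heads that impute one modality's latent from the other.
        self.vision_from_force = nn.Linear(latent_dim, latent_dim)
        self.force_from_vision = nn.Linear(latent_dim, latent_dim)

    def forward(self, images, forces, noisy_actions, timesteps,
                drop_vision=False, drop_force=False):
        z_v = self.vision_enc(images)
        z_f = self.force_enc(forces)
        # If a modality is dropped, impute its latent instead of zeroing it.
        z_v_in = self.vision_from_force(z_f) if drop_vision else z_v
        z_f_in = self.force_from_vision(z_v) if drop_force else z_f
        cond = torch.cat([z_v_in, z_f_in], dim=-1)
        eps_pred = self.diffusion_head(noisy_actions, timesteps, cond)
        return eps_pred, z_v, z_f, z_v_in, z_f_in

def training_step(policy, batch, p_drop=0.3, aux_weight=0.1):
    images, forces, actions = batch["images"], batch["forces"], batch["actions"]
    B = actions.shape[0]
    timesteps = torch.randint(0, 1000, (B,), device=actions.device)
    noise = torch.randn_like(actions)
    noisy_actions = add_noise(actions, noise, timesteps)

    # Randomly mask at most one modality per step (never both).
    drop_vision = torch.rand(1).item() < p_drop
    drop_force = (not drop_vision) and (torch.rand(1).item() < p_drop)

    eps_pred, z_v, z_f, z_v_in, z_f_in = policy(
        images, forces, noisy_actions, timesteps,
        drop_vision=drop_vision, drop_force=drop_force)

    # Diffusion target (the action noise) is unchanged regardless of masking.
    loss = F.mse_loss(eps_pred, noise)
    # Auxiliary loss: reconstruct the masked latent from the visible modality.
    # Targets are detached so gradients flow only into the predictor heads
    # and the visible modality's encoder.
    if drop_vision:
        loss = loss + aux_weight * F.mse_loss(z_v_in, z_v.detach())
    if drop_force:
        loss = loss + aux_weight * F.mse_loss(z_f_in, z_f.detach())
    return loss
```

At inference, calling the policy with `drop_force=True` (or `drop_vision=True`) routes conditioning through the imputed latent, which is exactly the sensor-off evaluation described above.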

Expected Contribution

A robust multi-modal policy architecture that degrades gracefully rather than failing catastrophically during sensor malfunction.

Required Resources

Standard training compute; the existing ImplicitRDP dataset, re-processed with modality masking.

Source Paper

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
