Entropy-Driven Curriculum Learning for Masked Diffusion Training
Motivation
The source paper uses Denoising Entropy at inference time to guide decoding. The same signal, however, indicates that certain token dependencies are harder to learn than others. Incorporating this uncertainty metric into the training phase could focus the model on "hard" masking patterns earlier, rather than relying on uniform random masking.
Proposed Method
Modify the training loop of a masked diffusion model (MDM): periodically compute the Denoising Entropy over the training set (or a batch subset) using the current model checkpoint. Instead of masking uniformly at random, sample masks proportional to the resulting entropy map, masking high-entropy regions more frequently so the model is forced to learn robust representations for difficult features.
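A minimal sketch of the masking step, assuming per-token Denoising Entropy is approximated by the predictive entropy of the current checkpoint's logits (the softmax temperature and mask ratio below are illustrative hyperparameters, not values from the source paper):

```python
import numpy as np

def denoising_entropy(logits):
    """Per-token predictive entropy H = -sum_v p(v) log p(v), from raw logits."""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_weighted_mask(entropy, mask_ratio=0.3, temperature=1.0, rng=None):
    """Sample mask positions with probability softmax(entropy / temperature),
    so high-entropy (hard) tokens are masked more often than under uniform sampling."""
    rng = rng or np.random.default_rng()
    n = entropy.shape[0]
    k = max(1, int(mask_ratio * n))
    w = np.exp((entropy - entropy.max()) / temperature)
    probs = w / w.sum()
    idx = rng.choice(n, size=k, replace=False, p=probs)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

# Toy example: random logits stand in for a checkpoint's outputs
# on a 16-token sequence with a 50-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 50))
H = denoising_entropy(logits)
mask = entropy_weighted_mask(H, mask_ratio=0.25, rng=rng)
```

In a real training loop, `logits` would come from a forward pass of the frozen current checkpoint, the entropy map would be recomputed only every few epochs to amortize its cost, and `mask` would replace the uniform random mask fed to the MDM loss.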
Expected Contribution
Improved sample efficiency during training and better handling of complex dependencies (e.g., hands in images or logical connectors in text) in the final model.
Required Resources
A high-performance computing cluster for model training, plus standard MDM benchmark datasets (e.g., ImageNet or OpenWebText).
Source Paper
Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty