QUICK REVIEW

[Paper Review] Training-Free Self-Correction for Multimodal Masked Diffusion Models

Yidong Ouyang, Panwen Hu|arXiv (Cornell University)|Feb 2, 2026
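Generative Adversarial Networks and Image Synthesis|Cited by 0
One-Sentence Summary

The paper proposes a training-free self-correction framework for pre-trained multimodal masked diffusion models. At inference time it re-masks already generated tokens so that early mistakes can be revised, with no fine-tuning required. This improves text-to-image generation and multimodal understanding, and it allows sampling with fewer steps.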

ABSTRACT

Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which may lead to error accumulation when early mistakes cannot be revised. In this work, we revisit existing self-correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps. Moreover, the proposed framework generalizes across different masked diffusion architectures, highlighting its robustness and practical applicability. Code can be found in https://github.com/huge123/FreeCorrection.

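Motivation and Objectives

  • Investigate how errors accumulate under the parallel, irreversible token updates of masked diffusion models.
  • Develop a training-free self-correction mechanism that exploits the inductive biases of the pre-trained backbone.
  • Enable token re-masking at inference time without modifying model parameters or relying on an external evaluator.
  • Evaluate robustness and generalization on multimodal tasks across different masked diffusion architectures.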

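Proposed Method

  • Model-agnostic re-masking at inference time: the token probabilities at already generated positions are re-evaluated.
  • The predicted probabilities accumulated across sampling steps are used to identify low-confidence tokens, which are then re-masked.
  • A re-masking schedule re-masks a fixed number of tokens at each step to balance fidelity against speed.
  • Optionally, a distributional uncertainty criterion (KL divergence or Wasserstein distance) selects the tokens to re-mask.
  • Algorithm 1 outlines the training-free self-correction procedure and offers both deterministic and stochastic re-masking.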
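The selection step described in the list above can be sketched roughly as follows. This is a minimal illustration and not the paper's Algorithm 1. The function names `update_accumulated` and `select_remask_positions` are hypothetical. Two choices are assumptions made only for this sketch: treating "accumulated" probability as a running mean over steps, and weighting the stochastic variant by one minus that mean.

```python
import numpy as np


def update_accumulated(accum_prob, step_prob, step_index):
    """Running mean of the probability the model assigns to each committed token.

    Assumption for this sketch: "accumulated" is taken to be the mean over
    steps 0..step_index; the paper may aggregate differently.
    """
    accum_prob = np.asarray(accum_prob, dtype=float)
    step_prob = np.asarray(step_prob, dtype=float)
    return accum_prob + (step_prob - accum_prob) / (step_index + 1)


def select_remask_positions(accum_prob, is_generated, num_remask,
                            stochastic=False, rng=None):
    """Return sorted indices of generated positions to re-mask.

    Deterministic mode takes the num_remask positions with the lowest
    accumulated probability. Stochastic mode samples positions without
    replacement, with low-confidence positions more likely to be drawn.
    """
    accum_prob = np.asarray(accum_prob, dtype=float)
    candidates = np.flatnonzero(np.asarray(is_generated, dtype=bool))
    k = min(num_remask, candidates.size)
    if k == 0:
        return np.array([], dtype=int)

    scores = accum_prob[candidates]
    if stochastic:
        rng = rng if rng is not None else np.random.default_rng()
        # Hypothetical weighting: lower confidence means higher sampling weight.
        weights = (1.0 - scores) + 1e-8
        weights = weights / weights.sum()
        chosen = rng.choice(candidates.size, size=k, replace=False, p=weights)
    else:
        chosen = np.argsort(scores, kind="stable")[:k]
    return np.sort(candidates[chosen])
```

In a sampling loop, one would call `update_accumulated` after each forward pass, then reset the positions returned by `select_remask_positions` to the mask token so the model can predict them again in later steps.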
Figure 1: Average predicted probability of flipped tokens and correct tokens over 2000 samples. The x-axis denotes the time steps for generation (64 steps in total for text-to-image generation), while the y-axis denotes the average probability over all flipped positions and the correct position.

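Experimental Results

Research Questions

  • RQ1: Can training-free self-correction identify and revise low-confidence tokens at inference time in multimodal masked diffusion models?
  • RQ2: Can the inductive biases of the pre-trained backbone support effective re-masking without fine-tuning?
  • RQ3: How do re-masking strategies (deterministic vs. stochastic, accumulated vs. current-step probability) affect generation quality and efficiency?
  • RQ4: Is the proposed method robust across different masked diffusion backbones?
  • RQ5: How does re-masking-based self-correction affect sampling efficiency (fewer steps)?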

Key Findings

Method                 Single Object  Two Object  Counting  Colors  Position  Attribute  Overall
Lumina-DiMOO           0.99           0.93        0.85      0.84    0.84      0.71       0.86
Lumina-DiMOO (ReMDM)   1.00           0.94        0.86      0.87    0.82      0.74       0.87
Lumina-DiMOO (Ours)    0.99           0.94        0.88      0.93    0.87      0.79       0.90
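  • On GenEval, the method improves consistently over both the original Lumina-DiMOO and prior training-free methods.
  • On multimodal understanding benchmarks (MMBench, SEED-Bench, MMMU), the method improves over the baseline.
  • Ablations show that accumulated likelihood combined with deterministic re-masking performs best on most metrics.
  • With only 16 sampling steps, the method matches or exceeds the GenEval performance of the baseline at 64 steps.
  • Evidence of generalization across backbones (e.g., MMaDA-8B-MixCoT) is provided, with consistent gains.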
Figure 2: The effectiveness of using accumulated predicted probability. The x-axis denotes the time steps for generation, while the y-axis denotes the average rank of the predicted probabilities of flipped tokens among correct tokens. The larger the rank is, the smaller the probability is.
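The rank statistic described in the Figure 2 caption can be illustrated with a small sketch. It assumes that a flipped token's rank is one plus the number of correct tokens with a strictly higher predicted probability, so a larger rank means a smaller probability, as the caption states. The function name `average_rank_of_flipped` and the tie handling are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np


def average_rank_of_flipped(probs, is_flipped):
    """Mean rank of flipped-token probabilities among correct-token probabilities.

    Rank 1 means the flipped token is at least as probable as every correct
    token; larger ranks mean lower probability. Returns NaN if nothing is flipped.
    """
    probs = np.asarray(probs, dtype=float)
    is_flipped = np.asarray(is_flipped, dtype=bool)
    flipped = probs[is_flipped]
    correct = np.sort(probs[~is_flipped])
    if flipped.size == 0:
        return float("nan")
    # Count correct tokens whose probability is strictly higher than each flipped one.
    higher = correct.size - np.searchsorted(correct, flipped, side="right")
    return float(np.mean(1 + higher))
```

Under this definition, a selection score that separates flipped tokens from correct ones well would push the average rank up, which is the reading the caption suggests for the plotted curves.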


This review was generated by AI and reviewed by human editors.