QUICK REVIEW

[Paper Review] Training-Free Self-Correction for Multimodal Masked Diffusion Models

Yidong Ouyang, Panwen Hu|arXiv (Cornell University)|Feb 2, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

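One-Line Summary

This paper proposes a training-free self-correction framework for pre-trained multimodal masked diffusion models. It enables token remasking at inference time so that early errors can be corrected. Without any fine-tuning, it improves text-to-image generation and multimodal understanding while reducing the number of sampling steps needed.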

ABSTRACT

Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which may lead to error accumulation when early mistakes cannot be revised. In this work, we revisit existing self-correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps. Moreover, the proposed framework generalizes across different masked diffusion architectures, highlighting its robustness and practical applicability. Code can be found in https://github.com/huge123/FreeCorrection.

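Research Motivation and Objectives

Investigate error accumulation caused by parallel, irreversible token updates in masked diffusion models.
Develop a training-free self-correction mechanism that exploits the inductive biases of the pre-trained backbone.
Enable token remasking at inference time without modifying model parameters or using external evaluators.
Evaluate robustness and generalization across different masked diffusion architectures on multimodal tasks.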

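Proposed Method

Model-agnostic remasking at inference time that re-evaluates token probabilities at positions that have already been generated.
Identify low-confidence tokens as remasking candidates using predicted probabilities accumulated across steps.
Remask a fixed number of tokens per step according to a remasking schedule, balancing fidelity and speed.
Optionally select the tokens to remask using distributional uncertainty criteria (KL divergence, Wasserstein distance).
Algorithm 1 outlines training-free self-correction with deterministic or stochastic remasking options.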
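
The remasking selection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1. The function name `select_remask_positions`, the `prob_history` bookkeeping, the plain sum used as the accumulated score, and the inverse-score weighting in the stochastic branch are all hypothetical choices made for the example.

```python
import random


def select_remask_positions(prob_history, decoded, num_remask,
                            stochastic=False, rng=None):
    """Pick already-decoded positions to remask, using accumulated confidence.

    prob_history: list over sampling steps; each entry is a list giving, per
        position, the probability the model assigned at that step to the token
        currently held there (assumed to be tracked by the sampler).
    decoded: set of position indices that are currently unmasked.
    num_remask: number of positions to remask at this step (assumed to come
        from a remasking schedule).
    """
    # Accumulate the predicted probability across steps for each decoded position.
    scores = {i: sum(step[i] for step in prob_history) for i in decoded}
    k = min(num_remask, len(scores))
    if k == 0:
        return []

    if not stochastic:
        # Deterministic option: take the k positions with the lowest score
        # (ties broken by position index).
        return sorted(scores, key=lambda i: (scores[i], i))[:k]

    # Stochastic option (assumed form): sample k positions without
    # replacement, with weights inversely proportional to the score.
    rng = rng or random.Random()
    candidates = sorted(scores)
    chosen = []
    for _ in range(k):
        weights = [1.0 / (scores[i] + 1e-8) for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return sorted(chosen)


# Toy example: 4 positions, 2 steps of recorded probabilities.
history = [
    [0.90, 0.20, 0.80, 0.60],
    [0.95, 0.30, 0.85, 0.40],
]
print(select_remask_positions(history, decoded={0, 1, 2, 3}, num_remask=2))
# prints [1, 3]
```

Positions 1 and 3 have the lowest accumulated scores in the toy data, so they are the ones returned for remasking. A sampler would then reset those positions to the mask token and let the model predict them again in later steps.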

Figure 1: Average predicted probability of flipped tokens and correct tokens over 2000 samples. The x-axis denotes the time steps for generation (64 steps in total for text-to-image generation), while the y-axis denotes the average probability over all flipped positions and the correct position.

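Experimental Results

Research Questions

RQ1: Can low-confidence tokens be identified and corrected at inference time in multimodal masked diffusion models?
RQ2: Can effective remasking be achieved without fine-tuning by exploiting the inductive biases of the pre-trained backbone?
RQ3: How do remasking strategies (deterministic vs. stochastic, accumulated vs. current-step probability) affect generation quality and efficiency?
RQ4: Is the proposed method robust across different masked diffusion backbones?
RQ5: How does remasking-based self-correction affect sampling efficiency (reduction in the number of steps)?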

Key Findings

Method	Single	Two	Count	Color	Pos.	Attr.	Overall
Lumina-DiMOO	0.99	0.93	0.85	0.84	0.84	0.71	0.86
Lumina-DiMOO (ReMDM)	1.00	0.94	0.86	0.87	0.82	0.74	0.87
Lumina-DiMOO (Ours)	0.99	0.94	0.88	0.93	0.87	0.79	0.90
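
To make the size of the gains in the table easier to read, the per-category differences can be computed directly. The sketch below only copies the numbers from the table above. The variable names and the `deltas` helper are illustrative and not part of the paper.

```python
columns = ["Single", "Two", "Count", "Color", "Pos.", "Attr.", "Overall"]
baseline = [0.99, 0.93, 0.85, 0.84, 0.84, 0.71, 0.86]  # Lumina-DiMOO
remdm = [1.00, 0.94, 0.86, 0.87, 0.82, 0.74, 0.87]     # Lumina-DiMOO (ReMDM)
ours = [0.99, 0.94, 0.88, 0.93, 0.87, 0.79, 0.90]      # Lumina-DiMOO (Ours)


def deltas(method, reference):
    """Per-column score difference (method minus reference), rounded to 2 places."""
    return {c: round(m - r, 2) for c, m, r in zip(columns, method, reference)}


vs_baseline = deltas(ours, baseline)
vs_remdm = deltas(ours, remdm)
print(vs_baseline)
print(vs_remdm)
```

Against the vanilla baseline, the largest gains are in Color (+0.09) and Attr. (+0.08), and the Overall score rises by 0.04. Against ReMDM, the only decrease is in Single (0.99 vs. 1.00).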

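On GenEval, the method improves on both vanilla Lumina-DiMOO and the prior training-free baseline: the overall score rises from 0.86 (vanilla) and 0.87 (ReMDM) to 0.90.
Performance also improves over the baseline on multimodal understanding benchmarks (MMBench, SEED-Bench, MMMU).
Ablations show that accumulated likelihood combined with deterministic remasking performs best on most metrics.
With around 16 sampling steps, the method matches or exceeds the GenEval performance of the 64-step baseline.
Consistent gains on other backbones (e.g., MMaDA-8B-MixCoT) provide evidence that the method generalizes.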

Figure 2: The effectiveness of using accumulated predicted probability. The x-axis denotes the time steps for generation, while the y-axis denotes the average rank of the predicted probabilities of flipped tokens among correct tokens. The larger the rank is, the smaller the probability is.
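
The rank statistic in the Figure 2 caption can be illustrated with a small sketch. The exact definition used in the paper is not given here, so the code assumes that the rank of a flipped token is one plus the number of correct tokens with a strictly higher predicted probability. The function names are hypothetical.

```python
def rank_among(prob, reference_probs):
    """Rank of `prob` among `reference_probs` (rank 1 = highest probability).

    Assumed definition: 1 + number of reference values strictly greater than prob.
    """
    return 1 + sum(1 for p in reference_probs if p > prob)


def average_rank(flipped_probs, correct_probs):
    """Average rank of flipped-token probabilities among correct-token probabilities."""
    if not flipped_probs:
        raise ValueError("flipped_probs must not be empty")
    ranks = [rank_among(p, correct_probs) for p in flipped_probs]
    return sum(ranks) / len(ranks)


# Toy example: two flipped tokens scored against four correct tokens.
correct = [0.9, 0.8, 0.7, 0.6]
flipped = [0.65, 0.3]
print(average_rank(flipped, correct))  # prints 4.5
```

Under this definition, a larger average rank means the flipped tokens sit lower in the probability ordering, which matches the caption's note that a larger rank corresponds to a smaller probability.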


This review was generated by AI and checked by a human editor.