QUICK REVIEW

[Paper Review] Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

Kaiwen Zhu, Quansheng Zeng|arXiv (Cornell University)|Feb 27, 2026

Generative Adversarial Networks and Image Synthesis | Cited by: 0

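One-Sentence Summary

The paper introduces MIGM-Shortcut, a lightweight neural model that learns latent controlled dynamics to predict feature updates in masked image generation, delivering substantial speedups (about 4–5×) on MaskGIT and Lumina-DiMOO with little impact on quality. It replaces most of the heavy base-model steps with the shortcut and periodically re-synchronizes with the base model where needed to keep error accumulation in check.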

ABSTRACT

Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention. In fact, there exists notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache the features to approximate future features. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and the failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity that suffices to capture the subtle dynamics while keeping lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
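The phrase "regresses the average velocity field of feature evolution" can be written down in one possible way as follows. The notation is an assumption made here for illustration and is not taken from the paper: $f_t$ denotes the continuous features at step $t$, $x_{t'}$ the tokens sampled on the way to a later step $t'$, and $S_\theta$ the lightweight model.

```latex
% Assumed notation (not from the paper): f_t = features at step t,
% x_{t'} = tokens sampled when moving to step t', S_\theta = lightweight model.
\bar{v}_{t \to t'} = \frac{f_{t'} - f_t}{t' - t},
\qquad
\hat{f}_{t'} = f_t + (t' - t)\, S_\theta\!\left(f_t, x_{t'}, t\right),
\qquad
\mathcal{L}(\theta) = \mathbb{E}\left[ \left\| S_\theta\!\left(f_t, x_{t'}, t\right) - \bar{v}_{t \to t'} \right\|_2^2 \right]
```

Under this reading, the model's output is the average rate of change of the features between two steps, so a single evaluation can stand in for one or more base-model steps. The exact inputs and parameterization used by the authors may differ.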

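Motivation and Objectives

Motivate and address the inefficiency that multi-step bidirectional attention causes in Masked Image Generation Models (MIGMs).
Develop a lightweight shortcut model that predicts feature evolution from previous features and newly sampled tokens.
Demonstrate speedups with controlled quality impact on representative MIGM architectures (MaskGIT and Lumina-DiMOO).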

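Proposed Method

Formulate MIGMs as a state-space model in which latent features evolve under a learned drift S_theta, conditioned on past features and newly decoded tokens.
Propose a lightweight shortcut model built from cross-attention and self-attention layers with a bottleneck, conditioned on time through sinusoidal embeddings and adaptive layer normalization.
Train the shortcut model by minimizing the mean squared error between the true next features and the shortcut's predicted update, keeping the base model frozen.
At inference, replace most heavy base-model steps with shortcut predictions and periodically refresh with the base model to prevent error accumulation.
Provide empirical evidence that feature trajectories are smooth and that sampling information is critical to the dynamics, which justifies the shortcut design.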
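The inference procedure described above (mostly shortcut steps, with periodic refreshes from the base model) can be sketched as a simple schedule. This is a toy illustration, not the authors' implementation: `run_schedule`, `resync_every`, `MASK`, and the three `toy_*` functions are hypothetical names, the "models" are trivial arithmetic stand-ins, and the paper's actual refresh rule may differ from a fixed interval.

```python
from typing import Callable, List, Tuple

Features = List[float]
Tokens = List[int]
MASK = -1  # hypothetical id for a position that is still masked


def run_schedule(
    base_model: Callable[[Tokens, int], Features],
    shortcut: Callable[[Features, Tokens, int], Features],
    sample_tokens: Callable[[Features, Tokens, int], Tokens],
    tokens: Tokens,
    num_steps: int,
    resync_every: int,
) -> Tuple[Tokens, Features, int]:
    """Run num_steps feature updates, calling the heavy base model only
    at step 0 and at every resync_every-th step (assumed fixed interval)."""
    features = base_model(tokens, 0)  # the first step always uses the base model
    base_calls = 1
    for t in range(1, num_steps):
        # Sample new discrete tokens from the current features.
        tokens = sample_tokens(features, tokens, t)
        if t % resync_every == 0:
            # Refresh: recompute features with the base model to limit drift.
            features = base_model(tokens, t)
            base_calls += 1
        else:
            # Cheap update from previous features plus the newly sampled tokens.
            features = shortcut(features, tokens, t)
    return tokens, features, base_calls


# --- Toy stand-ins (assumptions, for illustration only) ---
def toy_base_model(tokens: Tokens, t: int) -> Features:
    # "Exact" features: just the token ids as floats.
    return [float(tok) for tok in tokens]


def toy_shortcut(features: Features, tokens: Tokens, t: int) -> Features:
    # Approximate update: blend previous features with the new tokens.
    return [0.5 * f + 0.5 * tok for f, tok in zip(features, tokens)]


def toy_sample_tokens(features: Features, tokens: Tokens, t: int) -> Tokens:
    # Unmask the first still-masked position and fill it with the step index.
    new_tokens = list(tokens)
    for i, tok in enumerate(new_tokens):
        if tok == MASK:
            new_tokens[i] = t
            break
    return new_tokens


final_tokens, final_features, base_calls = run_schedule(
    toy_base_model, toy_shortcut, toy_sample_tokens,
    tokens=[MASK] * 8, num_steps=9, resync_every=4,
)
print(f"base-model calls: {base_calls} of 9 steps")
```

In this toy run the base model is called at steps 0, 4, and 8, and the remaining six steps use the shortcut. The shortcut receives both the previous features and the newly sampled tokens, which mirrors the conditioning described in the method.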

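Experimental Results

Research Questions

RQ1: Can a lightweight latent dynamics model accurately predict feature evolution in MIGMs when conditioned on both previous features and sampled tokens?
RQ2: How much acceleration can be achieved in MIGMs (MaskGIT and Lumina-DiMOO) with almost no degradation in generation quality?
RQ3: Does introducing sampling information through cross-attention in the shortcut model significantly affect performance?
RQ4: Under a fixed compute budget, what is the trade-off between the shortcut model's complexity and the speedup gained?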

Key Findings

Method	Configuration	Latency (ms) ↓	Speedup ↑	FID ↓
Vanilla	8 steps	26.1	1.92 ×	9.91
Vanilla	9 steps	29.4	1.70 ×	8.86
Vanilla	11 steps	35.9	1.40 ×	7.90
Vanilla	13 steps	42.5	1.18 ×	7.64
Vanilla	15 steps	50.1	1.00 ×	7.60
Vanilla	32 steps	104.6	0.48 ×	8.08
Shortcut	15 steps, B=7	25.9	1.94 ×	8.90
Shortcut	15 steps, B=8	28.8	1.74 ×	8.16
Shortcut	32 steps, B=8	33.7	1.49 ×	7.30
Shortcut	32 steps, B=9	36.8	1.36 ×	6.97
Shortcut	32 steps, B=12	45.9	1.09 ×	6.84
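The table's numbers can be sanity-checked with a short script. The sketch below assumes the Speedup column is the 15-step Vanilla latency (the row reported as 1.00 ×) divided by each row's latency, and then finds which rows are Pareto-optimal in latency and FID (lower is better for both). The helper names are made up for this example.

```python
# Rows copied from the table: (method, configuration, latency_ms, reported_speedup, fid)
ROWS = [
    ("Vanilla", "8 steps", 26.1, 1.92, 9.91),
    ("Vanilla", "9 steps", 29.4, 1.70, 8.86),
    ("Vanilla", "11 steps", 35.9, 1.40, 7.90),
    ("Vanilla", "13 steps", 42.5, 1.18, 7.64),
    ("Vanilla", "15 steps", 50.1, 1.00, 7.60),
    ("Vanilla", "32 steps", 104.6, 0.48, 8.08),
    ("Shortcut", "15 steps, B=7", 25.9, 1.94, 8.90),
    ("Shortcut", "15 steps, B=8", 28.8, 1.74, 8.16),
    ("Shortcut", "32 steps, B=8", 33.7, 1.49, 7.30),
    ("Shortcut", "32 steps, B=9", 36.8, 1.36, 6.97),
    ("Shortcut", "32 steps, B=12", 45.9, 1.09, 6.84),
]

# Assumption: speedups are measured against the 15-step Vanilla row.
BASELINE_LATENCY_MS = 50.1


def recomputed_speedup(latency_ms: float) -> float:
    """Speedup relative to the assumed baseline latency."""
    return BASELINE_LATENCY_MS / latency_ms


def pareto_front(rows):
    """Return rows that no other row beats on both latency and FID."""
    front = []
    for r in rows:
        dominated = any(
            o[2] <= r[2] and o[4] <= r[4] and (o[2] < r[2] or o[4] < r[4])
            for o in rows
        )
        if not dominated:
            front.append(r)
    return front


# Largest gap between the recomputed and the reported speedup.
max_gap = max(abs(recomputed_speedup(r[2]) - r[3]) for r in ROWS)
front = pareto_front(ROWS)

print(f"max |recomputed - reported| speedup: {max_gap:.4f}")
for method, config, latency, _, fid in front:
    print(f"Pareto-optimal: {method} ({config}), {latency} ms, FID {fid}")
```

Every recomputed speedup is within 0.01 of the reported value. The largest gap is the Shortcut row at 25.9 ms, which recomputes to about 1.93 against a reported 1.94, presumably because the latencies shown are rounded. All five Shortcut rows are Pareto-optimal, and every Vanilla row is dominated by at least one Shortcut row.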

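MIGM-Shortcut achieves roughly 4× acceleration on Lumina-DiMOO with little loss in text-to-image generation quality.
On MaskGIT, the shortcut model produces more robust images at higher speed and outperforms the vanilla configuration at a similar number of steps.
On Lumina-DiMOO, DiMOO-Shortcut reaches 4–5× acceleration with competitive ImageReward, CLIPScore, and UniPercept-IQA scores.
When conditioned on newly decoded tokens, a lightweight backbone (cross-attention plus self-attention with a bottleneck) is sufficient to capture the latent dynamics.
Periodic re-synchronization with the base model during inference mitigates the error accumulation caused by shortcut predictions.
Ablation studies confirm the importance of incorporating sampling information and show that the default shortcut design is Pareto-optimal.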

This review was generated by AI and reviewed by a human editor.