[Paper Review] Accelerating Masked Image Generation by Learning Latent Controlled Dynamics
The paper introduces MIGM-Shortcut, a lightweight neural model that learns latent controlled dynamics to predict feature updates in masked image generation. It is evaluated on MaskGIT and Lumina-DiMOO and reports roughly 4–5× speedup on Lumina-DiMOO with minimal quality loss. At inference, the shortcut replaces most of the expensive base-model steps, and the base model is re-run periodically to limit error accumulation.
Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention. In fact, there exists notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache the features to approximate future features. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and the failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity that suffices to capture the subtle dynamics while keeping lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
Research Motivation and Objectives
- Address the inefficiency of Masked Image Generation Models (MIGMs), which stems from running bi-directional attention over many sampling steps.
- Develop a lightweight shortcut model that leverages both past features and newly sampled tokens to predict feature evolution.
- Demonstrate acceleration on representative MIGM architectures (MaskGIT and Lumina-DiMOO) with controlled quality impact.
Proposed Method
- Formulate MIGM as a state-space model where latent features evolve under a learned drift S_theta conditioned on past features and newly decoded tokens.
- Propose a lightweight shortcut model consisting of cross-attention and self-attention layers with a bottleneck, conditioned on time via sinusoidal embeddings and adaptive layer norm.
- Train the shortcut by minimizing the MSE between the base model's true next feature and the next feature predicted by the shortcut (previous feature plus predicted update), while keeping the base model frozen.
- During inference, replace most heavy base-model steps with the shortcut predictions, periodically refreshing with the base model to prevent error accumulation.
- Provide empirical evidence that feature trajectories are smooth and that the sampling process critically informs dynamics, justifying the shortcut design.
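The training objective and inference procedure described in the list above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `base_call_steps`, `shortcut_mse_loss`, `generate`, and every `toy_*` function are hypothetical names, the evenly spaced refresh schedule is an assumption, and `budget` is assumed to mean the number of full base-model calls.

```python
from typing import Callable, Dict, List, Set, Tuple

MASK = -1  # hypothetical id marking a still-masked token position


def base_call_steps(total_steps: int, budget: int) -> Set[int]:
    """Pick `budget` evenly spaced steps at which the full base model runs.

    Assumption: step 0 always uses the base model, because the shortcut
    needs features from a previous step.
    """
    if not 1 <= budget <= total_steps:
        raise ValueError("budget must be between 1 and total_steps")
    return {(i * total_steps) // budget for i in range(budget)}


def shortcut_mse_loss(prev: List[float], update: List[float], target: List[float]) -> float:
    """MSE between the predicted next feature (prev + update) and the true one."""
    return sum((p + u - y) ** 2 for p, u, y in zip(prev, update, target)) / len(target)


def generate(tokens: List[int], total_steps: int, budget: int,
             base_step: Callable, shortcut_step: Callable,
             sample: Callable) -> Tuple[List[int], Dict[str, int]]:
    """Run masked generation, using the base model only on refresh steps."""
    refresh = base_call_steps(total_steps, budget)
    calls = {"base": 0, "shortcut": 0}
    features: List[float] = []
    new_tokens: List[Tuple[int, int]] = []
    for t in range(total_steps):
        if t in refresh:
            # Expensive step: recompute features from the current tokens.
            features = base_step(tokens, t)
            calls["base"] += 1
        else:
            # Cheap step: predict an update from previous features and
            # the tokens decoded at the previous step.
            update = shortcut_step(features, new_tokens, t)
            features = [f + u for f, u in zip(features, update)]
            calls["shortcut"] += 1
        tokens, new_tokens = sample(features, tokens, t, total_steps)
    return tokens, calls


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_base_step(tokens: List[int], t: int) -> List[float]:
    return [float(t + 1) if tok == MASK else float(tok) for tok in tokens]


def toy_shortcut_step(features, new_tokens, t) -> List[float]:
    return [0.1] * len(features)  # a learned model would attend to new_tokens


def toy_sample(features, tokens, t, total_steps):
    """Unmask positions left to right until (t + 1) / total_steps are decoded."""
    tokens = list(tokens)
    target = ((t + 1) * len(tokens)) // total_steps
    decoded = sum(tok != MASK for tok in tokens)
    new_tokens = []
    for pos, tok in enumerate(tokens):
        if decoded >= target:
            break
        if tok == MASK:
            tokens[pos] = int(features[pos]) % 16
            new_tokens.append((pos, tokens[pos]))
            decoded += 1
    return tokens, new_tokens


final_tokens, calls = generate([MASK] * 30, 15, 7,
                               toy_base_step, toy_shortcut_step, toy_sample)
```

With 15 steps and a budget of 7, the loop makes 7 base-model calls and 8 shortcut calls. In a real system the cost saving comes from the shortcut call being much cheaper than a base-model call.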
Experimental Results
Research Questions
- RQ1: Can a lightweight latent dynamics model accurately predict feature evolution in MIGMs when conditioned on both previous features and sampled tokens?
- RQ2: How much acceleration can be achieved in MIGMs (MaskGIT and Lumina-DiMOO) with negligible degradation in generation quality?
- RQ3: Does incorporating sampling information via cross-attention in the shortcut model substantially impact performance?
- RQ4: What is the trade-off between shortcut model complexity and acceleration gains under fixed computational budgets?
Key Findings
| Method | Configuration | Latency (ms) ↓ | Speedup ↑ | FID ↓ |
|---|---|---|---|---|
| Vanilla | 8 steps | 26.1 | 1.92 × | 9.91 |
| Vanilla | 9 steps | 29.4 | 1.70 × | 8.86 |
| Vanilla | 11 steps | 35.9 | 1.40 × | 7.90 |
| Vanilla | 13 steps | 42.5 | 1.18 × | 7.64 |
| Vanilla | 15 steps | 50.1 | 1.00 × | 7.60 |
| Vanilla | 32 steps | 104.6 | 0.48 × | 8.08 |
| Shortcut | 15 steps, B=7 | 25.9 | 1.94 × | 8.90 |
| Shortcut | 15 steps, B=8 | 28.8 | 1.74 × | 8.16 |
| Shortcut | 32 steps, B=8 | 33.7 | 1.49 × | 7.30 |
| Shortcut | 32 steps, B=9 | 36.8 | 1.36 × | 6.97 |
| Shortcut | 32 steps, B=12 | 45.9 | 1.09 × | 6.84 |
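The Speedup column appears to be the latency of the 15-step vanilla row divided by each row's latency. That reading is inferred from the numbers rather than stated in the table, and the short check below tests it; `BASELINE_MS`, `ROWS`, and `speedup` are illustrative names.

```python
# Latency of the 15-step vanilla row, assumed to be the 1.00x reference.
BASELINE_MS = 50.1

# (method, configuration, latency in ms, reported speedup), copied from the table.
ROWS = [
    ("Vanilla", "8 steps", 26.1, 1.92),
    ("Vanilla", "9 steps", 29.4, 1.70),
    ("Vanilla", "11 steps", 35.9, 1.40),
    ("Vanilla", "13 steps", 42.5, 1.18),
    ("Vanilla", "15 steps", 50.1, 1.00),
    ("Vanilla", "32 steps", 104.6, 0.48),
    ("Shortcut", "15 steps, B=7", 25.9, 1.94),
    ("Shortcut", "15 steps, B=8", 28.8, 1.74),
    ("Shortcut", "32 steps, B=8", 33.7, 1.49),
    ("Shortcut", "32 steps, B=9", 36.8, 1.36),
    ("Shortcut", "32 steps, B=12", 45.9, 1.09),
]


def speedup(latency_ms: float, baseline_ms: float = BASELINE_MS) -> float:
    """Speedup relative to the baseline latency."""
    return baseline_ms / latency_ms


# Largest disagreement between the recomputed and the reported speedup.
max_gap = max(abs(speedup(lat) - reported) for _, _, lat, reported in ROWS)
```

Every recomputed value agrees with the reported one to within 0.01, which is consistent with the latencies having been rounded to one decimal place before publication.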
- MIGM-Shortcut achieves over 4× acceleration of text-to-image generation on Lumina-DiMOO with negligible quality loss.
- In MaskGIT, the shortcut consistently produces better images than vanilla sampling at comparable or lower latency.
- In Lumina-DiMOO, DiMOO-Shortcut reaches 4–5× speedup with competitive ImageReward, CLIPScore, and UniPercept-IQA scores.
- A lightweight backbone (cross-attention + self-attention with bottleneck) suffices to capture latent dynamics when conditioned on newly decoded tokens.
- Regularly re-syncing with the base model during inference mitigates error accumulation from the shortcut predictions.
- Ablation confirms the importance of incorporating sampling information and shows Pareto-optimality of the default shortcut design.
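The Pareto-frontier language used in this review can be made concrete with the latency and FID columns of the table in this section. The sketch below is a separate check from the ablation mentioned in the last bullet: it keeps only the rows that no other row beats on both latency and FID. `dominates` and `pareto_front` are illustrative names.

```python
from typing import List, Tuple

Row = Tuple[str, str, float, float]  # method, configuration, latency (ms), FID

# Rows copied from the latency/FID table; lower is better for both metrics.
ROWS: List[Row] = [
    ("Vanilla", "8 steps", 26.1, 9.91),
    ("Vanilla", "9 steps", 29.4, 8.86),
    ("Vanilla", "11 steps", 35.9, 7.90),
    ("Vanilla", "13 steps", 42.5, 7.64),
    ("Vanilla", "15 steps", 50.1, 7.60),
    ("Vanilla", "32 steps", 104.6, 8.08),
    ("Shortcut", "15 steps, B=7", 25.9, 8.90),
    ("Shortcut", "15 steps, B=8", 28.8, 8.16),
    ("Shortcut", "32 steps, B=8", 33.7, 7.30),
    ("Shortcut", "32 steps, B=9", 36.8, 6.97),
    ("Shortcut", "32 steps, B=12", 45.9, 6.84),
]


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is no worse than `b` on both metrics and strictly better on one."""
    no_worse = a[2] <= b[2] and a[3] <= b[3]
    strictly_better = a[2] < b[2] or a[3] < b[3]
    return no_worse and strictly_better


def pareto_front(rows: List[Row]) -> List[Row]:
    """Rows that are not dominated by any other row."""
    return [r for r in rows if not any(dominates(other, r) for other in rows)]


front = pareto_front(ROWS)
for method, config, latency, fid in front:
    print(f"{method:8s} {config:15s} {latency:6.1f} ms  FID {fid:.2f}")
```

On these eleven rows, every non-dominated configuration is a Shortcut row, and each vanilla row is dominated by at least one Shortcut row.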
This review was generated by AI and checked by a human editor.