[Paper Review] Progressive Checkerboards for Autoregressive Multiscale Image Generation
The paper introduces a multiscale autoregressive sampler using a fixed progressive checkerboard ordering to enable parallel sampling within scales while conditioning between scales, achieving competitive ImageNet 256×256 results with far fewer sampling steps.
A key challenge in autoregressive image generation is to efficiently sample independent locations in parallel, while still modeling mutual dependencies with serial conditioning. Some recent works have addressed this by conditioning between scales in a multiscale pyramid. Others have looked at parallelizing samples in a single image using regular partitions or randomized orders. In this work we examine a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. Our ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision at each step. This enables effective conditioning both between and within scales. Intriguingly, we find evidence that in our balanced setting, a wide range of scale-up factors lead to similar results, so long as the total number of serial steps is constant. On class-conditional ImageNet, our method achieves competitive performance compared to recent state-of-the-art autoregressive systems with like model capacity, using fewer sampling steps.
Motivation and Objectives
- Motivate and demonstrate a multiscale autoregressive sampler that samples locations in parallel within scales without losing conditioning power across scales.
- Propose a fixed progressive checkerboard ordering that maintains balance in a quadtree subdivision to control parallelism and conditioning.
- Investigate how between-scale and within-scale conditioning interact and how total sampling steps affect performance.
- Show competitive ImageNet 256×256 class-conditioned results with fewer sampling steps compared to recent autoregressive models.
Proposed Method
- Transformer-based autoregressor with blockwise causal mask and progressive checkerboard sampling blocks.
- Upsample latent codes from previous scale to condition current scale; split locations into P blocks and process blocks serially while sampling each block in parallel.
- Use a balanced progressive checkerboard order (divide-and-conquer, visiting quadrants in the diagonal pattern top-left, bottom-right, top-right, bottom-left) to ensure spatial balance across quadtree levels.
- Train with ground-truth codes in parallel through all scales; use cross-scale inputs combining upsampled previous-scale latents and current-scale outputs with learned position embeddings.
- Experiment with RoPE mixing to attend to both current and previous-block locations; apply classifier-free guidance with a staged CFG schedule.
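
The ordering and block structure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`checkerboard_rank`, `block_ids`, `blockwise_causal_mask`) are hypothetical, the grid side is assumed to be a power of two, the coarsest quadtree level is assumed to vary fastest in the ordering, and the mask assumes that tokens in the same block may attend to each other's inputs (whether that is safe depends on how the inputs are built).

```python
import numpy as np

# Quadrant visiting pattern as (row_bit, col_bit): TL, BR, TR, BL.
QUADRANT_ORDER = [(0, 0), (1, 1), (0, 1), (1, 0)]


def checkerboard_rank(n):
    """Return an (n, n) array with each cell's position in the ordering.

    Assumption: n is a power of two. The coarsest quadtree level is the
    least significant digit of the rank, so consecutive ranks rotate
    through the top-level quadrants first.
    """
    levels = n.bit_length() - 1
    if n < 1 or 2 ** levels != n:
        raise ValueError("n must be a power of two")
    rows, cols = np.indices((n, n))
    rank = np.zeros((n, n), dtype=np.int64)
    for level in range(levels):
        shift = levels - 1 - level          # level 0 reads the top bit
        rbit = (rows >> shift) & 1
        cbit = (cols >> shift) & 1
        digit = np.zeros((n, n), dtype=np.int64)
        for d, (rb, cb) in enumerate(QUADRANT_ORDER):
            digit[(rbit == rb) & (cbit == cb)] = d
        rank += digit * 4 ** level
    return rank


def block_ids(rank, num_blocks):
    """Split the ordering into num_blocks equal, consecutive blocks."""
    total = rank.size
    if total % num_blocks != 0:
        raise ValueError("num_blocks must divide the number of cells")
    return rank // (total // num_blocks)


def blockwise_causal_mask(blocks):
    """mask[i, j] is True when token i may attend to token j.

    Assumption: a token sees every token in the same or an earlier block.
    """
    flat = blocks.reshape(-1)
    return flat[None, :] <= flat[:, None]


rank = checkerboard_rank(8)
blocks = block_ids(rank, num_blocks=4)   # P = 4 serial steps at this scale
mask = blockwise_causal_mask(blocks)     # shape (64, 64)
```

Under these assumptions, the first eight positions on a 4×4 grid land on the cells where row + column is even, that is, a checkerboard, and each of the four blocks on the 8×8 grid places the same number of cells in every top-level quadrant.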

Experimental Results
Research Questions
- RQ1: How does a progressive checkerboard ordering influence parallelism and conditioning in multiscale autoregressive generation?
- RQ2: How do between-scale and within-scale conditioning interact, and how does the total number of sampling steps affect performance?
- RQ3: What scale-up factors best balance conditioning and parallelism for high-quality image synthesis on ImageNet?
- RQ4: Can fewer sampling steps achieve competitive results compared to state-of-the-art autoregressive models?
Key Findings
| Model | Type/Tokenizer | Params | FID | IS | Precision | Recall | Steps | Time (s) |
|---|---|---|---|---|---|---|---|---|
| DiT-XL/2 | Diffu-KL | 675M | 2.24 | 278.2 | 0.83 | 0.57 | 1×250 | 11.9 |
| MAR-L | MAR-KL | 479M | 1.78 | 296.0 | 0.81 | 0.60 | 64×100 | 26.4 |
| GtR | MAR-KL | 479M | 1.81 | 297.4 | — | — | 32×30 | — |
| xAR | Flow-KL | 608M | 1.28 | 292.5 | 0.82 | 0.62 | 4×50 | 7.7 |
| LlamaGen-L | AR-VQ | 343M | 3.07 | 256.1 | 0.83 | 0.52 | 576 | 12.58 |
| VAR-d16 | AR-VQ | 310M | 3.30 | 274.4 | 0.84 | 0.51 | 10 | 0.12 |
| PAR-L-4x | AR-VQ | 343M | 3.76 | 218.9 | 0.84 | 0.50 | 147 | 3.38 |
| RandAR-L | AR-VQ | 343M | 2.55 | 288.8 | 0.81 | 0.58 | 88 | 1.97 |
| NAR-L | AR-VQ | 372M | 3.06 | 263.9 | 0.81 | 0.53 | 31 | 1.01 |
| ARPG-L | AR-VQ | 320M | 2.30 | 297.7 | 0.82 | 0.56 | 32 | 0.58 |
| LPD-L | AR-VQ | 337M | 2.40 | 284.5 | 0.81 | 0.57 | 20 | 0.28 |
| Checkerboard-L 2x cfg=1.4 | AR-VQ | 343M | 2.72 | 302.5 | 0.81 | 0.56 | 17 | 0.52 |
| Checkerboard-L 2x cfg=1.5 | AR-VQ | 343M | 2.83 | 318.2 | 0.82 | 0.57 | 17 | 0.52 |
| Checkerboard-L 4x cfg=1.7 | AR-VQ | 343M | 2.79 | 311.5 | 0.80 | 0.57 | 17 | 0.52 |
- A spatially balanced progressive checkerboard ordering enables parallel sampling within each scale and maintains conditioning across scales.
- For multiscale setups, the total number of sequential steps largely determines performance, with scale factors 2, 3, and 4 achieving similar results when total steps are fixed.
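- Checkerboard-L models with 2x and 4x scaling achieve competitive FID/IS in 17 total steps and 0.52 s, which is fewer steps and less time than most AR-VQ baselines in the table (LlamaGen-L, PAR-L-4x, RandAR-L, NAR-L, ARPG-L); VAR-d16 (10 steps, 0.12 s) and LPD-L (20 steps, 0.28 s) remain faster.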
- Across scale factors, using around 17 total steps yields near-optimal performance; increasing steps beyond this yields diminishing returns.
- RoPE mixing provided no clear performance gains, suggesting input conditioning suffices for early layers to extract necessary conditioning information.
- On ImageNet 256×256, Checkerboard-L models reach FID of 2.72–2.83 and IS of 302.5–318.2 with 17 steps, using fewer steps and less time than PAR-L-4x (147 steps, 3.38 s) and RandAR-L (88 steps, 1.97 s), although RandAR-L attains a lower FID (2.55).
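
The step and speed comparisons above can be checked directly against the table. The short sketch below copies the Steps and Time (s) columns for the AR-VQ rows and expresses each baseline relative to Checkerboard-L (17 steps, 0.52 s). The helper name `relative_cost` is hypothetical, and the numbers are only as reliable as the table they are copied from.

```python
# (steps, time in seconds) copied from the AR-VQ rows of the table above.
AR_VQ_BASELINES = {
    "LlamaGen-L": (576, 12.58),
    "VAR-d16": (10, 0.12),
    "PAR-L-4x": (147, 3.38),
    "RandAR-L": (88, 1.97),
    "NAR-L": (31, 1.01),
    "ARPG-L": (32, 0.58),
    "LPD-L": (20, 0.28),
}
CHECKERBOARD_L = (17, 0.52)


def relative_cost(baselines, reference):
    """Return {name: (step_ratio, time_ratio)} relative to the reference.

    A ratio above 1 means the baseline needs more steps or more time
    than the reference model.
    """
    ref_steps, ref_time = reference
    return {
        name: (steps / ref_steps, time / ref_time)
        for name, (steps, time) in baselines.items()
    }


ratios = relative_cost(AR_VQ_BASELINES, CHECKERBOARD_L)
for name, (step_ratio, time_ratio) in sorted(ratios.items()):
    print(f"{name:12s} steps x{step_ratio:5.2f}  time x{time_ratio:5.2f}")
```

Read this way, only VAR-d16 uses fewer steps than Checkerboard-L, and only VAR-d16 and LPD-L report a lower time; every other AR-VQ baseline listed needs both more steps and more time.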

This review was generated by AI and checked by a human editor.