[Paper Explained] CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration
CIAR introduces an on-device interval-based uncertainty quantifier and cloud-enhanced decoding to accelerate autoregressive image generation, achieving roughly a 2.18× speed-up and 70% fewer cloud requests while preserving image quality.
Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework, CIAR, which uses on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images, and the inherent spatial redundancy that makes homogeneous regions highly predictable while object boundaries remain highly uncertain. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier that replaces conventional discrete solution sets with continuous probability intervals, which speeds up processing and makes it feasible for large visual vocabularies. Additionally, we incorporate an interval-enhanced decoding module to further speed up decoding, while a distribution alignment training strategy maintains visual fidelity and semantic consistency. Extensive experiments demonstrate that CIAR achieves a 2.18× speed-up and reduces cloud requests by 70% while preserving image quality compared to existing methods.
Research Motivation and Objectives
- Motivate on-device acceleration for high-fidelity visual AR models with large token vocabularies and spatial redundancy.
- Develop an interval-based uncertainty quantifier (Inter-Head) to selectively verify tokens and reduce unnecessary cloud communication.
- Design interval-enhanced cloud decoding and a distribution alignment training strategy to maintain coherence between device and cloud outputs.
- Demonstrate speedups and reduced cloud usage without sacrificing visual fidelity on standard benchmarks.
Proposed Method
- Propose an on-device Interval Head (Inter-Head) that outputs center and radius logits to form a probability interval for each token.
- Define a probability interval [p_t^l, p_t^u] and an interval-based uncertainty score that combines total interval width and dispersion.
- Introduce cloud-enhanced decoding with prefix injection and interval-feature conditioning to align device and cloud distributions during decoding.
- Adopt an interval-aware Distributionally Robust Optimization (Inter-DRO) loss to train the Inter-Head for distribution alignment with the cloud model.
- Implement interval feature projection to condition the cloud decoder, reducing drift and improving coherence.
- Conduct extensive experiments on multiple cloud models (LlamaGen-XL stages I/II, Anole) with MS-COCO captions as prompts.
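The interval construction and uncertainty score described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not give the exact formulas, so the softmax bound used here (the standard interval bound for a softmax over a box of logits), the softplus mapping for the radius, the use of standard deviation as "dispersion", and the names `probability_intervals`, `interval_uncertainty`, and `dispersion_weight` are all hypothetical.

```python
import math


def softplus(x):
    # Numerically stable softplus; maps a raw radius logit to a positive radius.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def probability_intervals(center_logits, radius_logits):
    """Turn center/radius logits into per-token bounds [p_lower, p_upper].

    Assumption: each logit lies in [center - radius, center + radius], and the
    bounds are obtained by pushing that box through a softmax.
    """
    radii = [softplus(r) for r in radius_logits]
    lo = [c - r for c, r in zip(center_logits, radii)]
    hi = [c + r for c, r in zip(center_logits, radii)]
    shift = max(hi)  # subtract the max logit for numerical stability
    e_lo = [math.exp(v - shift) for v in lo]
    e_hi = [math.exp(v - shift) for v in hi]
    sum_lo, sum_hi = sum(e_lo), sum(e_hi)
    p_lower, p_upper = [], []
    for i in range(len(lo)):
        # Lowest probability: own logit at its minimum, all others at their maximum.
        p_lower.append(e_lo[i] / (e_lo[i] + (sum_hi - e_hi[i])))
        # Highest probability: own logit at its maximum, all others at their minimum.
        p_upper.append(e_hi[i] / (e_hi[i] + (sum_lo - e_lo[i])))
    return p_lower, p_upper


def interval_uncertainty(p_lower, p_upper, dispersion_weight=1.0):
    """Hypothetical score: total interval width plus weighted spread of widths."""
    widths = [u - l for l, u in zip(p_lower, p_upper)]
    total_width = sum(widths)
    mean_width = total_width / len(widths)
    dispersion = math.sqrt(sum((w - mean_width) ** 2 for w in widths) / len(widths))
    return total_width + dispersion_weight * dispersion


# Toy 4-token vocabulary: a peaked, tight prediction vs. a flat, wide one.
confident = probability_intervals([8.0, 0.0, 0.0, 0.0], [-5.0] * 4)
uncertain = probability_intervals([1.0, 0.8, 0.9, 1.1], [1.0] * 4)
print(interval_uncertainty(*confident), interval_uncertainty(*uncertain))
```

In this toy setup the peaked prediction with small radii yields a lower score than the flat prediction with large radii, which is the behavior one would want from a signal that decides whether a token needs cloud verification. Note that the score costs one pass over the vocabulary and does not enumerate candidate token sets.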

Experimental Results
Research Questions
- RQ1: How can interval-based uncertainty estimation on-device reduce redundant verification in cloud-device AR image generation?
- RQ2: Can interval-enhanced decoding with distribution alignment maintain image fidelity while reducing cloud interactions?
- RQ3: What is the trade-off between prefix guidance rate and latency when using cloud-prefix injection in CIAR?
- RQ4: How does continuous interval-based uncertainty compare to discrete solution enumeration in terms of latency and quality for large token vocabularies?
Key Findings
| Metric | Models | Methods | CLIP (↑) | FID (↓) | F1 (↑) | HPSv2 (↑) | Latency (×) | Steps (×) | Cloud Call |
|---|---|---|---|---|---|---|---|---|---|
| Base | LlamaGen(Stage I) | Base | 0.3161 | 23.6900 | 0.6097 | 22.74 | x1.00 | x1.00 | 100.00% |
| Eagle2 | LlamaGen(Stage I) | Ours | 0.3159 | 24.2459 | 0.5997 | 22.48 | x2.53 | x3.00 | 30.44% |
| Lantern | LlamaGen(Stage I) | Ours | 0.3159 | 24.5828 | 0.5796 | 22.03 | x1.70 | x2.05 | 52.34% |
| Entropy-Lens | LlamaGen(Stage I) | Ours | 0.3132 | 24.2459? | 0.5997? | 22.48 | x2.53 | x3.00 | 30.44% |
| CoDe (N = 0.3) | LlamaGen(Stage I) | Ours | 0.2822 | 40.0709 | 0.5350 | 23.84 | x1.00 | x1.00 | 100.00% |
| | LlamaGen(Stage I) | Ours | 0.3159 | 24.2459 | 0.5997 | 22.48 | x2.53 | x3.00 | 30.44% |
| Base | LlamaGen(Stage II) | Base | 0.2822 | 40.0709 | 0.5350 | 23.84 | x1.00 | x1.00 | 100.00% |
| Eagle2 | LlamaGen(Stage II) | Ours | 0.3159 | 23.7103 | 0.6117 | 22.88 | x1.02 | x1.19 | 84.55% |
| Lantern | LlamaGen(Stage II) | Ours | 0.3181 | 23.9510 | 0.5969 | 22.92 | x1.25 | x1.81 | 50.35% |
| Entropy-Lens | LlamaGen(Stage II) | Ours | 0.2966 | 32.3533 | 0.5600 | 22.34 | x1.57 | x2.53 | 39.86% |
| CoDe (N = 0.3) | LlamaGen(Stage II) | Ours | 0.2781 | 36.7520 | 0.5597 | 21.94 | x1.55 | x2.89 | 30.00% |
| Anole | Anole | Ours | 0.3171 | 23.8593 | 0.5970 | 23.14 | x1.87 | x3.29 | 29.88% |
| Base | Anole | Base | 0.3215 | 19.9455 | 0.6544 | 23.52 | x1.00 | x1.00 | 100.00% |
| Eagle2 | Anole | Ours | 0.3159 | 23.7103 | 0.6117 | 22.88 | x1.02 | x1.09 | 91.98% |
| Lantern | Anole | Ours | 0.3181 | 23.9510 | 0.5969 | 22.92 | x1.25 | x1.81 | 50.35% |
| Entropy-Lens | Anole | Ours | 0.2966 | 32.3533 | 0.5600 | 22.34 | x1.57 | x2.53 | 39.86% |
| CoDe (N = 0.3) | Anole | Ours | 0.2781 | 36.7520 | 0.5597 | 21.94 | x1.55 | x2.89 | 30.00% |
- CIAR achieves a 2.18× speed-up and reduces cloud requests by 70% versus state-of-the-art speculative decoding methods.
- CIAR maintains or improves visual fidelity metrics (CLIP, FID, F1, HPSv2) across evaluated models.
- The Inter-Head interval-based uncertainty provides better balance between local token acceptance and cloud offloading than entropy-based or random baselines.
- Interval-enhanced decoding with interval feature conditioning sustains distribution alignment and improves detail coherence.
- A prefix injection strategy reduces unnecessary cloud requests while preserving image quality, with an optimal prefix rate balancing guidance and latency.
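The trade-off between local token acceptance and cloud offloading noted above can be made concrete with a small routing sketch. It assumes that the Cloud Call column counts the fraction of decoding steps sent to the cloud model, and that routing is a simple threshold on a per-token uncertainty score; the function names and the threshold-selection rule are hypothetical and are not taken from the paper.

```python
def route_tokens(uncertainty_scores, threshold):
    # True means "send this token to the cloud for verification".
    return [score > threshold for score in uncertainty_scores]


def cloud_call_rate(decisions):
    # Fraction of tokens that were offloaded to the cloud.
    return sum(decisions) / len(decisions)


def threshold_for_budget(uncertainty_scores, target_rate):
    """Pick a threshold so that at most about target_rate of tokens are offloaded."""
    ranked = sorted(uncertainty_scores, reverse=True)
    k = int(round(target_rate * len(ranked)))
    if k <= 0:
        return ranked[0]  # no score exceeds the maximum, so nothing is offloaded
    if k >= len(ranked):
        return float("-inf")  # every score exceeds -inf, so everything is offloaded
    return ranked[k]  # only scores strictly above the (k+1)-th largest are offloaded


# Toy scores: a few uncertain "boundary" tokens among many predictable ones.
scores = [0.05, 0.9, 0.1, 0.02, 0.7, 0.03, 0.04, 0.8, 0.06, 0.01]
tau = threshold_for_budget(scores, target_rate=0.3)
decisions = route_tokens(scores, tau)
print(tau, cloud_call_rate(decisions))
```

Raising the threshold keeps more tokens on the device and lowers the cloud call rate, at the risk of accepting tokens the cloud model would have rejected; lowering it does the opposite.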

This explainer was generated by AI and reviewed by a human editor.