[Paper Review] Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming
tldr: Mosaic is a memory-efficient inference system for diffusion-based LLMs. It combines a mask-only logits kernel, global memory planning, and lazy chunking to flatten memory peaks and extend the maximum context length on identical hardware by 15.89-32.98×, while preserving accuracy and reducing latency by 4.12%-23.26%.
Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs' dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm. Mosaic integrates a mask-only logits kernel to eliminate redundancy, a lazy chunking optimizer driven by an online heuristic search to adaptively mitigate dynamic peaks, and a global memory manager to resolve fragmentation via virtual addressing. Extensive evaluations demonstrate that Mosaic achieves an average 2.71$\times$ reduction in the memory peak-to-average ratio and increases the maximum inference sequence length supportable on identical hardware by 15.89-32.98$\times$. This scalability is achieved without compromising accuracy and speed, and in fact reducing latency by 4.12%-23.26%.
Research Motivation and Objectives
- Identify why memory is a bottleneck in long-context diffusion LLMs (dLLMs) and how it differs from autoregressive LLMs.
- Design a memory-efficient inference system tailored to dLLMs that mitigates dynamic memory peaks and fragmentation.
- Propose techniques to compute logits only for masked tokens and to manage memory globally across the computation graph.
- Evaluate Mosaic’s impact on memory usage, maximum supportable context length, latency, and accuracy on multiple dLLMs.
Proposed Method
- Mask-only logits kernel: a fused gather-GEMM kernel that computes logits only for masked tokens.
- Graph registrar to define a parameterized computation graph with symbolic dimensions for global visibility.
- Lazy chunking optimizer with an online bottleneck-driven search to adaptively chunk memory-intensive operators.
- Global memory manager with a single global reuse plan and VMM-based allocator to eliminate fragmentation.
- Offline graph construction plus online runtime memory planning to realize a minimal sufficient memory configuration.
- Evaluation on representative dLLMs to measure memory, latency, and context length gains.
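To make the mask-only logits idea concrete, here is a minimal NumPy sketch. It is not Mosaic's kernel: the function names, shapes, and the `is_masked` flag array are hypothetical and chosen only for illustration. It also performs the gather and the matrix multiply as two separate steps, whereas the review describes a single fused gather-GEMM kernel, which would presumably avoid materializing the gathered copy.

```python
import numpy as np

def full_logits(hidden, lm_head):
    """Baseline: project every position onto the vocabulary."""
    return hidden @ lm_head  # shape: (seq_len, vocab_size)

def mask_only_logits(hidden, lm_head, is_masked):
    """Gather the still-masked positions first, then project only those rows.

    Illustrative only: a fused kernel would read the rows directly inside
    the GEMM instead of building the intermediate `gathered` array.
    """
    masked_idx = np.flatnonzero(is_masked)   # positions still to be denoised
    gathered = hidden[masked_idx]            # shape: (num_masked, d_model)
    return masked_idx, gathered @ lm_head    # shape: (num_masked, vocab_size)

# Toy sizes (assumed); real vocabularies are several orders of magnitude larger.
rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 16, 8, 32
hidden = rng.standard_normal((seq_len, d_model))
lm_head = rng.standard_normal((d_model, vocab_size))

is_masked = np.zeros(seq_len, dtype=bool)
is_masked[[3, 7, 12]] = True

masked_idx, small = mask_only_logits(hidden, lm_head, is_masked)
full = full_logits(hidden, lm_head)
```

In this sketch the logits buffer shrinks from `seq_len x vocab_size` to `num_masked x vocab_size`, and the rows it holds match the corresponding rows of the full projection.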
Experimental Results
Research Questions
- RQ1: How does the memory bottleneck differ between autoregressive LLMs and diffusion-based LLMs in long-context scenarios?
- RQ2: Can mask-only logits and global memory planning reduce memory peaks and fragmentation for dLLMs without hurting latency or accuracy?
- RQ3: What is the impact of dynamic memory peaks on the maximum supportable context length, and how can adaptive chunking address it?
- RQ4: How much can context length be extended on identical hardware with Mosaic, and what are the latency implications?
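As background for RQ3, the following NumPy sketch shows why chunking a memory-intensive operator lowers its peak. It is an assumption-laden illustration, not Mosaic's optimizer: the ReLU feed-forward block, the function names, and the fixed `chunk_size` are all hypothetical, and the online bottleneck-driven search that decides when and how to chunk is not shown.

```python
import numpy as np

def ffn(x, w_up, w_down):
    """Unchunked FFN: the (tokens, d_ff) intermediate exists all at once."""
    return np.maximum(x @ w_up, 0.0) @ w_down

def ffn_chunked(x, w_up, w_down, chunk_size):
    """Same result, but the intermediate is at most (chunk_size, d_ff).

    Splitting along the token axis is exact because an FFN processes
    each token row independently of the others.
    """
    out = np.empty((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    for start in range(0, x.shape[0], chunk_size):
        stop = start + chunk_size
        h = np.maximum(x[start:stop] @ w_up, 0.0)  # transient activation
        out[start:stop] = h @ w_down
    return out

# Toy sizes (assumed); 10 tokens with chunk_size=4 exercises a ragged last chunk.
rng = np.random.default_rng(0)
tokens, d_model, d_ff = 10, 8, 32
x = rng.standard_normal((tokens, d_model))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

reference = ffn(x, w_up, w_down)
chunked = ffn_chunked(x, w_up, w_down, chunk_size=4)
```

Smaller chunks lower the transient peak but add loop overhead, which is presumably why the review describes the chunking as lazy and adaptive rather than applied uniformly.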
Key Findings
- Average 2.71× reduction in memory peak-to-average ratio.
- Maximum inference sequence length supported on identical hardware increases by 15.89–32.98×.
- Latency is reduced by 4.12%–23.26% compared to baselines.
- Contexts beyond native training limits can be supported on three mainstream dLLMs (LLaDA-8B, Dream-7B, LLaDA-MoE).
- Mask-only logits and global memory management substantially cut memory inflation and fragmentation compared to prior approaches.
This review was generated by AI and checked by a human editor.