QUICK REVIEW

[Paper Review] Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming

Liang Zheng, Bowen Shi|arXiv (Cornell University)|Jan 10, 2026
Parallel Computing and Optimization Techniques|Citations: 0
One-Sentence Summary

tldr: Mosaic is a memory-efficient inference system for diffusion-based LLMs. It combines a mask-only logits kernel, global memory planning, and lazy chunking to flatten memory peaks and greatly extend the maximum context length, while preserving accuracy and reducing latency.

ABSTRACT

Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs' dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm. Mosaic integrates a mask-only logits kernel to eliminate redundancy, a lazy chunking optimizer driven by an online heuristic search to adaptively mitigate dynamic peaks, and a global memory manager to resolve fragmentation via virtual addressing. Extensive evaluations demonstrate that Mosaic achieves an average 2.71× reduction in the memory peak-to-average ratio and increases the maximum inference sequence length supportable on identical hardware by 15.89-32.98×. This scalability is achieved without compromising accuracy and speed, and in fact reducing latency by 4.12%-23.26%.

Motivation and Objectives

  • Identify why memory is a bottleneck in long-context diffusion LLMs (dLLMs) and how it differs from autoregressive LLMs.
  • Design a memory-efficient inference system tailored to dLLMs that mitigates dynamic memory peaks and fragmentation.
  • Propose techniques to compute logits only for masked tokens and to manage memory globally across the computation graph.
  • Evaluate Mosaic’s impact on memory usage, maximum supportable context length, latency, and accuracy on multiple dLLMs.
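
The third objective, computing logits only for masked tokens, can be illustrated with a minimal pure-Python sketch: gather the hidden states at the masked positions first, then project only those rows through the output head. The function names, toy dimensions, and list-based matrices are illustrative assumptions; this is not Mosaic's fused gather-GEMM kernel, only the arithmetic it avoids.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive (rows x k) @ (k x cols) matrix product."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def full_logits(hidden: Matrix, lm_head: Matrix) -> Matrix:
    """Baseline: logits for every position, shape (seq_len, vocab)."""
    return matmul(hidden, lm_head)


def mask_only_logits(hidden: Matrix, lm_head: Matrix,
                     masked_positions: List[int]) -> Matrix:
    """Gather masked rows first, then project: shape (num_masked, vocab)."""
    gathered = [hidden[i] for i in masked_positions]
    return matmul(gathered, lm_head)


# Toy sizes (hypothetical): 4 positions, hidden size 2, vocabulary of 3.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
lm_head = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
masked_positions = [1, 3]  # positions still masked at this denoising step

full = full_logits(hidden, lm_head)
sparse = mask_only_logits(hidden, lm_head, masked_positions)
```

In this toy case the logits buffer shrinks from `seq_len × vocab` entries (4 × 3) to `num_masked × vocab` entries (2 × 3), while the rows that are kept match the corresponding rows of the full result.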

Proposed Method

  • Mask-only logits kernel to compute logits only for masked tokens via a gather-GEMM fused kernel.
  • Graph registrar to define a parameterized computation graph with symbolic dimensions for global visibility.
  • Lazy chunking optimizer with an online bottleneck-driven search to adaptively chunk memory-intensive operators.
  • Global memory manager with a single global reuse plan and VMM-based allocator to eliminate fragmentation.
  • Offline graph construction plus online runtime memory planning to realize minimal sufficient memory configuration.
  • Evaluation on representative dLLMs to measure memory, latency, and context length gains.
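
The idea of a single global reuse plan can be sketched with a generic lifetime-based planner: buffers whose lifetimes do not overlap may share the same address range in one arena. The code below is a standard greedy-by-size first-fit heuristic, swapped in for illustration; the `Buffer` fields, buffer names, and sizes are hypothetical, and the VMM-based allocator is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int       # size in arbitrary units (hypothetical)
    first_use: int  # first operator index that touches the buffer
    last_use: int   # last operator index that touches the buffer


def lifetimes_overlap(a: Buffer, b: Buffer) -> bool:
    return a.first_use <= b.last_use and b.first_use <= a.last_use


def plan_offsets(buffers: List[Buffer]) -> Tuple[Dict[str, int], int]:
    """Assign each buffer an offset in one arena; return (offsets, arena size)."""
    placed: List[Tuple[Buffer, int]] = []
    offsets: Dict[str, int] = {}
    # Place large buffers first so small ones can fill the remaining gaps.
    for buf in sorted(buffers, key=lambda b: b.size, reverse=True):
        # Address ranges held by buffers that are alive at the same time.
        busy = sorted((off, off + other.size)
                      for other, off in placed if lifetimes_overlap(buf, other))
        offset = 0
        for start, end in busy:
            if offset + buf.size <= start:
                break  # the gap before this range is large enough
            offset = max(offset, end)
        placed.append((buf, offset))
        offsets[buf.name] = offset
    arena = max((off + b.size for b, off in placed), default=0)
    return offsets, arena


# Toy activation lifetimes for one denoising step (hypothetical values).
buffers = [
    Buffer("attn_out", 4, 0, 1),
    Buffer("ffn_hidden", 16, 1, 2),
    Buffer("ffn_out", 4, 2, 3),
    Buffer("logits", 12, 3, 4),
]
offsets, arena = plan_offsets(buffers)
naive_total = sum(b.size for b in buffers)
```

Here the FFN intermediate and the logits never coexist, so both start at offset 0 and the arena needs 20 units instead of the 36 required when every buffer gets its own allocation.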

Experimental Results

Research Questions

  • RQ1: How does the memory bottleneck differ between autoregressive LLMs and diffusion-based LLMs in long-context scenarios?
  • RQ2: Can mask-only logits and global memory planning reduce memory peaks and fragmentation for dLLMs without hurting latency or accuracy?
  • RQ3: What is the impact of dynamic memory peaks on the maximum supportable context length, and how can adaptive chunking address it?
  • RQ4: How much can context length be extended on identical hardware with Mosaic, and what are the latency implications?
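
The chunking raised in RQ3 can be illustrated with a toy feed-forward layer: splitting the sequence into chunks leaves the output unchanged but caps the size of the intermediate activation. This is a sketch under stated assumptions: `ffn`, `chunked_ffn`, and `pick_chunk_size` are hypothetical names, the peak is counted in elements rather than bytes, and the simple halving rule stands in for the paper's online heuristic search, whose details are not given in this review.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def ffn(rows: Matrix, w_up: Matrix, w_down: Matrix) -> Tuple[Matrix, int]:
    """Toy FFN: ReLU(rows @ w_up) @ w_down.

    Returns the output and the element count of the intermediate activation.
    """
    d_model, d_ff, d_out = len(w_up), len(w_up[0]), len(w_down[0])
    hidden = [[max(0.0, sum(x[k] * w_up[k][j] for k in range(d_model)))
               for j in range(d_ff)] for x in rows]
    out = [[sum(h[k] * w_down[k][j] for k in range(d_ff))
            for j in range(d_out)] for h in hidden]
    return out, len(hidden) * d_ff


def chunked_ffn(rows: Matrix, w_up: Matrix, w_down: Matrix,
                chunk_size: int) -> Tuple[Matrix, int]:
    """Run the FFN over sequence chunks; the peak is the largest chunk's intermediate."""
    out: Matrix = []
    peak = 0
    for start in range(0, len(rows), chunk_size):
        chunk_out, chunk_peak = ffn(rows[start:start + chunk_size], w_up, w_down)
        out.extend(chunk_out)
        peak = max(peak, chunk_peak)
    return out, peak


def pick_chunk_size(seq_len: int, d_ff: int, budget_elems: int) -> int:
    """Chunk only when needed: halve the chunk until the intermediate fits."""
    chunk = seq_len
    while chunk > 1 and chunk * d_ff > budget_elems:
        chunk = (chunk + 1) // 2
    return chunk


# Toy sizes (hypothetical): 6 tokens, model width 2, FFN width 4.
rows = [[float(i), float(i % 3) - 1.0] for i in range(6)]
w_up = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -2.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]

full_out, full_peak = ffn(rows, w_up, w_down)
chunk = pick_chunk_size(seq_len=len(rows), d_ff=4, budget_elems=10)
chunk_out, chunk_peak = chunked_ffn(rows, w_up, w_down, chunk)
```

When the budget is large enough, `pick_chunk_size` returns the full sequence length and no chunking happens, which is the "lazy" part of the idea: the extra kernel launches are paid only when the unchunked operator would exceed the memory budget.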

Key Findings

  • Average 2.71× reduction in memory peak-to-average ratio.
  • Maximum inference sequence length supported on identical hardware increases by 15.89–32.98×.
  • Latency is reduced by 4.12%–23.26% on average compared to baselines.
  • Contexts beyond native training limits can be supported on three mainstream dLLMs (LLaDA-8B, Dream-7B, LLaDA-MoE).
  • Mask-only logits and global memory management substantially cut memory inflation and fragmentation compared to prior approaches.


This review was generated by AI and checked by a human editor.