QUICK REVIEW

[Paper Review] Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming

Liang Zheng, Bowen Shi | arXiv (Cornell University) | Jan 10, 2026

Parallel Computing and Optimization Techniques | Citations: 0

One-Sentence Summary

Mosaic is a memory-efficient inference system for diffusion-based LLMs that lowers memory peaks, extends the maximum supportable context length, and reduces latency without harming accuracy, using a mask-only logits kernel, a lazy chunking optimizer, and global memory planning.

ABSTRACT

Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs' dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm. Mosaic integrates a mask-only logits kernel to eliminate redundancy, a lazy chunking optimizer driven by an online heuristic search to adaptively mitigate dynamic peaks, and a global memory manager to resolve fragmentation via virtual addressing. Extensive evaluations demonstrate that Mosaic achieves an average 2.71$\times$ reduction in the memory peak-to-average ratio and increases the maximum inference sequence length supportable on identical hardware by 15.89-32.98$\times$. This scalability is achieved without compromising accuracy and speed, and in fact reducing latency by 4.12%-23.26%.

Research Motivation and Objectives

Identify why memory is a bottleneck in long-context diffusion LLMs (dLLMs) and how it differs from autoregressive LLMs.
Design a memory-efficient inference system tailored to dLLMs that mitigates dynamic memory peaks and fragmentation.
Propose techniques to compute logits only for masked tokens and to manage memory globally across the computation graph.
Evaluate Mosaic’s impact on memory usage, maximum supportable context length, latency, and accuracy on multiple dLLMs.

Proposed Method

Mask-only logits kernel to compute logits only for masked tokens via a gather-GEMM fused kernel.
Graph registrar to define a parameterized computation graph with symbolic dimensions for global visibility.
Lazy chunking optimizer with an online bottleneck-driven search to adaptively chunk memory-intensive operators.
Global memory manager with a single global reuse plan and a VMM-based allocator to eliminate fragmentation.
Offline graph construction plus online runtime memory planning to realize a minimal sufficient memory configuration.
Evaluation on representative dLLMs to measure memory, latency, and context length gains.
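
The mask-only logits idea in the list above can be illustrated with a minimal pure-Python sketch. It is not Mosaic's fused GPU kernel: the gather and the matrix product run as two separate steps here, and the function names (`lm_head`, `mask_only_logits`), the toy dimensions, and the random weights are assumptions made for illustration only.

```python
import random

def lm_head(hidden_rows, weight):
    """Project hidden vectors (each of length d) to vocabulary logits.

    weight holds one row of length d per vocabulary entry.
    """
    return [
        [sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
        for row in hidden_rows
    ]

def mask_only_logits(hidden, weight, masked_positions):
    """Gather the hidden states of masked positions, then project only those."""
    gathered = [hidden[i] for i in masked_positions]  # gather step
    return lm_head(gathered, weight)                  # GEMM on the smaller matrix

# Toy sizes (hypothetical): 8 positions, hidden size 4, vocabulary of 16.
rng = random.Random(0)
seq_len, d_model, vocab = 8, 4, 16
hidden = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
weight = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab)]
masked_positions = [2, 5]  # positions assumed to be still masked at this step

full = lm_head(hidden, weight)                                # seq_len x vocab
partial = mask_only_logits(hidden, weight, masked_positions)  # 2 x vocab

# The mask-only result matches the corresponding rows of the full logits.
assert partial == [full[i] for i in masked_positions]

full_elems = len(full) * len(full[0])
partial_elems = len(partial) * len(partial[0])
print(full_elems, partial_elems)  # prints: 128 32
```

In this toy setting the logits buffer shrinks in proportion to the fraction of positions that are still masked, which is the redundancy the mask-only kernel is described as removing.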

Experimental Results

Research Questions

RQ1: How does the memory bottleneck differ between autoregressive LLMs and diffusion-based LLMs in long-context scenarios?
RQ2: Can mask-only logits and global memory planning reduce memory peaks and fragmentation for dLLMs without hurting latency or accuracy?
RQ3: What is the impact of dynamic memory peaks on the maximum supportable context length, and how can adaptive chunking address it?
RQ4: How much can context length be extended on identical hardware with Mosaic, and what are the latency implications?
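
The adaptive chunking raised in RQ3 can be pictured as a small search over chunk counts. The sketch below is a simplified greedy heuristic, not the paper's optimizer: the operator names, the per-operator sizes, the doubling rule, and the assumptions that operators run one after another and that splitting an operator into k chunks divides its transient footprint by k are all illustrative.

```python
def peak_sizes(op_sizes, chunks):
    """Transient footprint of each operator after chunking (ceiling division)."""
    return {name: -(-size // chunks[name]) for name, size in op_sizes.items()}

def lazy_chunk(op_sizes, budget, max_chunks=64):
    """Chunk only the current bottleneck operator until the peak fits the budget.

    Every operator starts unchunked. An operator is split further only while it
    is the one that determines the peak, so cheap operators are never chunked.
    """
    chunks = {name: 1 for name in op_sizes}
    while True:
        peaks = peak_sizes(op_sizes, chunks)
        bottleneck = max(peaks, key=peaks.get)
        if peaks[bottleneck] <= budget:
            return chunks, peaks[bottleneck]
        if chunks[bottleneck] >= max_chunks:
            raise ValueError(f"{bottleneck} does not fit with {max_chunks} chunks")
        chunks[bottleneck] *= 2

# Hypothetical transient activation sizes in MiB for one denoising step.
ops = {"attention": 900, "ffn": 4000, "logits": 6000}
plan, peak = lazy_chunk(ops, budget=1200)
print(plan, peak)  # prints: {'attention': 1, 'ffn': 4, 'logits': 8} 1000
```

With these made-up numbers the bottleneck alternates between the logits and the FFN as each is chunked, which mirrors the toggling peak described in the abstract; attention stays unchunked because it never becomes the bottleneck.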

Key Findings

Average 2.71× reduction in memory peak-to-average ratio.
Maximum inference sequence length supported on identical hardware increases by 15.89–32.98×.
Latency is reduced by 4.12%–23.26% compared to baselines.
Contexts beyond native training limits can be supported on three mainstream dLLMs (LLaDA-8B, Dream-7B, LLaDA-MoE).
Mask-only logits and global memory management substantially cut memory inflation and fragmentation compared to prior approaches.
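
The benefit of planning memory reuse globally is easier to picture with a toy plan. The sketch below assigns arena offsets to tensors from their lifetimes using a greedy first-fit rule. It is a generic static-planning illustration under assumed tensor names, sizes, and lifetimes, and it does not reproduce Mosaic's global reuse plan or its VMM-based allocator.

```python
from typing import NamedTuple

class Tensor(NamedTuple):
    name: str
    size: int    # bytes (toy values)
    start: int   # first step at which the tensor is live
    end: int     # last step at which the tensor is live (inclusive)

def plan_offsets(tensors):
    """Greedy first-fit: place larger tensors first, each at the lowest offset
    that does not collide with an already placed tensor whose lifetime overlaps."""
    placed = []   # list of (offset, tensor)
    offsets = {}
    for t in sorted(tensors, key=lambda t: -t.size):
        # Address ranges occupied while t is live, sorted by start offset.
        busy = sorted(
            (off, off + p.size) for off, p in placed
            if not (p.end < t.start or t.end < p.start)
        )
        offset = 0
        for lo, hi in busy:
            if offset + t.size <= lo:
                break              # t fits in the gap before this range
            offset = max(offset, hi)
        placed.append((offset, t))
        offsets[t.name] = offset
    arena = max(off + t.size for off, t in placed)
    return offsets, arena

# Hypothetical tensors of one layer plus the final logits.
tensors = [
    Tensor("hidden", 100, 0, 3),
    Tensor("attn_scores", 300, 0, 1),
    Tensor("ffn_intermediate", 400, 2, 2),
    Tensor("logits", 500, 3, 3),
]
offsets, arena = plan_offsets(tensors)
naive = sum(t.size for t in tensors)
print(arena, naive)  # prints: 600 1300
```

With these toy lifetimes the three large tensors never coexist, so they share offset 0 and the plan needs 600 bytes instead of the 1300 bytes required when every tensor gets its own buffer.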

This review was generated by AI and checked by a human editor.