QUICK REVIEW

[Paper Review] XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Yixin Dong, Charlie F. Ruan|arXiv (Cornell University)|Nov 22, 2024

Topic Modeling | Citations: 7

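One-Sentence Summary

XGrammar introduces a flexible and efficient CFG-based structured generation engine for LLMs by splitting the vocabulary into context-independent and context-dependent token sets. It combines an adaptive token mask cache, a persistent stack, context expansion, and CPU-GPU overlap to achieve substantial speedups over prior methods.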

ABSTRACT

The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can achieve near-zero overhead structured generation in end-to-end LLM serving.

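Research Motivation and Objectives

Motivate the need for reliable structured output (e.g., JSON, SQL, DSLs) in LLM inference.
Propose a CFG-based constrained decoding engine that reduces the runtime overhead of structured output.
Develop a method that precomputes and caches context-independent tokens to speed up runtime checks.
Build a persistent stack and context expansion to speed up validation of context-dependent tokens.
Demonstrate integration with LLM serving to achieve end-to-end speedups.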

Proposed Method

Divide the vocabulary into context-independent and context-dependent tokens at each pushdown automaton position.
Precompute and cache context-independent token validity in an adaptive token mask cache.
Expand grammar context to reduce the number of context-dependent tokens via context expansion.
Implement a persistent execution stack to enable fast branching and rollback of PDA states.
Overlap mask generation with GPU-based LLM inference to minimize overall overhead.
Co-design the grammar engine with LLM serving to achieve near-zero overhead during end-to-end generation.
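
The persistent execution stack in the list above can be illustrated with a small sketch. It assumes the stack is an immutable linked structure in which every push creates one new node that points at the existing top, so several branches share a common prefix and rollback is just a matter of keeping an older pointer. The names `StackNode`, `push`, `pop`, and `to_tuple` are hypothetical and are not taken from the XGrammar codebase.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StackNode:
    """One immutable frame; `parent` links toward the bottom of the stack."""
    state: str
    parent: Optional["StackNode"] = None


def push(top: Optional[StackNode], state: str) -> StackNode:
    # Pushing allocates one node and shares the whole existing stack.
    return StackNode(state, top)


def pop(top: StackNode) -> Optional[StackNode]:
    # Popping follows the parent pointer; nothing is copied or mutated.
    return top.parent


def to_tuple(top: Optional[StackNode]) -> Tuple[str, ...]:
    # Materialize the stack bottom-to-top for inspection.
    frames = []
    while top is not None:
        frames.append(top.state)
        top = top.parent
    return tuple(reversed(frames))


base = push(push(None, "json"), "object")
branch_a = push(base, "string")  # explore one expansion
branch_b = push(base, "number")  # explore another from the same prefix
rolled_back = pop(branch_a)      # rollback returns the shared prefix
```

Because no node is ever mutated, `branch_a` and `branch_b` can coexist while sharing `base`, which is the property that makes branching and rollback cheap in this toy model.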

Figure 1: Overview of our approach. Our key insight is to divide the vocabulary into context-independent and context-dependent tokens at each position within the pushdown automaton. We precompute and cache the context-independent tokens in an adaptive token mask cache, which is then retrieved at run
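
The split described in the caption can be sketched on a toy example. The code below assumes a single automaton position, "inside a string literal", with a simplified string rule (no escape sequences; any character except a double quote or a control character is allowed). A token that stays inside the string, or ends exactly at the closing quote, can be judged without the stack. A token that continues past the closing quote depends on the parent rule. The vocabulary, the `FOLLOW` table, and all function names are hypothetical stand-ins, and `FOLLOW` replaces real pushdown automaton execution with a simple lookup.

```python
from typing import Dict, List, Set, Tuple

VOCAB: List[str] = ["abc", '"', '",', '":', "\n", 'x"]']

# Hypothetical stand-in for running the rest of the automaton: text that
# may follow a closed string, keyed by the parent rule on top of the stack.
FOLLOW: Dict[str, Set[str]] = {"array_item": {",", "]"}, "object_key": {":"}}


def classify(token: str) -> Tuple[str, str]:
    """Classify a token at the 'inside a string' position.

    Returns (kind, remainder): kind is 'accept', 'reject', or 'dependent';
    remainder is the text left over after the closing quote.
    """
    for i, ch in enumerate(token):
        if ch == '"':
            rest = token[i + 1:]
            # Text past the closing quote can only be judged with the stack.
            return ("dependent", rest) if rest else ("accept", "")
        if ord(ch) < 0x20:
            return ("reject", "")  # raw control characters are never valid
    return ("accept", "")  # the token stays inside the string


def build_cache(vocab: List[str]) -> Tuple[Set[int], Dict[int, str]]:
    # Precompute once: ids accepted in every context, plus the ids (and
    # remainders) that must be rechecked at runtime.
    accepted: Set[int] = set()
    dependent: Dict[int, str] = {}
    for tid, tok in enumerate(vocab):
        kind, rest = classify(tok)
        if kind == "accept":
            accepted.add(tid)
        elif kind == "dependent":
            dependent[tid] = rest
    return accepted, dependent


def token_mask(cache, parent_rule: str, vocab_size: int) -> List[bool]:
    # Runtime: start from the cached set, then check only dependent tokens.
    accepted, dependent = cache
    allowed = set(accepted)
    for tid, rest in dependent.items():
        if rest in FOLLOW[parent_rule]:
            allowed.add(tid)
    return [tid in allowed for tid in range(vocab_size)]


cache = build_cache(VOCAB)
print(token_mask(cache, "array_item", len(VOCAB)))
# [True, True, True, False, False, True]
```

In this toy vocabulary only three of the six tokens need a runtime check; the other three are settled at preprocessing time, which is the effect the cache is meant to exploit.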

Experimental Results

Research Questions

RQ1: How can constrained decoding for CFG-based structured generation be made efficient in LLM inference?
RQ2: What cache and data-structure designs best reduce runtime token checks for CFG constraints?
RQ3: To what extent can grammar context expansion and persistent stacks reduce context-dependent token checks?
RQ4: How well can CPU-GPU overlap mitigate overhead in structured generation when deployed in end-to-end LLM serving?
RQ5: What end-to-end speedups are achievable when integrating XGrammar with existing LLM frameworks?

Key Findings

Task	Batch Size	Constraint Off	Constraint On
JSON Schema	1	6.2	6.3
JSON Schema	16	9.0	9.2
CFG (JSON)	1	6.3	6.3
CFG (JSON)	16	9.0	9.1

XGrammar achieves up to 100x reduction in per-token latency for CFG constrained generation versus current state-of-the-art methods.
Combined with an LLM inference engine, XGrammar attains up to 80x speedup in end-to-end structured generation on Llama-3.1 models.
Adaptive token mask caching reduces context-independent token checks to under 40 μs per token.
Context expansion reduces context-dependent tokens by about 90% for the JSON grammar on Llama-3.1.
A persistent execution stack enables fast state branching and rollback, improving preprocessing speed and token checks.
Overlapping mask generation with GPU inference yields near-zero overhead for structured generation in end-to-end serving.
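
The overlap between mask generation and model execution can be pictured as follows. This is only a scheduling sketch: `forward_pass` and `build_mask` are hypothetical placeholders that return fixed values, a worker thread stands in for the CPU-side grammar engine, and pure-Python threads do not give real parallelism. The point is the ordering: mask work is started first, the forward pass runs meanwhile, and the two are joined just before a token is chosen.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def forward_pass(step: int) -> List[float]:
    # Placeholder for the GPU forward pass; returns fake logits.
    return [1.0, 3.0 + step, 2.0]


def build_mask(step: int) -> List[bool]:
    # Placeholder for CPU-side grammar work; token 1 is banned on odd steps.
    return [True, step % 2 == 0, True]


def decode(num_steps: int) -> List[int]:
    chosen: List[int] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(num_steps):
            mask_future = pool.submit(build_mask, step)  # start mask work
            logits = forward_pass(step)                  # runs in the meantime
            mask = mask_future.result()                  # join before sampling
            masked = [x if ok else float("-inf") for x, ok in zip(logits, mask)]
            # Greedy choice among the tokens the mask allows.
            chosen.append(max(range(len(masked)), key=masked.__getitem__))
    return chosen


print(decode(4))  # [1, 2, 1, 2]
```

In a real serving stack the mask for a step would depend on the previously chosen token rather than on the step index; the step-based placeholder keeps the example deterministic.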

Figure 2: Constrained decoding with per-token mask. The per-token mask prevents the LLM from generating tokens that would be invalid according to the structure at that step.
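
A minimal sketch of applying a per-token mask, assuming the mask is a list of booleans aligned with the logits: disallowed tokens receive probability zero and the remaining probability mass is renormalized over the allowed tokens. The function name `masked_softmax` and the sample values are illustrative only.

```python
import math
from typing import List


def masked_softmax(logits: List[float], mask: List[bool]) -> List[float]:
    """Softmax over allowed tokens only; disallowed tokens get probability 0."""
    if not any(mask):
        raise ValueError("mask rejects every token")
    # Subtract the largest allowed logit for numerical stability.
    peak = max(x for x, ok in zip(logits, mask) if ok)
    exps = [math.exp(x - peak) if ok else 0.0 for x, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Token 1 has the highest logit but is invalid at this step.
probs = masked_softmax([2.0, 5.0, 1.0], [True, False, True])
```

Here token 1 can never be sampled even though its raw logit is the largest, so the output stays inside the structure the grammar allows.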

This review was generated by AI and checked by a human editor.