QUICK REVIEW

[Paper Review] XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Yixin Dong, Charlie F. Ruan|arXiv (Cornell University)|Nov 22, 2024

Topic Modeling | Cited by 7

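One-Sentence Summary

XGrammar is a flexible and efficient CFG-based structured generation engine for large language models (LLMs) that splits the vocabulary into context-independent and context-dependent token sets and combines an adaptive cache, a persistent stack, context expansion, and CPU-GPU overlap to achieve significant speedups over prior methods.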

ABSTRACT

The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structure generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-independent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can generate near-zero overhead structure generation in end-to-end low-LLM serving.

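Research Motivation and Goals

Motivate the need for reliable structured outputs (e.g., JSON, SQL, DSLs) in LLM inference.
Propose a CFG-based constrained decoding engine that lowers the runtime overhead of structured output.
Develop techniques that precompute the validity of context-independent tokens and speed up runtime checks.
Build a persistent stack and context expansion to speed up validation of context-dependent tokens.
Demonstrate integration with LLM serving to achieve end-to-end speedups in constrained generation.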

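Proposed Method

Partition the vocabulary into context-independent and context-dependent tokens at each pushdown automaton (PDA) position.
Precompute the validity of context-independent tokens and cache it in an adaptive token mask cache.
Expand the grammar context through context expansion to reduce the number of context-dependent tokens.
Implement a persistent execution stack that supports fast branching and rollback of PDA states.
Overlap mask generation with GPU-based LLM inference to minimize overall overhead.
Co-design the grammar engine with LLM serving to achieve near-zero overhead in end-to-end generation.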
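The persistent execution stack in the list above can be illustrated with a small sketch. This is not XGrammar's implementation: the class `PersistentStack`, the `EMPTY` handle, and the rule names are hypothetical, and the only assumption is that stacks are stored as nodes with parent pointers in an append-only pool so that several stacks can share a common prefix. Under that assumption, branching is a single push onto a shared node and rollback is simply reusing an older handle.

```python
EMPTY = -1  # hypothetical handle for the empty stack


class PersistentStack:
    """Append-only pool of stack nodes; a stack is identified by its top node's index."""

    def __init__(self):
        self.values = []   # value stored at each node
        self.parents = []  # index of the node below, or EMPTY

    def push(self, top, value):
        """Return the handle of a new stack: `value` on top of stack `top`."""
        self.values.append(value)
        self.parents.append(top)
        return len(self.values) - 1

    def pop(self, top):
        """Return (top value, handle of the remaining stack) without mutating anything."""
        if top == EMPTY:
            raise IndexError("pop from empty stack")
        return self.values[top], self.parents[top]

    def to_list(self, top):
        """Materialize a stack bottom-to-top, for inspection only."""
        out = []
        while top != EMPTY:
            out.append(self.values[top])
            top = self.parents[top]
        return out[::-1]


pool = PersistentStack()
base = pool.push(pool.push(EMPTY, "json"), "object")
branch_a = pool.push(base, "string")  # explore one alternative
branch_b = pool.push(base, "number")  # explore another; `base` is shared, not copied
print(pool.to_list(branch_a))  # ['json', 'object', 'string']
print(pool.to_list(branch_b))  # ['json', 'object', 'number']
print(pool.to_list(base))      # rollback: the old handle is still valid
```

Because no node is ever modified after creation, every earlier handle remains a valid snapshot, which is what makes rollback free in this sketch.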

Figure 1: Overview of our approach. Our key insight is to divide the vocabulary into context-independent and context-dependent tokens at each position within the pushdown automaton. We precompute and cache the context-independent tokens in an adaptive token mask cache, which is then retrieved at run
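The split described in the Figure 1 caption can be sketched on a toy example. Everything here is an illustrative assumption rather than XGrammar's actual code: the single automaton position ("inside a JSON string body"), the functions `classify` and `build_cache_entry`, and the rule that the cache stores whichever of the accepted or rejected lists is shorter. A token that stays inside the string rule can be judged without looking at the stack; a token that runs past the closing quote depends on what the parent rule expects next.

```python
def classify(token: str) -> str:
    """Classify a token at the toy position 'inside a JSON string body'."""
    for i, ch in enumerate(token):
        if ch == '"':
            # The string rule ends here. Leftover characters must be matched
            # by the parent rule, which this position alone cannot see.
            return "accept" if i == len(token) - 1 else "dependent"
        if ch == "\n":
            # Simplified rule: a raw newline is never valid inside the string.
            return "reject"
    return "accept"


def build_cache_entry(vocab):
    """Precompute one cache entry, storing the shorter of the two decided lists."""
    accepted = [t for t in vocab if classify(t) == "accept"]
    rejected = [t for t in vocab if classify(t) == "reject"]
    dependent = [t for t in vocab if classify(t) == "dependent"]
    if len(accepted) <= len(rejected):
        return {"kind": "accepted", "tokens": accepted, "dependent": dependent}
    return {"kind": "rejected", "tokens": rejected, "dependent": dependent}


vocab = ["abc", '"', '",', "\n", 'x"', '":']
entry = build_cache_entry(vocab)
print(entry["kind"], entry["tokens"], entry["dependent"])
```

At runtime only the tokens in `entry["dependent"]` would still need to be checked against the full stack; the rest of the mask comes straight from the cached entry.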

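Experimental Results

Research Questions

RQ1: How can constrained decoding for CFG-based structured generation be made efficient in LLM inference?
RQ2: Which caching and data-structure designs most reduce runtime token checks for CFG constraints?
RQ3: To what extent can grammar context expansion and a persistent stack reduce checks on context-dependent tokens?
RQ4: To what extent can CPU-GPU overlap mitigate structured generation overhead in end-to-end LLM serving deployments?
RQ5: What end-to-end speedups can be achieved when integrating XGrammar with existing LLM frameworks?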

Key Findings

Task	Batch size	Constraint off	Constraint on
JSON Schema	1	6.2	6.3
JSON Schema	16	9.0	9.2
CFG (JSON)	1	6.3	6.3
CFG (JSON)	16	9.0	9.1
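To make the "near-zero overhead" reading of the table concrete, the relative overhead of turning constraints on can be computed row by row. The numbers are copied from the table above; the helper `relative_overhead_pct` is a hypothetical name, and because the ratio is unit-free the calculation does not depend on the unit of the reported values.

```python
# Rows copied from the table above: (task, batch size, constraint off, constraint on).
rows = [
    ("JSON Schema", 1, 6.2, 6.3),
    ("JSON Schema", 16, 9.0, 9.2),
    ("CFG (JSON)", 1, 6.3, 6.3),
    ("CFG (JSON)", 16, 9.0, 9.1),
]


def relative_overhead_pct(off: float, on: float) -> float:
    """Extra cost of enabling constraints, as a percentage of the unconstrained value."""
    return (on - off) / off * 100.0


overheads = {
    (task, batch): round(relative_overhead_pct(off, on), 1)
    for task, batch, off, on in rows
}

for (task, batch), pct in overheads.items():
    print(f"{task:12s} batch={batch:<3d} overhead={pct}%")
```

On these four rows the overhead comes out at roughly 1.6%, 2.2%, 0.0%, and 1.1%, so enabling constraints costs at most a few percent in every reported configuration.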

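XGrammar reduces per-token latency for CFG-constrained generation by up to 100x compared with state-of-the-art methods.
Combined with an LLM inference engine, XGrammar achieves up to 80x speedup in end-to-end structured generation on Llama-3.1 models.
The adaptive token mask cache reduces context-independent token checking time to under 40 μs per token.
With the JSON grammar on Llama-3.1, context expansion reduces context-dependent tokens by about 90%.
The persistent execution stack enables fast state branching and rollback, speeding up both preprocessing and token checks.
Overlapping mask generation with GPU inference yields near-zero overhead for structured generation in end-to-end serving.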
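The overlap of mask generation with GPU inference can be sketched with a worker thread. This is a simulation under stated assumptions, not the paper's serving integration: `fake_gpu_forward` and `fake_compute_mask` are hypothetical stand-ins that sleep instead of doing real work, and the logits and masks are made up. The point is only the ordering: the CPU mask computation is submitted first, the forward pass runs in the meantime, and the two are joined just before a token is chosen.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_gpu_forward(step):
    """Stand-in for the model forward pass; returns made-up logits."""
    time.sleep(0.02)
    return [0.1, 0.5, 0.2, 0.9]


def fake_compute_mask(step):
    """Stand-in for grammar mask generation on the CPU; returns a made-up mask."""
    time.sleep(0.01)
    if step % 2 == 0:
        return [True, True, False, False]
    return [True, False, True, False]


def decode_step(pool, step):
    mask_future = pool.submit(fake_compute_mask, step)  # CPU work starts now
    logits = fake_gpu_forward(step)                     # runs while the mask is computed
    mask = mask_future.result()                         # synchronize before choosing a token
    allowed = [i for i, ok in enumerate(mask) if ok]
    return max(allowed, key=lambda i: logits[i])        # greedy pick among allowed tokens


with ThreadPoolExecutor(max_workers=1) as pool:
    chosen = [decode_step(pool, step) for step in range(2)]
print(chosen)  # [1, 2]
```

As long as the mask computation finishes before the forward pass does, the grammar work adds nothing to the step time in this sketch, which is the intuition behind the near-zero overhead claim.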

Figure 2: Constrained decoding with per-token mask. The per-token mask prevents LLM from generating tokens that would be invalid according to the structure at that step.
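The per-token mask in Figure 2 is commonly applied by setting the logits of disallowed tokens to negative infinity before the softmax, so that they receive zero probability. The sketch below shows that generic technique in plain Python; the function names and the example logits are illustrative assumptions and are not taken from the paper.

```python
import math


def apply_token_mask(logits, mask):
    """Replace logits of disallowed tokens with -inf; `mask[i]` is True when token i is valid."""
    if not any(mask):
        raise ValueError("mask rejects every token")
    return [x if ok else -math.inf for x, ok in zip(logits, mask)]


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 5.0, 1.0]    # token 1 is the model's favorite...
mask = [True, False, True]  # ...but the structure forbids it at this step
probs = softmax(apply_token_mask(logits, mask))
best = max(range(len(probs)), key=lambda i: probs[i])
print(probs, best)
```

After masking, the forbidden token has probability exactly zero and the remaining probability mass is renormalized over the valid tokens, so both greedy decoding and sampling stay within the structure.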

This review was generated by AI and reviewed by human editors.