QUICK REVIEW

[Paper Review] XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Yixin Dong, Charlie F. Ruan | arXiv (Cornell University) | 2024-11-22

Topic Modeling · Citations: 7

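One-Line Summary

XGrammar is a flexible and efficient CFG-based structured generation engine for LLMs. It splits the vocabulary into context-independent and context-dependent tokens and combines an adaptive token mask cache, a persistent stack, context expansion, and CPU-GPU overlap to reach up to 100x speedup over prior methods.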

ABSTRACT

The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structure generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-independent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can generate near-zero overhead structure generation in end-to-end low-LLM serving.

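Motivation and Goals

Motivates the need for reliable structured outputs such as JSON, SQL, and DSLs in LLM inference.
Proposes a CFG-based constrained decoding engine that reduces the runtime overhead of structured output.
Develops techniques that precompute context-independent tokens and accelerate runtime checks.
Builds a persistent stack and context expansion to speed up validation of context-dependent tokens.
Demonstrates integration with LLM serving to achieve end-to-end speedups in constrained generation.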

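Proposed Method

Split the vocabulary into context-independent and context-dependent tokens at each pushdown automaton (PDA) position.
Precompute the validity of context-independent tokens and store it in an adaptive token mask cache.
Expand the grammar context (context expansion) to reduce the number of context-dependent tokens.
Implement a persistent execution stack that enables fast branching and rollback of PDA states.
Overlap mask generation with GPU-based LLM inference to minimize total overhead.
Co-design the grammar engine with the LLM serving engine to reach near-zero overhead in end-to-end generation.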
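
One way to picture a persistent execution stack is as an immutable linked stack: pushing allocates a single node that points at the previous top, so several parser branches can share a common prefix, and rolling back just means keeping a reference to an older top. The sketch below is only an illustration under that assumption. The names `StackNode`, `push`, `pop`, and `to_list` are hypothetical and do not come from the paper or from XGrammar's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StackNode:
    """One immutable stack frame (hypothetical); `parent` is the rest of the stack."""
    rule: str                              # illustrative grammar-position label
    parent: Optional["StackNode"] = None


def push(top: Optional[StackNode], rule: str) -> StackNode:
    # Pushing never mutates: it allocates one node that points at the old top.
    return StackNode(rule, top)


def pop(top: StackNode) -> Optional[StackNode]:
    # Popping returns the parent; the old top stays valid for other branches.
    return top.parent


def to_list(top: Optional[StackNode]) -> List[str]:
    # Walk from top to bottom, then reverse to get bottom-to-top order.
    out = []
    while top is not None:
        out.append(top.rule)
        top = top.parent
    return out[::-1]


base = push(push(None, "value"), "object")
branch_a = push(base, "string")    # explore one expansion
branch_b = push(base, "number")    # explore another, sharing `base`
rolled_back = pop(branch_a)        # rollback costs O(1)

print(to_list(branch_a))           # ['value', 'object', 'string']
print(to_list(branch_b))           # ['value', 'object', 'number']
print(rolled_back is base)         # True
```

Because nodes are never modified, branching costs one allocation and no copying, which is the property the method list attributes to the persistent stack.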

Figure 1: Overview of our approach. Our key insight is to divide the vocabulary into context-independent and context-dependent tokens at each position within the pushdown automaton. We precompute and cache the context-independent tokens in an adaptive token mask cache, which is then retrieved at run
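
The split described in the caption can be illustrated with a toy example. Take the automaton position "inside a JSON string literal": a token that stays inside the string, or ends exactly on the closing quote, can be judged without looking at the rest of the stack, while a token that closes the string and keeps going needs the parent context. The vocabulary, the `FOLLOW` table, and all function names below are assumptions made for this sketch. Escape sequences are not modeled, and neither is whatever makes the paper's cache adaptive.

```python
from typing import List, Set, Tuple

# Hypothetical mini-vocabulary of multi-character tokens.
VOCAB = ["abc", " world", "\n", '"', '",', '"}', '"]']

# Hypothetical follow sets: what may come right after a string in each parent context.
FOLLOW = {"object_value": {",", "}"}, "array_element": {",", "]"}}


def classify_in_string(token: str) -> str:
    """Classify a token at the position 'inside a string literal'."""
    for i, ch in enumerate(token):
        if ch == '"':
            # The string rule ends here; leftover characters need the parent context.
            return "accept" if i == len(token) - 1 else "context_dependent"
        if ord(ch) < 0x20:
            return "reject"        # raw control characters are never valid here
    return "accept"                # token stays entirely inside the string


def build_cache(vocab: List[str]) -> Tuple[Set[str], List[str]]:
    # Precompute once: accepted context-independent tokens, plus the
    # context-dependent tokens that must be rechecked at runtime.
    accepted, dependent = set(), []
    for tok in vocab:
        kind = classify_in_string(tok)
        if kind == "accept":
            accepted.add(tok)
        elif kind == "context_dependent":
            dependent.append(tok)
    return accepted, dependent


CACHE = build_cache(VOCAB)         # built before decoding starts


def allowed_tokens(parent: str) -> Set[str]:
    accepted, dependent = CACHE
    allowed = set(accepted)
    for tok in dependent:
        rest = tok[tok.index('"') + 1:]
        # Toy runtime check: exactly one trailing character, valid for this parent.
        if len(rest) == 1 and rest in FOLLOW[parent]:
            allowed.add(tok)
    return allowed


print(sorted(allowed_tokens("object_value")))
print(sorted(allowed_tokens("array_element")))
```

In this toy setup only three of the seven tokens need any runtime work, and the work depends on the parent context alone.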

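Experimental Results

Research Questions

RQ1: How can constrained decoding for CFG-based structured generation be made efficient in LLM inference?
RQ2: Which cache and data structure designs most effectively reduce runtime token checks for CFG constraints?
RQ3: To what extent can context expansion and the persistent stack reduce context-dependent token checks?
RQ4: How well does CPU-GPU overlap mitigate overhead for structured generation deployed in end-to-end LLM serving?
RQ5: What end-to-end speedup is achievable when integrating XGrammar into an existing LLM framework?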

Key Results

Task	Batch size	Constraints off	Constraints on
JSON Schema	1	6.2	6.3
JSON Schema	16	9.0	9.2
CFG (JSON)	1	6.3	6.3
CFG (JSON)	16	9.0	9.1
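
The table can be read as a relative overhead, computed as (on − off) / off for each row. The short script below does that arithmetic on the four rows exactly as listed. The review does not state the unit of these measurements, so only the ratio is computed, and the helper name `relative_overhead` is made up for this sketch.

```python
# Rows copied from the table: (task, batch size, constraints off, constraints on).
rows = [
    ("JSON Schema", 1, 6.2, 6.3),
    ("JSON Schema", 16, 9.0, 9.2),
    ("CFG (JSON)", 1, 6.3, 6.3),
    ("CFG (JSON)", 16, 9.0, 9.1),
]


def relative_overhead(off: float, on: float) -> float:
    # Fraction by which the constrained run exceeds the unconstrained one.
    return (on - off) / off


for task, batch, off, on in rows:
    print(f"{task:12s} batch={batch:<2d} overhead={relative_overhead(off, on):.1%}")
```

Every row comes out at roughly 2.2% or less, which fits the near-zero overhead claim.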

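XGrammar reduces per-token latency by up to 100x compared with current state-of-the-art methods for CFG-constrained generation.
Combined with an LLM inference engine, it achieves up to 80x speedup in end-to-end structured generation on Llama-3.1 models.
Adaptive token mask caching brings context-independent token checking time below 40 μs per token.
Context-dependent tokens are reduced by about 90% on Llama-3.1 with a JSON grammar.
The persistent execution stack enables fast state branching and rollback, speeding up both preprocessing and token checking.
Overlapping mask generation with GPU inference brings the overhead of structured generation in end-to-end serving close to zero.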
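
The overlap idea in the last point can be sketched with a worker thread: the CPU-side mask computation for a step is submitted first, the forward pass runs while it is in flight, and the two results are joined only at sampling time. Everything here is a stand-in. `compute_mask` and `forward_pass` simulate their cost with `time.sleep`, no real GPU or grammar engine is involved, and the function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

VOCAB_SIZE = 8


def compute_mask(step: int) -> list:
    """Stand-in for CPU grammar work: allow every other token id."""
    time.sleep(0.01)                                   # simulated CPU cost
    return [(i + step) % 2 == 0 for i in range(VOCAB_SIZE)]


def forward_pass(step: int) -> list:
    """Stand-in for the GPU forward pass: returns fake logits."""
    time.sleep(0.02)                                   # simulated GPU cost
    return [float((i * 3 + step) % VOCAB_SIZE) for i in range(VOCAB_SIZE)]


def pick(logits: list, mask: list) -> int:
    # Greedy choice among the token ids the mask allows.
    allowed = [i for i in range(VOCAB_SIZE) if mask[i]]
    return max(allowed, key=lambda i: logits[i])


def decode_serial(steps: int) -> list:
    return [pick(forward_pass(s), compute_mask(s)) for s in range(steps)]


def decode_overlapped(steps: int) -> list:
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for s in range(steps):
            mask_future = pool.submit(compute_mask, s)  # CPU work starts now
            logits = forward_pass(s)                    # "GPU" runs meanwhile
            out.append(pick(logits, mask_future.result()))
    return out


t0 = time.perf_counter()
serial = decode_serial(5)
t1 = time.perf_counter()
overlapped = decode_overlapped(5)
t2 = time.perf_counter()
print(serial == overlapped, f"serial={t1 - t0:.3f}s overlapped={t2 - t1:.3f}s")
```

Both loops produce the same tokens. The overlapped loop hides the mask cost whenever it is shorter than the forward pass, and in this simulation that is always the case.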

Figure 2: Constrained decoding with per-token mask. The per-token mask prevents LLM from generating tokens that would be invalid according to the structure at that step.
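
The per-token mask in the caption amounts to setting the logits of disallowed tokens to negative infinity before sampling, so that softmax gives them zero probability. A minimal pure-Python sketch follows. The four-token vocabulary, the logits, and the mask values are invented for illustration.

```python
import math


def apply_token_mask(logits, mask):
    """Set logits of disallowed tokens to -inf so softmax gives them zero mass."""
    return [x if ok else -math.inf for x, ok in zip(logits, mask)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]    # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["{", "}", "hello", '"']
logits = [1.0, 0.5, 3.0, 2.0]
mask = [True, False, False, True]   # hypothetical: only "{" and '"' are valid here

probs = softmax(apply_token_mask(logits, mask))
best = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
print(best)   # '"' is chosen, although "hello" had the highest raw logit
```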


This review was generated by AI and reviewed by a human editor.