QUICK REVIEW

[Paper Review] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

Chaofan Lin, Zhenhua Han | arXiv (Cornell University) | 2024-05-30

Digital Rights Management and Security | Citations: 5

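One-Line Summary

Parrot introduces the Semantic Variable to expose application-level information to public LLM services. This enables end-to-end optimization and yields up to roughly 11.7x speedup or 12x higher throughput for LLM-based applications.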

ABSTRACT

The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications. This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications.

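Research Motivation and Goals

Motivate the need for end-to-end optimization that goes beyond request-level metrics in LLM-based applications.
Introduce the Semantic Variable as a unified abstraction that exposes application structure to the LLM service.
Demonstrate how application-level knowledge enables cross-request data flow analysis and joint optimization.
Present scheduling and caching optimizations that reduce end-to-end latency and increase throughput.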

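Proposed Method

Define a Semantic Variable as a text region in a prompt that carries a semantic purpose and is used to connect multiple LLM requests.
Represent an LLM application as a DAG of Semantic Variables so that data dependencies are exposed and can be analyzed.
Implement a graph-based executor and a set of primitives (GetProducer, GetConsumers, PrefixHash) for inter-request analysis.
Develop an application-centric scheduler that groups requests by performance objective and maximizes prompt-prefix sharing.
Design GPU-efficient attention kernels and shared-prefix optimizations to reduce redundant computation.
Provide a universal engine abstraction (Fill, Generate, FreeContext) for integrating diverse LLM engines.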
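The relationship between the DAG and the GetProducer, GetConsumers, and PrefixHash primitives can be illustrated with a small sketch. This is not Parrot's implementation: the names Request, Dag, prefix_hash, and group_by_prefix are hypothetical, and hashing the static text before the first placeholder with SHA-256 is an assumption chosen only to make the idea concrete. The example builds a two-way map-reduce workflow, queries producers and consumers of a variable, and groups requests whose prompts begin with the same static prefix.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    """One LLM request (hypothetical model, not Parrot's data structure)."""
    name: str
    template: str        # prompt text with {placeholders} for Semantic Variables
    inputs: List[str]    # Semantic Variables this request reads
    output: str          # Semantic Variable this request writes


class Dag:
    """Links requests through the Semantic Variables they read and write."""

    def __init__(self) -> None:
        self.producer: Dict[str, Request] = {}
        self.consumers: Dict[str, List[Request]] = {}

    def add(self, req: Request) -> None:
        self.producer[req.output] = req
        for var in req.inputs:
            self.consumers.setdefault(var, []).append(req)

    def get_producer(self, var: str) -> Optional[Request]:
        # None means the variable is an application input, not an LLM output.
        return self.producer.get(var)

    def get_consumers(self, var: str) -> List[Request]:
        return self.consumers.get(var, [])


def prefix_hash(template: str) -> str:
    """Assumed scheme: hash the static text before the first placeholder."""
    prefix = template.split("{", 1)[0]
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def group_by_prefix(requests: List[Request]) -> List[List[str]]:
    """Group request names whose prompts share the same static prefix."""
    groups: Dict[str, List[str]] = {}
    for req in requests:
        groups.setdefault(prefix_hash(req.template), []).append(req.name)
    return list(groups.values())


SYSTEM = "You are a careful summarizer.\n"
map0 = Request("map0", SYSTEM + "Summarize: {chunk0}", ["chunk0"], "s0")
map1 = Request("map1", SYSTEM + "Summarize: {chunk1}", ["chunk1"], "s1")
reduce_req = Request("reduce", "Combine these summaries: {s0} {s1}", ["s0", "s1"], "final")

dag = Dag()
for r in (map0, map1, reduce_req):
    dag.add(r)

groups = group_by_prefix([map0, map1, reduce_req])
print(dag.get_producer("s0").name)                  # request that writes s0
print([r.name for r in dag.get_consumers("s0")])    # requests that read s0
print(groups)                                       # requests batched by shared prefix
```

In this sketch the two map requests land in the same group because they share the system prompt, which is the kind of commonality a scheduler could use to co-locate requests and reuse prefix computation.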

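Experimental Results

Research Questions

RQ1: How can application-level information be exposed to public LLM services to improve end-to-end performance?
RQ2: What abstraction (the Semantic Variable) enables effective analysis of data flow and prompt structure across multiple LLM requests?
RQ3: How can scheduling and KV-prefix sharing be used to optimize the latency and throughput of LLM-based workflows?
RQ4: What end-to-end speedup is achievable in real LLM applications when Parrot's Semantic Variable-based optimizations are applied?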

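Key Results

Parrot can achieve up to 11.7x speedup or 12x higher throughput compared with state-of-the-art solutions.
Semantic Variables enable inter-request analysis that reveals dependencies and commonalities across requests.
Application-centric scheduling and task grouping strike a better balance between latency and throughput in map-reduce-style workflows, reducing end-to-end latency.
Prefix-based sharing of prompts and optimized attention kernels reduce redundant computation and memory traffic.
Experimental evaluation on production and open-source LLM applications shows substantial end-to-end performance gains.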


This review was generated by AI and reviewed by a human editor.