QUICK REVIEW

[Paper Review] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

Chaofan Lin, Zhenhua Han|arXiv (Cornell University)|May 30, 2024
Digital Rights Management and Security|Cited by 5
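ONE-SENTENCE SUMMARY

Parrot introduces Semantic Variable, an abstraction that exposes application-level information to public LLM services. This enables end-to-end optimization and yields up to ~11.7x speedup or 12x higher throughput for LLM-based applications.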

ABSTRACT

The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications. This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications.

MOTIVATION AND OBJECTIVES

  • Motivate the need for end-to-end optimization in LLM-based applications beyond per-request metrics.
  • Introduce Semantic Variable as a unified abstraction to expose application structure to LLM services.
  • Demonstrate how application-level knowledge enables data-flow analysis and joint optimizations across requests.
  • Showcase scheduling and caching optimizations that reduce end-to-end latency and increase throughput.

PROPOSED METHOD

  • Define Semantic Variable as a text region in a prompt with a semantic purpose to connect multiple LLM requests.
  • Represent LLM applications as DAGs of Semantic Variables to reveal data dependencies and enable analysis.
  • Implement a graph-based executor and a set of primitives (GetProducer, GetConsumers, PrefixHash) for inter-request analysis.
  • Develop an application-centric scheduler that groups requests by performance objectives and maximizes prompt-prefix sharing.
  • Design a GPU-efficient attention kernel and shared-prefix optimization to reduce redundant computation.
  • Provide a universal engine abstraction (Fill, Generate, FreeContext) to integrate diverse LLM engines.
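
The DAG representation and the three inter-request primitives listed above can be illustrated with a minimal Python sketch. Only the primitive names (GetProducer, GetConsumers, PrefixHash) come from this review; the `SemanticVariable` and `Request` classes, their fields, and the use of SHA-256 over the static prompt prefix are assumptions made for illustration, not Parrot's actual API.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class SemanticVariable:
    """Hypothetical: a named text region produced by one request, read by others."""
    name: str
    producer: Optional["Request"] = field(default=None, repr=False)
    consumers: List["Request"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Request:
    """Hypothetical: one LLM call with a static prefix, inputs, and one output."""
    name: str
    prefix: str
    inputs: List[SemanticVariable]
    output: SemanticVariable

    def __post_init__(self):
        # Declaring a request wires the DAG edges through its variables.
        self.output.producer = self
        for var in self.inputs:
            var.consumers.append(self)


def get_producer(var: SemanticVariable) -> Optional[Request]:
    """Return the request that generates this variable, if any."""
    return var.producer


def get_consumers(var: SemanticVariable) -> List[Request]:
    """Return the requests that read this variable."""
    return list(var.consumers)


def prefix_hash(request: Request) -> str:
    """Assumed scheme: hash the static prefix so equal prefixes can be matched."""
    return hashlib.sha256(request.prefix.encode("utf-8")).hexdigest()


# A map-reduce summary: three map requests feed one reduce request.
chunks = [SemanticVariable(f"chunk{i}") for i in range(3)]
partials = [SemanticVariable(f"partial{i}") for i in range(3)]
final = SemanticVariable("final")

maps = [
    Request(f"map{i}", "Summarize the following text:", [chunks[i]], partials[i])
    for i in range(3)
]
reduce_req = Request("reduce", "Combine these summaries:", partials, final)
```

In this sketch, `get_consumers(partials[0])` returns the reduce request, which tells a scheduler that the map outputs are intermediate results rather than user-facing answers, and the three map requests share one `prefix_hash`, which marks them as candidates for prefix sharing.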
[Figure (a): Map-Reduce Summary]

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: How can application-level information be exposed to public LLM services to improve end-to-end performance?
  • RQ2: What abstractions (Semantic Variable) enable effective data-flow and prompt-structure analyses across multiple LLM requests?
  • RQ3: How can scheduling and KV-prefix sharing be leveraged to optimize latency and throughput for LLM-based workflows?
  • RQ4: What end-to-end speedups are achievable on real-world LLM applications when applying Parrot’s semantic-variable based optimizations?

KEY FINDINGS

  • Parrot can achieve up to 11.7x speedup or 12x higher throughput compared with state-of-the-art solutions.
  • Semantic Variables enable just-in-time inter-request analysis that uncovers dependencies and commonalities across requests.
  • Application-centric scheduling and task-grouping reduce end-to-end latency by better balancing latency and throughput across map/reduce-style workflows.
  • Sharing common prompt prefixes across requests, combined with optimized attention kernels, reduces redundant computation and memory traffic.
  • Empirical evaluations on production and open-source LLM applications demonstrate substantial end-to-end performance gains.
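
To make the prefix-sharing finding concrete, the following is a rough sketch of how grouping requests by a common prompt prefix reduces prefill work. It is not Parrot's scheduler or attention kernel: the function names, the `(request_id, prefix, suffix)` tuple layout, and the whitespace-based token count (a stand-in for a real tokenizer) are all assumptions for illustration.

```python
from collections import defaultdict


def group_by_prefix(requests):
    """Group (request_id, prefix, suffix) tuples by their exact prefix string."""
    groups = defaultdict(list)
    for request_id, prefix, suffix in requests:
        groups[prefix].append((request_id, suffix))
    return dict(groups)


def prefill_tokens(requests, share_prefix):
    """Count prompt tokens to prefill, using whitespace splitting as a proxy."""
    if not share_prefix:
        # Every request pays for its full prompt independently.
        return sum(len(p.split()) + len(s.split()) for _, p, s in requests)
    total = 0
    for prefix, members in group_by_prefix(requests).items():
        total += len(prefix.split())  # the shared prefix is processed once
        total += sum(len(suffix.split()) for _, suffix in members)
    return total


SYSTEM = "You are a helpful assistant."
requests = [
    ("r1", SYSTEM, "What is a DAG?"),
    ("r2", SYSTEM, "Define KV cache."),
    ("r3", SYSTEM, "Explain batching."),
    ("r4", "Translate to French:", "Good morning."),
]

print(prefill_tokens(requests, share_prefix=False))  # 29
print(prefill_tokens(requests, share_prefix=True))   # 19
```

The three requests that share the system prompt pay for its 5 tokens once instead of three times, so the total drops from 29 to 19 in this toy workload. The saving grows with the length of the shared prefix and the number of requests in each group.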
[Figure (b): Chain Summary]


This review was generated by AI and reviewed by a human editor.