QUICK REVIEW

[Paper Review] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia|arXiv (Cornell University)|Mar 4, 2024

Natural Language Processing Techniques|Cited by 14

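One-Sentence Summary

Sarathi-Serve introduces stall-free batching with chunked prefills to raise LLM inference throughput while keeping time-between-tokens latency low. The abstract reports up to 2.6x higher serving capacity for Mistral-7B and up to 3.7x for Yi-34B compared with vLLM, and up to 5.6x for Falcon-180B with pipeline parallelism.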

ABSTRACT

Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.

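Motivation and Objectives

Ease the throughput-latency tradeoff in online LLM serving.
Develop a stall-free scheduler that lets new requests join a running batch without delaying ongoing decodes.
Use chunked prefills to keep decode batches large while bounding the latency of each scheduling iteration.
Evaluate throughput and latency across models and hardware configurations.
Quantify the effect of chunked prefills on the key performance metrics: time to first token (TTFT), time between tokens (TBT), and serving capacity.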
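
To make the two latency metrics concrete, here is a minimal sketch of how TTFT and TBT could be computed from per-token timestamps. The function names and the timestamp format (seconds, one entry per output token) are assumptions for illustration and are not taken from the paper's code.

```python
def ttft(arrival_time, token_times):
    """Time to first token: first output timestamp minus request arrival."""
    return token_times[0] - arrival_time


def tbt(token_times):
    """Time between tokens: gaps between consecutive output timestamps."""
    return [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]


# Hypothetical request: arrives at t=0.0 s and emits four tokens.
times = [0.5, 0.75, 1.0, 2.0]
print(ttft(0.0, times))   # 0.5
print(tbt(times))         # [0.25, 0.25, 1.0]
print(max(tbt(times)))    # 1.0, the kind of stall a tail-latency SLO on TBT would flag
```

A tail-latency SLO would typically be checked against a high percentile of these TBT values across many requests rather than against a single maximum.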

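Proposed Method

Classify existing schedulers as prefill-prioritizing or decode-prioritizing and identify the pitfalls of each.
Introduce Sarathi-Serve, a stall-free, iteration-level scheduler built on chunked prefills and on coalescing prefill chunks with decodes.
Use chunked prefills to split a long prefill across several iterations, coalescing each chunk with ongoing decodes.
Define a token budget for each scheduling iteration to bound latency and maximize throughput.
Adopt a hybrid batching strategy that processes prefill chunks opportunistically while decodes keep making progress.
Evaluate on several models (Mistral-7B, Yi-34B, LLaMA2-70B, Falcon-180B), GPUs (A100, A40), and real-world traces.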
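
The interaction between the token budget, chunked prefills, and ongoing decodes can be sketched as follows. This is a simplified illustration under stated assumptions, not the Sarathi-Serve implementation: the `Request` class, the `build_batch` function, and the scheduling order (decodes first, then in-progress prefills, then new requests) are hypothetical, and the sketch ignores KV-cache memory limits and request completion.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    req_id: str
    prompt_len: int      # total prompt tokens to prefill
    prefilled: int = 0   # prompt tokens processed so far

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_batch(running, waiting, token_budget):
    """Return one iteration's hybrid batch as (req_id, num_tokens) pairs."""
    batch = []
    budget = token_budget

    # 1. Schedule every ongoing decode (one token each) so decodes never stall.
    for req in running:
        if req.in_decode and budget > 0:
            batch.append((req.req_id, 1))
            budget -= 1

    # 2. Continue prefills already in progress, capped by the remaining budget.
    for req in running:
        if not req.in_decode and budget > 0:
            chunk = min(req.prompt_len - req.prefilled, budget)
            req.prefilled += chunk
            batch.append((req.req_id, chunk))
            budget -= chunk

    # 3. Admit new requests only while budget remains.
    while waiting and budget > 0:
        req = waiting.pop(0)
        chunk = min(req.prompt_len, budget)
        req.prefilled += chunk
        running.append(req)
        batch.append((req.req_id, chunk))
        budget -= chunk

    return batch


# Request "a" is already decoding; "b" arrives with a 10-token prompt.
running = [Request("a", prompt_len=4, prefilled=4)]
waiting = [Request("b", prompt_len=10)]
print(build_batch(running, waiting, token_budget=8))  # [('a', 1), ('b', 7)]
print(build_batch(running, waiting, token_budget=8))  # [('a', 1), ('b', 3)]
print(build_batch(running, waiting, token_budget=8))  # [('a', 1), ('b', 1)]
```

In this toy run, request "a" produces a token in every iteration while the prompt of "b" is absorbed in two chunks, which is the stall-free behavior the list above describes. The budget of 8 tokens is arbitrary; a real deployment would pick it from the latency target.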

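Experimental Results

Research Questions

RQ1: How much serving capacity does Sarathi-Serve provide under different SLOs compared with state-of-the-art schedulers?
RQ2: How much overhead do chunked prefills add in latency and KV-cache access?
RQ3: How does stall-free batching affect TBT and TTFT compared with prefill-prioritizing or decode-prioritizing schemes?
RQ4: How do chunked prefills and stall-free batching perform across model sizes and hardware configurations?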

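Key Findings

Under both strict and relaxed SLOs, Sarathi-Serve consistently outperforms Orca and vLLM across models and workloads.
On a single A100, Mistral-7B achieves up to 2.6x higher serving capacity with Sarathi-Serve.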
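Yi-34B achieves up to 3.7x higher serving capacity on two A100 GPUs compared with vLLM.
Falcon-180B, served with pipeline parallelism on 8 A100 GPUs, achieves up to 5.6x higher end-to-end serving capacity with Sarathi-Serve.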
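Chunked prefills bound the latency increase and raise decode batch throughput, reducing stalls during generation.
Overall, stall-free batching keeps throughput high while minimizing TBT spikes; TTFT may incur slight overhead from chunking.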

This review was generated by AI and reviewed by human editors.