QUICK REVIEW

[Paper Review] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia|arXiv (Cornell University)|Mar 4, 2024

Natural Language Processing Techniques | Cited by 14

One-Line Summary

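Sarathi-Serve introduces stall-free scheduling with chunked-prefill batching, raising LLM inference throughput while keeping inter-token latency low. Per the abstract, it delivers serving capacity gains of up to 2.6x for Mistral-7B and 3.7x for Yi-34B over vLLM, and up to 5.6x for Falcon-180B with pipeline parallelism.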

ABSTRACT

Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
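The chunked-prefill idea in the abstract, splitting one prompt into near equal sized chunks, can be illustrated with a minimal sketch. The function name `split_prefill` and the `max_chunk` parameter are hypothetical and chosen only for illustration; this is not the code from the Sarathi-Serve repository, where chunk sizes may instead follow from whatever room is left in a batch.

```python
def split_prefill(prompt_len: int, max_chunk: int) -> list[int]:
    """Split a prompt of `prompt_len` tokens into near-equal chunks.

    Every chunk has at most `max_chunk` tokens, and chunk sizes differ
    by at most one token. (Illustrative assumption, not the paper's code.)
    """
    if prompt_len <= 0 or max_chunk <= 0:
        raise ValueError("prompt_len and max_chunk must be positive")
    num_chunks = -(-prompt_len // max_chunk)      # ceiling division
    base, extra = divmod(prompt_len, num_chunks)  # spread the remainder
    return [base + 1] * extra + [base] * (num_chunks - extra)


if __name__ == "__main__":
    print(split_prefill(1000, 512))  # [500, 500]
    print(split_prefill(1030, 512))  # [344, 343, 343]
```

Each chunk is then processed in its own iteration, so no single iteration has to carry the full cost of a long prompt.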

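Motivation and Objectives

Ease the throughput-latency tradeoff in online LLM serving.
Develop a stall-free scheduler that lets new requests join a running batch without delaying ongoing decodes.
Use chunked prefills to maximize utilization of decode batches while keeping per-iteration latency bounded.
Evaluate throughput and latency across a range of models and hardware configurations.
Quantify the effect of chunked prefills on the key performance metrics: TTFT, TBT, and serving capacity.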

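Proposed Method

Classify existing schedulers as prefill-prioritizing or decode-prioritizing and identify the drawbacks of each.
Propose Sarathi-Serve, a stall-free, iteration-level scheduler built around chunked prefills and coordination with ongoing decodes.
Split long prefills into chunks spread across iterations and coalesce each chunk with ongoing decodes.
Define a token budget for each scheduling iteration to bound latency while maximizing throughput.
Adopt a hybrid batching strategy that keeps decodes progressing and opportunistically processes prefill chunks.
Evaluate on real-world traces with several models (Mistral-7B, Yi-34B, LLaMA2-70B, Falcon-180B) and hardware (A100 and A40 GPUs).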
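The scheduling steps listed above can be sketched as a single function that assembles one hybrid batch per iteration under a token budget. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `Request`, `build_batch`, and `token_budget` are hypothetical, and KV-cache capacity, preemption, and request completion are ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_batch(running: list, waiting: deque, token_budget: int) -> list:
    """Return [(req_id, num_tokens), ...] for one scheduling iteration."""
    batch, used = [], 0

    # 1. Ongoing decodes are never paused: each contributes one token.
    for req in running:
        if req.in_decode:
            batch.append((req.req_id, 1))
            used += 1

    # 2. Continue partially prefilled requests with the leftover budget.
    for req in running:
        if not req.in_decode and used < token_budget:
            chunk = min(req.prompt_len - req.prefilled, token_budget - used)
            req.prefilled += chunk
            batch.append((req.req_id, chunk))
            used += chunk

    # 3. Admit new requests only while budget remains, chunking their prefill.
    while waiting and used < token_budget:
        req = waiting.popleft()
        chunk = min(req.prompt_len, token_budget - used)
        req.prefilled = chunk
        running.append(req)
        batch.append((req.req_id, chunk))
        used += chunk

    return batch


if __name__ == "__main__":
    running = [Request("a", 10, 10), Request("b", 100, 40)]
    waiting = deque([Request("c", 300)])
    print(build_batch(running, waiting, 128))  # [('a', 1), ('b', 60), ('c', 67)]
    print(build_batch(running, waiting, 128))  # [('a', 1), ('b', 1), ('c', 126)]
```

Because decodes are placed first and prefill work only fills what is left of the budget, the token count of every iteration stays bounded, which is what keeps inter-token latency from spiking when a long prompt arrives.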

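Experimental Results

Research Questions

RQ1: Under various SLOs, how much serving capacity does Sarathi-Serve provide compared with state-of-the-art schedulers?
RQ2: How much latency and KV-cache access overhead do chunked prefills introduce?
RQ3: How does stall-free batching affect TTFT and TBT compared with prefill-prioritizing and decode-prioritizing schedulers?
RQ4: How does the performance of chunked prefills and stall-free batching vary with model size and hardware configuration?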

Key Findings

Sarathi-Serve consistently outperforms Orca and vLLM across models and workloads, under both strict and relaxed SLOs.
On a single A100, Mistral-7B reaches up to 2.6x higher serving capacity with Sarathi-Serve.
On two A100 GPUs, Yi-34B reaches up to 3.7x higher serving capacity than with vLLM, as reported in the abstract.
With pipeline parallelism, Falcon-180B gains up to 5.6x in end-to-end serving capacity, as reported in the abstract.
Chunked prefills limit latency growth, raise decode batch throughput, and reduce generation stalls.
Overall, stall-free batching minimizes TBT spikes while sustaining high throughput, although chunking can add a small overhead to TTFT.
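The two latency metrics in these findings can be computed from per-token timestamps. The sketch below assumes the usual definitions, TTFT as the time from request arrival to the first output token and TBT as the gap between consecutive output tokens; the function names and the example timestamps are hypothetical and not taken from the paper's evaluation.

```python
def ttft(arrival: float, token_times: list[float]) -> float:
    """Time to first token: first output timestamp minus arrival time."""
    return token_times[0] - arrival


def tbt(token_times: list[float]) -> list[float]:
    """Time between tokens: gaps between consecutive output timestamps."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


if __name__ == "__main__":
    # Hypothetical timestamps in seconds; the last gap models a generation stall.
    times = [0.5, 0.75, 1.0, 2.0]
    print(ttft(0.0, times))  # 0.5
    print(tbt(times))        # [0.25, 0.25, 1.0]
    print(max(tbt(times)))   # 1.0
```

A stall caused by an interleaved long prefill shows up as one outlier in the TBT list, which is why tail TBT rather than mean TBT is the natural target for an SLO.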

This review was generated by AI and checked by a human editor.