QUICK REVIEW

[Paper Review] Fast Distributed Inference Serving for Large Language Models

Bingyang Wu, Yinmin Zhong|arXiv (Cornell University)|May 10, 2023

Topic Modeling | Cited by 21

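One-Sentence Summary

FastServe introduces token-level preemptive scheduling with a skip-join Multi-Level Feedback Queue (skip-join MLFQ) to reduce job completion time (JCT) for LLM inference, complemented by proactive GPU memory management and support for distributed execution. It improves average and tail JCT by up to 5.1x and 6.4x, respectively, compared with Orca.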

ABSTRACT

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.

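Motivation and Objectives

Achieve low job completion time (JCT) for interactive LLM applications such as ChatGPT.
Address head-of-line blocking in LLM serving systems that use run-to-completion execution.
Design a preemptive scheduler that operates at per-token granularity, matching the autoregressive generation of LLMs.
Develop efficient GPU memory management for the KV cache to cope with memory constraints.
Support distributed LLM serving across multiple GPUs through tensor parallelism and pipeline parallelism.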

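Proposed Method

Propose a skip-join Multi-Level Feedback Queue (skip-join MLFQ) scheduler that assigns each job's initial priority from its first-iteration time and preempts at iteration granularity, that is, after each output token.
Extend MLFQ with the skip-join feature, which reduces demotions by exploiting the semi-information-agnostic setting: the input length is known and determines the first-iteration time.
Manage the key-value (KV) cache proactively: offload inactive KV tensors to host memory and upload the tensors that will be needed back into GPU memory ahead of time, with the scheduler's decisions guiding what to offload and upload.
Overlap computation with KV data transfers by using asynchronous memory operations and pipelining.
Support distributed LLM serving through tensor parallelism and pipeline parallelism, with KV cache management distributed across the GPUs.
Prototype FastServe on NVIDIA FasterTransformer and evaluate it against Orca at GPT-3 175B scale on 16 A100 GPUs.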
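The skip-join idea can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not FastServe's implementation: the class and field names (`Job`, `SkipJoinMLFQ`, `first_iter_time`, `decode_iter_time`), the four levels, and the quantum that doubles per level are all hypothetical choices made for illustration, and starvation handling is left out. A job joins the highest-priority queue whose quantum covers its first-iteration time, skipping the queues above it, and the scheduler re-decides after every generated token.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    first_iter_time: float   # prefill cost; assumed predictable from input length
    decode_iter_time: float  # cost of each later token (assumed constant here)
    output_tokens: int       # hidden from the scheduler; only the simulation uses it
    generated: int = 0
    used_in_level: float = 0.0  # service time consumed at the current level


class SkipJoinMLFQ:
    def __init__(self, num_levels=4, base_quantum=1.0):
        # Assumed policy: the quantum doubles at each lower-priority level.
        self.quanta = [base_quantum * 2 ** i for i in range(num_levels)]
        self.queues = [deque() for _ in range(num_levels)]

    def initial_level(self, job):
        # Skip-join: enter the first level whose quantum covers the first
        # iteration, instead of always starting at the top level.
        for level, quantum in enumerate(self.quanta):
            if job.first_iter_time <= quantum:
                return level
        return len(self.quanta) - 1

    def submit(self, job):
        self.queues[self.initial_level(job)].append(job)

    def step(self):
        """Run one iteration (one token) of the highest-priority job.

        Returns (job name, finished flag), or None when every queue is empty.
        """
        for level, queue in enumerate(self.queues):
            if not queue:
                continue
            job = queue.popleft()
            elapsed = job.first_iter_time if job.generated == 0 else job.decode_iter_time
            job.generated += 1
            job.used_in_level += elapsed
            finished = job.generated >= job.output_tokens
            if not finished:
                is_last = level == len(self.queues) - 1
                if job.used_in_level >= self.quanta[level] and not is_last:
                    # Quantum exhausted: demote one level.
                    job.used_in_level = 0.0
                    self.queues[level + 1].append(job)
                else:
                    # Quantum left: stay at the head of the same queue.
                    queue.appendleft(job)
            return job.name, finished
        return None


sched = SkipJoinMLFQ()
sched.submit(Job("long", first_iter_time=3.0, decode_iter_time=0.5, output_tokens=10))
sched.submit(Job("short", first_iter_time=0.5, decode_iter_time=0.5, output_tokens=2))
print([sched.step()[0] for _ in range(3)])  # ['short', 'short', 'long']
```

Although the long-prompt job arrives first, it joins the third queue directly, so the short job runs to completion before it. Because `step` rescans from the top queue on every call, a newly arrived high-priority job preempts a running one at the next token boundary.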

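Experimental Results

Research Questions

RQ1: How should LLM inference be scheduled to minimize JCT, given autoregressive, token-by-token generation?
RQ2: In the semi-information-agnostic setting, where the input length is known but the total output length is not, can the skip-join MLFQ scheduler effectively approximate SRPT (Shortest Remaining Processing Time)?
RQ3: Which KV cache management strategies best balance GPU memory usage against preemption overhead in LLM inference?
RQ4: How can FastServe use tensor parallelism and pipeline parallelism to scale distributed LLM inference across multiple GPUs without incurring high memory or bandwidth costs?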
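To see why SRPT is the target in these questions, a small worked example helps. The sketch below compares average JCT under run-to-completion in arrival order with shortest-first order. The three service times are made-up numbers, and all jobs are assumed to arrive at time zero, in which case SRPT reduces to shortest-job-first; `average_jct` is a hypothetical helper, not something from the paper.

```python
from itertools import accumulate


def average_jct(service_times):
    # With all jobs arriving at time 0 and run back to back, each job's
    # completion time is the running sum of service times up to that job.
    completion_times = list(accumulate(service_times))
    return sum(completion_times) / len(completion_times)


arrival_order = [10.0, 1.0, 1.0]  # one long job ahead of two short ones

fcfs = average_jct(arrival_order)          # completions 10, 11, 12
srpt = average_jct(sorted(arrival_order))  # completions 1, 2, 12

print(fcfs, srpt)  # 11.0 5.0
```

The long job at the head of the line delays both short jobs under run-to-completion, which is the head-of-line blocking that preemptive scheduling aims to avoid. An LLM server cannot sort by total length because output lengths are unknown, which is why an approximation such as MLFQ is needed.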

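Key Findings

Experiments show that FastServe's average JCT is up to 5.1x lower than Orca's.
Experiments show that FastServe's tail (percentile) JCT is up to 6.4x lower than Orca's.
By using first-iteration time as a predictor, the skip-join MLFQ scheduler approximates SRPT more closely than naive MLFQ.
Proactive KV cache swapping relieves memory pressure, so more concurrent jobs can be scheduled without running out of GPU memory.
Distributed execution through tensor parallelism and pipeline parallelism lets very large models such as GPT-3 175B scale across multiple GPUs.
Results show improved end-to-end performance on GPT-3 175B workloads running on 16 NVIDIA A100 GPUs.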
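The scheduler-guided KV cache swapping can be sketched as a planning step that runs before each scheduling round. This is a simplified illustration under stated assumptions, not the paper's policy: `plan_swaps`, its parameters, and the greedy "keep the highest-priority prefix that fits" rule are hypothetical, sizes are counted in tokens instead of bytes, and the actual transfers (which the summary describes as asynchronous and overlapped with computation) are not modeled.

```python
def plan_swaps(priority_order, kv_tokens, on_gpu, gpu_capacity_tokens):
    """Decide which jobs' KV caches to offload to host and which to upload.

    priority_order: job ids, soonest-to-run first (as given by the scheduler)
    kv_tokens: job id -> number of cached tokens for that job
    on_gpu: set of job ids whose KV cache currently resides in GPU memory
    gpu_capacity_tokens: assumed GPU budget for KV cache, in tokens
    """
    keep = set()
    used = 0
    for job in priority_order:
        need = kv_tokens[job]
        if used + need > gpu_capacity_tokens:
            break  # strict priority: stop at the first job that does not fit
        keep.add(job)
        used += need

    # Offload resident caches that fall outside the kept set;
    # upload kept caches that are not yet resident.
    to_offload = [j for j in priority_order if j in on_gpu and j not in keep]
    to_upload = [j for j in priority_order if j in keep and j not in on_gpu]
    return to_offload, to_upload


offload, upload = plan_swaps(
    priority_order=["a", "b", "c"],
    kv_tokens={"a": 400, "b": 300, "c": 500},
    on_gpu={"b", "c"},
    gpu_capacity_tokens=800,
)
print(offload, upload)  # ['c'] ['a']
```

Planning ahead of the scheduling round is what makes the swapping proactive: job "a" has its cache uploaded before it is picked to run, and job "c", which will not run soon, gives up its GPU memory first.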

This review was generated by AI and reviewed by a human editor.