QUICK REVIEW

[Paper Review] Vidur: A Large-Scale Simulation Framework For LLM Inference

Amey Agrawal, Nitin Kedia|arXiv (Cornell University)|May 8, 2024

Simulation Techniques and Applications | Cited by 8

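One-Sentence Summary

Vidur is a high-fidelity LLM inference simulator that, together with Vidur-Bench and Vidur-Search, enables low-cost exploration of deployment configurations across models, hardware, and workloads.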

ABSTRACT

Optimizing the deployment of Large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easily-extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration which would require 42K GPU hours - costing ~218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.

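Motivation and Objectives

Motivate the need for low-cost exploration of LLM deployment options, given the large configuration space spanning parallelism, batching, and scheduling.
Introduce Vidur as a high-fidelity simulator that can analyze and predict end-to-end LLM inference performance across models, hardware, and workloads.
Provide Vidur-Bench for benchmarking workload patterns and policies, and Vidur-Search for optimizing deployment configurations under performance constraints.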

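Proposed Method

Decompose LLMs into a small number of token-level, sequence-level, and communication-level operators, and profile a minimal set of input sizes to build predictive runtime estimators.
Profile the token-level, sequence-level, and communication operators to create per-operator runtime models that can interpolate to unprofiled inputs.
Use a runtime estimator based on random forest regression to predict kernel runtimes from limited profiling data.
Use a pluggable hierarchical scheduler with global, replica, and replica-stage components to simulate batching, memory management, and scheduling policies.
Introduce Vidur-Bench as an extensible workload suite covering a variety of patterns, schedulers, and serving frameworks, for fidelity evaluation and benchmarking.
Implement Vidur-Search, which maximizes QPS per dollar by running a binary search for the maximum sustainable QPS of each deployment configuration and workload.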
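
The search step described in the last item can be sketched roughly as follows. This is a minimal illustration, not Vidur-Search's actual code: `DeploymentConfig`, `toy_meets_slo`, the `capacity_qps` field, the prices, and the search ceiling of 64 QPS are all hypothetical. In particular, `toy_meets_slo` stands in for a full simulator run, and "QPS per dollar" is assumed here to mean sustainable QPS divided by the hourly cost of the deployment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class DeploymentConfig:
    """Hypothetical deployment description; every field is illustrative."""
    name: str
    num_gpus: int
    dollars_per_gpu_hour: float
    capacity_qps: float  # stand-in for what a simulator run would reveal

    @property
    def dollars_per_hour(self) -> float:
        return self.num_gpus * self.dollars_per_gpu_hour


def toy_meets_slo(config: DeploymentConfig, qps: float) -> bool:
    """Placeholder for one simulator run: True if the SLO holds at this load."""
    return qps <= config.capacity_qps


def max_sustainable_qps(meets_slo: Callable[[float], bool],
                        hi: float = 64.0, tol: float = 0.01) -> float:
    """Binary search for the largest QPS in [0, hi] that still meets the SLO.

    Assumes feasibility is monotone: once a load violates the SLO,
    every higher load violates it too.
    """
    if meets_slo(hi):
        return hi  # the search ceiling itself is sustainable
    lo = 0.0  # zero load is treated as trivially feasible
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if meets_slo(mid):
            lo = mid
        else:
            hi = mid
    return lo


def best_config(configs: List[DeploymentConfig]) -> Tuple[DeploymentConfig, float]:
    """Return the config with the highest QPS per dollar-per-hour, and its score."""
    scored = []
    for cfg in configs:
        qps = max_sustainable_qps(lambda q, c=cfg: toy_meets_slo(c, q))
        scored.append((cfg, qps / cfg.dollars_per_hour))
    return max(scored, key=lambda pair: pair[1])


# Made-up candidates: the larger deployment serves more load but costs twice as much.
CONFIGS = [
    DeploymentConfig("tp4", num_gpus=4, dollars_per_gpu_hour=2.0, capacity_qps=6.0),
    DeploymentConfig("tp8", num_gpus=8, dollars_per_gpu_hour=2.0, capacity_qps=10.0),
]

if __name__ == "__main__":
    winner, score = best_config(CONFIGS)
    print(winner.name, round(score, 3))
```

Each binary search costs a logarithmic number of feasibility checks. That is what makes the approach practical when every check is a simulation rather than a run on real GPUs.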

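Experimental Results

Research Questions

RQ1: Can Vidur accurately predict end-to-end LLM inference performance across different models, parallelization strategies, and workload traces?
RQ2: How do changes in the workload affect key performance metrics of LLM inference, such as latency and throughput?
RQ3: Can Vidur-Search identify cost-effective deployment configurations that meet a specified SLO for a given workload and hardware?
RQ4: How does Vidur's prediction fidelity on dynamic online workloads compare with offline, static scenarios?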

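Key Findings

Vidur predicts request-level LLM inference performance with less than 9% error across a range of models, hardware, and traces.
Vidur-Bench shows that workload characteristics, such as the number of input and decode tokens and the batch size, significantly affect the output metrics.
Vidur-Search finds near-optimal deployment configurations far faster and more cheaply than hardware-based exploration. For LLaMA2-70B, for example, it takes one hour on a CPU machine, versus 42K GPU hours costing about $218K.
Profiling focuses on a small number of operator classes (token-level, sequence-level, communication), which lets the predictions scale across models.
The framework shows high fidelity when simulating cluster-level metrics for large-scale workloads and traces.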
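
As a quick sanity check on the LLaMA2-70B cost comparison, the two figures quoted in the abstract imply a per-GPU-hour price. The short calculation below uses only those two numbers, both of which are rounded in the source, so the result is approximate.

```python
gpu_hours = 42_000        # "42K GPU hours" from the abstract
total_cost_usd = 218_000  # "~218K dollars" from the abstract (approximate)

# Price per GPU hour implied by the two quoted figures.
implied_rate = total_cost_usd / gpu_hours
print(f"implied price: ${implied_rate:.2f} per GPU hour")  # prints 5.19
```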

This review was generated by AI and reviewed by human editors.