QUICK REVIEW

[Paper Review] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Jovan Stojkovic, Chaojie Zhang | arXiv (Cornell University) | Aug 1, 2024

Advanced Data Storage Technologies | Cited by 5

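One-Sentence Summary

DynamoLLM is an energy-management framework that dynamically reconfigures LLM inference clusters, using per-request-type pools, model parallelism, and GPU frequency, to reduce energy consumption, carbon emissions, and cost while meeting SLOs.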

ABSTRACT

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs causing the inference clusters to consume large amount of energy and, consequently, result in excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads, to significantly improve energy-efficiency. However, such a diverse and dynamic environment creates a large search-space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service's performance SLOs. We show that at a service-level, DynamoLLM conserves 53% energy and 38% operational carbon emissions, and reduces 61% cost to the customer, while meeting the latency SLOs.

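Research Motivation and Goals

Highlight the energy inefficiency of today's LLM inference clusters, which run on power-hungry GPUs.
Characterize heterogeneity and workload fluctuations in LLM inference to identify optimization opportunities.
Design an automatic, dynamic energy-management framework (DynamoLLM) that selects energy-efficient configurations under SLO constraints.
Enable frequent, low-overhead reconfiguration that adapts to changing demand without sacrificing quality of service.
Demonstrate scalability and effectiveness on real production traces from a large cloud provider.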

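Proposed Method

Profile LLM energy and performance across multiple models, request lengths, parallelism degrees (TP2/TP4/TP8), and GPU frequencies.
Formulate energy minimization under SLOs as a mixed-integer linear program (MILP) that selects the number of instances, the parallelism degree, and the GPU frequency.
Decompose the optimization into a hierarchy of controllers (cluster, pool, instance) that operate at different time scales.
Maintain per-request-type pools to reduce fragmentation and exploit heterogeneity in input/output lengths and model characteristics.
Model reconfiguration overheads and apply low-overhead reconfiguration techniques (caching, background resource provisioning, NVLink transfers).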
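
The configuration choice described above (number of instances, parallelism degree, GPU frequency) can be illustrated with a small sketch. This is not the paper's MILP: it is a brute-force search over a tiny table, and every name and number in it (`Profile`, `PROFILES`, `pick_config`, the latency, throughput, and power values) is a hypothetical placeholder rather than a measurement from the paper. It also assumes that each instance draws constant power at a given setting, so minimizing total power at steady load stands in for minimizing energy.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Profile:
    """Hypothetical profiled behaviour of one instance at a (TP, frequency) setting."""
    latency_s: float  # request latency at the profiled load
    max_rps: float    # sustainable requests per second for one instance
    power_w: float    # average power draw of one instance


# Illustrative numbers only, NOT measurements from the paper.
# Keys are (tensor-parallel degree, GPU frequency in MHz).
PROFILES = {
    (2, 1200): Profile(0.90, 2.0, 500.0),
    (2, 1980): Profile(0.60, 3.0, 800.0),
    (4, 1200): Profile(0.55, 4.0, 950.0),
    (4, 1980): Profile(0.40, 6.0, 1500.0),
    (8, 1200): Profile(0.35, 7.0, 1800.0),
    (8, 1980): Profile(0.25, 10.0, 2900.0),
}


def pick_config(load_rps, slo_s, gpu_budget):
    """Return the lowest-power feasible configuration, or None if none exists."""
    best = None
    for (tp, freq), prof in PROFILES.items():
        if prof.latency_s > slo_s:
            continue  # this setting violates the latency SLO
        instances = ceil(load_rps / prof.max_rps)  # enough instances for the load
        if instances * tp > gpu_budget:
            continue  # not enough GPUs for this many instances
        total_power = instances * prof.power_w
        candidate = (total_power, instances, tp, freq)
        if best is None or candidate < best:
            best = candidate
    if best is None:
        return None
    total_power, instances, tp, freq = best
    return {"instances": instances, "tp": tp, "freq_mhz": freq, "power_w": total_power}


print(pick_config(load_rps=12.0, slo_s=0.6, gpu_budget=32))
```

With these placeholder numbers, a relaxed SLO lets the search settle on a mid-sized, low-frequency configuration, while a tight SLO forces the largest and fastest one. Exhaustive search is only practical for a table this small; an MILP solver is one way to handle the much larger space that arises with many pools and instances.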

Experimental Results

Research Questions

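RQ1: How heterogeneous are the energy-performance characteristics of LLM inference across request types, models, and SLOs?
RQ2: Can an automated cluster-management framework reduce energy and cost while meeting the latency SLOs of LLM services?
RQ3: What overheads does reconfiguration (scaling, sharding, frequency changes) incur, and how can they be minimized?
RQ4: Can a hierarchical controller design reliably adapt to dynamic workloads at acceptable overhead?
RQ5: On production-like traces, does DynamoLLM achieve significant energy and carbon reductions while maintaining service level objectives?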

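Key Findings

DynamoLLM saves 53% energy compared with the baseline configuration.
DynamoLLM reduces operational carbon emissions by 38%.
DynamoLLM reduces customer cost by 61% while meeting latency SLOs.
Dynamic per-request-type pools and hierarchical control enable energy-efficient operation across different workloads and SLOs.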
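
To make the idea of per-request-type pools concrete, here is a minimal routing sketch that buckets requests by input and output length. The thresholds, the short/medium/long naming, and the helpers `bucket`, `pool_for`, and `route` are assumptions made for illustration, not values or APIs from the paper. Because the output length is not known when a request arrives, the sketch also assumes that some predicted output length is available.

```python
from collections import defaultdict

# Hypothetical token thresholds separating short / medium / long requests.
INPUT_BOUNDS = (256, 1024)
OUTPUT_BOUNDS = (100, 350)


def bucket(tokens, bounds):
    """Map a token count to 'S', 'M', or 'L' using (low, high) thresholds."""
    low, high = bounds
    if tokens < low:
        return "S"
    if tokens < high:
        return "M"
    return "L"


def pool_for(input_tokens, predicted_output_tokens):
    """Name the pool for a request, e.g. 'SL' = short input, long output."""
    return bucket(input_tokens, INPUT_BOUNDS) + bucket(predicted_output_tokens, OUTPUT_BOUNDS)


def route(requests):
    """Group (request_id, input_tokens, predicted_output_tokens) tuples by pool."""
    pools = defaultdict(list)
    for req_id, in_tok, out_tok in requests:
        pools[pool_for(in_tok, out_tok)].append(req_id)
    return dict(pools)


print(route([("a", 100, 50), ("b", 2000, 500), ("c", 120, 60)]))
```

Grouping similar requests this way means each pool can be given its own instance count, parallelism degree, and GPU frequency, which is the kind of per-pool tuning the findings above describe.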


This review was generated by AI and reviewed by a human editor.