QUICK REVIEW

[Paper Review] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Jovan Stojkovic, Chaojie Zhang | arXiv (Cornell University) | Aug 1, 2024

Advanced Data Storage Technologies | Cited by 5

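One-Sentence Summary

DynamoLLM is an energy-management framework that dynamically reconfigures LLM inference clusters through per-request-type pools, model parallelism, and GPU frequency, reducing energy, carbon emissions, and cost while still meeting SLOs.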

ABSTRACT

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs causing the inference clusters to consume large amount of energy and, consequently, result in excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads, to significantly improve energy-efficiency. However, such a diverse and dynamic environment creates a large search-space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service's performance SLOs. We show that at a service-level, DynamoLLM conserves 53% energy and 38% operational carbon emissions, and reduces 61% cost to the customer, while meeting the latency SLOs.

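Motivation and Objectives

Highlight the energy inefficiency of modern LLM inference clusters running on power-hungry GPUs.
Characterize the heterogeneity and workload fluctuations of LLM inference to identify optimization opportunities.
Design an automatic, dynamic energy-management framework (DynamoLLM) that selects energy-efficient configurations under SLO constraints.
Enable frequent, low-overhead reconfiguration that adapts to changing demand without degrading quality of service.
Demonstrate scalability and effectiveness on real production traces from a major cloud provider.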

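Proposed Method

Profile the energy and performance of LLMs across multiple models, request lengths, parallelism settings (TP2/TP4/TP8), and GPU frequencies.
Formulate energy minimization under SLOs as a MILP optimization that selects the number of instances, the parallelism, and the frequency.
Decompose the optimization into a hierarchy of controllers (cluster, pool, instance) that operate at different time scales.
Maintain per-request-type pools to exploit heterogeneity in input/output lengths and model characteristics and to reduce fragmentation.
Incorporate a reconfiguration overhead model and apply low-overhead reconfiguration techniques (caching, background provisioning, NVLink transfers).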
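
To make the configuration-selection step concrete, here is a minimal sketch in Python. It is not the paper's MILP: it swaps in an exhaustive search over a small, hypothetical profile table, and it assumes every instance in a pool shares one configuration and draws a fixed average power. The `Profile` fields, the numbers in `PROFILES`, and the `pick_config` helper are all illustrative assumptions, not values or names from the paper.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Profile:
    tp: int          # tensor-parallel degree (GPUs per instance)
    freq_mhz: int    # GPU frequency setting
    max_rps: float   # assumed: highest load one instance sustains within the SLO
    power_w: float   # assumed: average power of one instance at that load


# Hypothetical profile table for a single request type (illustrative numbers only).
PROFILES = [
    Profile(2, 1200, 1.0, 500.0),
    Profile(2, 1980, 1.6, 900.0),
    Profile(4, 1200, 2.5, 900.0),
    Profile(4, 1980, 3.6, 1500.0),
    Profile(8, 1200, 5.0, 1700.0),
    Profile(8, 1980, 7.0, 2800.0),
]


def pick_config(load_rps, gpu_budget, profiles=PROFILES):
    """Return (total_power_w, n_instances, profile) with the lowest power
    that serves load_rps within the SLO, or None if nothing fits the budget."""
    best = None
    for p in profiles:
        # Enough instances so that each one stays within its SLO-safe load.
        n = max(1, ceil(load_rps / p.max_rps))
        if n * p.tp > gpu_budget:
            continue  # this configuration needs more GPUs than are available
        power = n * p.power_w
        if best is None or power < best[0]:
            best = (power, n, p)
    return best


if __name__ == "__main__":
    for load in (0.8, 4.0):
        print(load, pick_config(load, gpu_budget=16))
```

With these made-up numbers, a light load of 0.8 requests per second is served most cheaply by one TP2 instance at the low frequency, while 4.0 requests per second favors one TP8 instance at the low frequency. A real formulation would also let a pool mix configurations and would charge for the cost of switching between them.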

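Experimental Results

Research Questions

RQ1: How heterogeneous are the energy-performance profiles of LLM inference across request types, models, and SLOs?
RQ2: Can an automatic cluster-management framework reduce energy and cost while meeting the latency SLOs of LLM serving?
RQ3: How large are the overheads of reconfiguration (scaling, sharding, frequency changes), and how can they be minimized?
RQ4: Does the hierarchical controller design adapt reliably to dynamic workloads with acceptable overhead?
RQ5: On production-like traces, does DynamoLLM substantially reduce energy and carbon emissions while maintaining service-level objectives?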

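Key Findings

DynamoLLM saves 53% of energy compared with the baseline configuration.
DynamoLLM reduces operational carbon emissions by 38%.
DynamoLLM reduces customer cost by 61% while meeting the latency SLOs.
Dynamic per-request-type pools and hierarchical control enable energy-efficient operation under fluctuating workloads and SLOs.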
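
The per-request-type pools can be pictured with a short sketch. This is an illustration under stated assumptions, not the paper's classifier: the token thresholds in `INPUT_EDGES` and `OUTPUT_EDGES` are hypothetical, the short/medium/long naming is invented here, and the output length is assumed to be estimated up front by some predictor that this summary does not describe.

```python
from collections import defaultdict

# Hypothetical bucket boundaries in tokens; the summary does not state them.
INPUT_EDGES = (256, 1024)
OUTPUT_EDGES = (100, 350)


def bucket(n_tokens, edges):
    """Map a token count to 'S', 'M', or 'L' using two ascending edges."""
    if n_tokens < edges[0]:
        return "S"
    if n_tokens < edges[1]:
        return "M"
    return "L"


def request_type(input_tokens, predicted_output_tokens):
    """Combine the input and (predicted) output buckets, e.g. 'SL'."""
    return bucket(input_tokens, INPUT_EDGES) + bucket(predicted_output_tokens, OUTPUT_EDGES)


def route(requests):
    """Group (request_id, input_tokens, predicted_output_tokens) tuples by type,
    so that each pool only sees requests with similar compute demands."""
    pools = defaultdict(list)
    for req_id, in_tok, out_tok in requests:
        pools[request_type(in_tok, out_tok)].append(req_id)
    return dict(pools)


if __name__ == "__main__":
    print(route([("a", 100, 50), ("b", 2000, 500), ("c", 120, 20)]))
```

Grouping similar requests lets each pool be tuned on its own, for example with a lower GPU frequency for a pool of short requests, without one long request dictating the configuration for everything else.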


This review was generated by AI and checked by a human editor.