QUICK REVIEW

[Paper Review] The Unseen AI Disruptions for Power Grids: LLM-Induced Transients

Yuzhuo Li, Mariam Mughees | arXiv (Cornell University) | Sep 9, 2024

Smart Grid Security and Resilience | Cited by 5

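One-Sentence Summary

This paper analyzes how AI workloads, particularly LLMs, cause rapid and highly transient power demand, and discusses modeling approaches for assessing their impact on power grids and data centers.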

ABSTRACT

Recent breakthroughs of large language models (LLMs) have exhibited superior capability across major industries and stimulated multi-hundred-billion-dollar investment in AI-centric data centers in the next 3-5 years. This, in turn, brings increasing concerns about sustainability and AI-related energy usage. However, there is a largely overlooked issue as challenging and critical as AI model and infrastructure efficiency: the disruptive dynamic power consumption behaviour. With fast, transient dynamics, AI infrastructure features ultra-low inertia, sharp power surge and dip, and a significant peak-idle power ratio. The power scale covers from several hundred watts to megawatts, even to gigawatts. These never-seen-before characteristics make AI a very unique load and pose threats to the power grid reliability and resilience. To reveal this hidden problem, this paper examines the scale of AI power consumption, analyzes AI transient behaviour in various scenarios, develops high-level mathematical models to depict AI workload behaviour and discusses the multifaceted challenges and opportunities they potentially bring to existing power grids. Observing the rapidly evolving machine learning (ML) and AI technologies, this work emphasizes the critical need for interdisciplinary approaches to ensure reliable and sustainable AI infrastructure development, and provides a starting point for researchers and practitioners to tackle such challenges.

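Research Motivation and Objectives

Highlight the distinctive power and energy dynamics of AI workloads, especially LLMs, as a potentially hidden source of disruption for power grids.
Develop high-level mathematical models for AI-centric data centers that describe transient power behavior.
Analyze case studies (training, fine-tuning, inference) to illustrate transient power phenomena and their impact on the grid.
Discuss the challenges and opportunities for grid reliability, data center design, and interdisciplinary planning in the AI era.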

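Proposed Method

Provide a qualitative analysis of AI load characteristics (high peak power, fast dynamics, bursty behavior).
Propose a high-level mathematical model for AI-centric data centers that includes P_total and P_AI components.
Introduce a dynamic power consumption model with dP/dt and d²P/dt² terms to capture transients.
Apply case studies using MIT Supercloud data and benchmark LLM setups to illustrate power profiles.
Define and use metrics such as TDP, GPU utilization, PUE, Peak/Average, Peak/Idle, and dP/dt to characterize AI loads.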
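The ratio metrics listed above can be illustrated with a small calculation on a sampled power trace. This is only a minimal sketch under stated assumptions, not the paper's implementation: the function name `load_metrics`, its parameters, and the sample values are hypothetical, the trace is assumed to be uniformly sampled in watts, and PUE is taken in its usual sense of total facility power divided by IT power.

```python
from statistics import mean


def load_metrics(power_w, idle_w, facility_w=None):
    """Summarize a uniformly sampled IT power trace (watts).

    power_w    -- IT (e.g. GPU server) power samples; hypothetical input
    idle_w     -- idle power of the same equipment, assumed known
    facility_w -- optional total facility power samples, used for PUE
    """
    if not power_w:
        raise ValueError("power_w must contain at least one sample")
    if idle_w <= 0:
        raise ValueError("idle_w must be positive")
    peak = max(power_w)
    avg = mean(power_w)
    if avg <= 0:
        raise ValueError("average power must be positive")
    metrics = {
        "peak_w": peak,
        "average_w": avg,
        "peak_to_average": peak / avg,
        "peak_to_idle": peak / idle_w,
    }
    if facility_w is not None:
        # With uniform sampling, the ratio of mean powers equals the
        # ratio of energies, which is how PUE is normally defined.
        metrics["pue"] = mean(facility_w) / avg
    return metrics


# Made-up sample values, for illustration only.
trace = [100.0, 400.0, 700.0, 400.0, 400.0]
facility = [150.0, 600.0, 1050.0, 600.0, 600.0]
m = load_metrics(trace, idle_w=100.0, facility_w=facility)
print(m)
```

A high Peak/Idle value on a real trace would indicate the large swing between idle and full load that the digest describes; the numbers here are synthetic and carry no such meaning.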

Figure 1: Reported energy consumption of training different LLM models with respect to model parameters [14, 22, 23, 24, 25]. Note the consumption shown here is relatively positioned, not based on accurate numerical calculation. The exact energy consumption can differ dramatically given diffe

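Experimental Results

Research Questions

RQ1: What are the characteristic transient power behaviors of AI workloads during training, fine-tuning, and inference?
RQ2: How can high-level mathematical models capture the dynamic power behavior of AI-centric data centers and its impact on the grid?
RQ3: What insights do the case studies (e.g., MIT Supercloud BERT jobs, GPT2/nanoGPT setups) offer for grid resilience and data center design in AI deployments?
RQ4: Which metrics best capture the potential impact of large-scale AI computing on grid stability?
RQ5: What opportunities exist in planning and managing AI infrastructure to ensure reliable and sustainable grid operation?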

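Key Findings

AI workloads exhibit fast, bursty power consumption with high peak-to-average ratios and pronounced transients, which can stress power distribution systems.
Simple linear models are insufficient; the paper proposes dynamic, higher-order power models (including first and second derivatives) to capture rapid changes in AI power draw.
Training can keep AI accelerators at sustained high utilization, with near-constant power within intervals, whereas inference shows wider variation in utilization.
The case studies demonstrate power dynamics on real systems (e.g., BERT jobs peaking near 50 kW with significant fluctuations), showing the need for robust grid-aware planning.
The work provides a framework and metrics (TDP, GPU utilization, PUE, Peak/Average, Peak/Idle, dP/dt) for analyzing and designing AI-oriented data centers and their grid interfaces.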
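To make the first- and second-derivative terms mentioned in the findings concrete, the sketch below estimates dP/dt and d²P/dt² from a sampled trace using forward finite differences. It is an illustrative assumption rather than the paper's model: the function name `ramp_rates`, the sampling interval, and the trace values are all hypothetical.

```python
def ramp_rates(power_w, dt_s):
    """Forward-difference estimates of dP/dt (W/s) and d2P/dt2 (W/s^2).

    power_w -- uniformly sampled power trace in watts (hypothetical data)
    dt_s    -- sampling interval in seconds
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    if len(power_w) < 3:
        raise ValueError("need at least three samples for a second difference")
    # First difference: change in power between consecutive samples.
    dp = [(b - a) / dt_s for a, b in zip(power_w, power_w[1:])]
    # Second difference: change in ramp rate between consecutive intervals.
    d2p = [(b - a) / dt_s for a, b in zip(dp, dp[1:])]
    return dp, d2p


# Made-up trace: a sharp surge followed by a sharp dip, sampled every 0.5 s.
trace = [200.0, 200.0, 1000.0, 1000.0, 300.0]
dp, d2p = ramp_rates(trace, dt_s=0.5)
max_ramp = max(abs(x) for x in dp)
print(dp, d2p, max_ramp)
```

On measured data, differencing amplifies noise, so a real analysis would likely smooth the trace first; that step is omitted here to keep the sketch short.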

Figure 2: The schematic topology of an AI server with 8 GPUs.


This review was generated by AI and reviewed by a human editor.