QUICK REVIEW

[Paper Review] The Unseen AI Disruptions for Power Grids: LLM-Induced Transients

Yuzhuo Li, Mariam Mughees | arXiv (Cornell University) | 2024-09-09

Smart Grid Security and Resilience | Citations: 5

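One-line summary

This paper discusses how AI workloads, LLMs in particular, create fast and highly transient power demand, and presents a modeling approach for assessing the impact on power grids and data centers.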

ABSTRACT

Recent breakthroughs of large language models (LLMs) have exhibited superior capability across major industries and stimulated multi-hundred-billion-dollar investment in AI-centric data centers in the next 3-5 years. This, in turn, bring the increasing concerns on sustainability and AI-related energy usage. However, there is a largely overlooked issue as challenging and critical as AI model and infrastructure efficiency: the disruptive dynamic power consumption behaviour. With fast, transient dynamics, AI infrastructure features ultra-low inertia, sharp power surge and dip, and a significant peak-idle power ratio. The power scale covers from several hundred watts to megawatts, even to gigawatts. These never-seen-before characteristics make AI a very unique load and pose threats to the power grid reliability and resilience. To reveal this hidden problem, this paper examines the scale of AI power consumption, analyzes AI transient behaviour in various scenarios, develops high-level mathematical models to depict AI workload behaviour and discusses the multifaceted challenges and opportunities they potentially bring to existing power grids. Observing the rapidly evolving machine learning (ML) and AI technologies, this work emphasizes the critical need for interdisciplinary approaches to ensure reliable and sustainable AI infrastructure development, and provides a starting point for researchers and practitioners to tackle such challenges.

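Research motivation and objectives

Highlights the unique power and energy dynamics of AI workloads, LLMs in particular, as a hidden disturbance to the power grid.
Develops high-level mathematical models of AI-centric data centers to describe transient power behaviour.
Analyzes case studies (training, fine-tuning, inference) to illustrate transient power phenomena and their implications for the grid.
Discusses the challenges and opportunities for grid reliability, data center design, and interdisciplinary planning in the AI era.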

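Proposed method

Provides a qualitative analysis of AI load characteristics (high peak power, fast dynamics, bursty behaviour).
Proposes a high-level mathematical model of an AI-centric data center, including P_total and P_AI components.
Introduces a dynamic power consumption model with dP/dt and d2P/dt2 terms to capture transients.
Applies case studies using MIT Supercloud data and benchmark LLM settings to illustrate power profiles.
Defines and uses TDP, GPU utilization, PUE, Peak/Average, Peak/Idle, and dP/dt as metrics for characterizing AI loads.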
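
The metrics listed above have common definitions that can be written out for a sampled power trace. The block below is a sketch under stated assumptions, not the paper's own formulation: $P[k]$ is hypothetical notation for power sampled every $\Delta t$ seconds over $N$ samples, $P_{\mathrm{idle}}$ is a measured idle baseline, and $P_{\mathrm{IT}}$ is the IT equipment power. PUE is conventionally defined on energy over a period; applying it to power, as here, is an assumption. The finite differences are one common way to approximate dP/dt and d2P/dt2 from samples.

```latex
\begin{align}
\mathrm{PUE} &= \frac{P_{\mathrm{total}}}{P_{\mathrm{IT}}} \\
\text{Peak/Average} &= \frac{\max_k P[k]}{\frac{1}{N}\sum_{k=1}^{N} P[k]} \\
\text{Peak/Idle} &= \frac{\max_k P[k]}{P_{\mathrm{idle}}} \\
\left.\frac{dP}{dt}\right|_{k} &\approx \frac{P[k+1] - P[k]}{\Delta t} \\
\left.\frac{d^2P}{dt^2}\right|_{k} &\approx \frac{P[k+1] - 2P[k] + P[k-1]}{\Delta t^{2}}
\end{align}
```

A higher sampling rate (smaller $\Delta t$) matters here: a surge that completes between two samples is invisible to both difference terms.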

Figure 1: Reported energy consumption of training different LLM models with respect to model parameters [14, 22, 23, 24, 25]. Note the consumption shown here is relatively positioned, not based on accurate numerical calculation. The exact energy consumption can differ dramatically given diffe

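Experimental results

Research questions

RQ1: What are the distinct transient power characteristics of AI workloads across training, fine-tuning, and inference?
RQ2: How can high-level mathematical models capture the dynamic power behaviour of AI-centric data centers and its impact on the power grid?
RQ3: What implications do the case studies (e.g., MIT Supercloud BERT jobs, GPT2/nanoGPT settings) offer for grid resilience and data center design in AI deployments?
RQ4: Which metrics best capture the implications of large-scale AI computing for grid stability?
RQ5: What opportunities exist in planning and managing AI infrastructure to ensure stable and sustainable grid operation?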

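Key findings

AI workloads show fast, bursty power consumption, with a high peak-to-average ratio and substantial transient swings that can stress grid distribution systems.
Simple linear models are insufficient; the paper proposes a dynamic, higher-order power model that includes first and second derivatives to capture fast changes in AI power draw.
Training drives AI accelerators to sustained high utilization and shows nearly constant high power over intervals, whereas inference shows wide variation in utilization.
The case studies show power dynamics in real systems (for example, BERT jobs peaking near roughly 50 kW with pronounced variability), suggesting the need for robust grid response planning.
The study provides a framework and metrics (TDP, GPU utilization, PUE, Peak/Average, Peak/Idle, dP/dt) for analyzing and designing AI-centric data centers and their grid interface.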
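
To make the trace-based metrics concrete, here is a minimal Python sketch that computes Peak/Average, Peak/Idle, and finite-difference estimates of dP/dt and d2P/dt2 from a sampled power trace. The function name `load_metrics`, its parameters, and the sample values are hypothetical and chosen only for illustration; they are not taken from the paper, and the trace is synthetic rather than measured.

```python
def load_metrics(power_w, dt_s, idle_w):
    """Summarize a power trace sampled every dt_s seconds (values in watts).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    p = [float(x) for x in power_w]
    if len(p) < 3:
        raise ValueError("need at least 3 samples for a second difference")
    if dt_s <= 0 or idle_w <= 0:
        raise ValueError("dt_s and idle_w must be positive")

    # First difference approximates dP/dt (W/s).
    ramp = [(b - a) / dt_s for a, b in zip(p, p[1:])]
    # Difference of the ramp approximates d2P/dt2 (W/s^2).
    accel = [(b - a) / dt_s for a, b in zip(ramp, ramp[1:])]

    peak = max(p)
    average = sum(p) / len(p)
    return {
        "peak_w": peak,
        "peak_to_average": peak / average,
        "peak_to_idle": peak / idle_w,
        "max_abs_ramp_w_per_s": max(abs(r) for r in ramp),
        "max_abs_accel_w_per_s2": max(abs(a) for a in accel),
    }


if __name__ == "__main__":
    # Synthetic step-up / step-down trace, 1 s sampling. Not real data.
    trace = [100.0, 100.0, 400.0, 400.0, 100.0]
    for name, value in load_metrics(trace, dt_s=1.0, idle_w=100.0).items():
        print(f"{name}: {value:.3f}")
```

The sketch reports the largest absolute ramp rather than the mean because grid-side stress is driven by the steepest surge or dip, which an average would smooth away.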

Figure 2: The schematic topology of an AI server with 8 GPUs.

This review was generated by AI and reviewed by a human editor.