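[Paper Summary] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
Zeus is an online optimization framework that jointly tunes batch size and GPU power limit to minimize a combined energy-time cost for recurring DNN training jobs, using a Thompson Sampling-based multi-armed bandit (MAB) and a just-in-time energy profiler.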
Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose Zeus, an optimization framework to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%-75.8% for diverse workloads.
Motivation and Objectives
- Motivate reducing energy consumption in DNN training while acknowledging tradeoffs with performance.
- Characterize how batch size and GPU power limit affect energy efficiency and time-to-accuracy across workloads.
- Develop an online, data-drift-aware optimizer that requires no offline profiling and adapts to recurrent training jobs.
- Enable integration with existing DNN workflows with minimal code changes.
Proposed Method
- Characterize energy vs. performance tradeoffs for DNN training across batch sizes and GPU power limits.
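- Formulate the optimization objective as a cost C(b, p; η) that combines energy-to-accuracy (ETA) and time-to-accuracy (TTA), where b is the batch size, p is the GPU power limit, and η is a user-defined weight between the two.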
- Decouple optimization of batch size and power limit to reduce search space while preserving optimality.
- Use a just-in-time energy profiler to measure AvgPower(b, p) and Throughput(b, p) online across the candidate power limits for a given batch size.
- Apply Gaussian Thompson Sampling as the MAB policy to select batch sizes across recurrences and update beliefs with observed costs.
- Prune unpromising batch sizes during exploration, and handle data drift and concurrent jobs by treating costs as non-stationary and computing cost statistics over a window of recent observations.
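The cost metric and the decoupling step above can be sketched as follows. This is a minimal illustration, assuming the cost takes the form C(b, p; η) = η · ETA + (1 − η) · MAXPOWER · TTA, where MAXPOWER is the GPU's maximum power limit. Because TTA is proportional to 1 / Throughput(b, p) for a fixed batch size, the best power limit can be chosen from the profiled AvgPower and Throughput alone. The function names and the profile numbers are hypothetical; they are not Zeus's API or measured data.

```python
from typing import Dict, Tuple

# Hypothetical profile for one batch size:
# power limit (W) -> (average power draw in W, throughput in samples/s).
# These numbers are made up for illustration, not measurements.
Profile = Dict[int, Tuple[float, float]]

EXAMPLE_PROFILE: Profile = {
    300: (290.0, 1000.0),
    250: (245.0, 950.0),
    200: (198.0, 850.0),
    150: (149.0, 600.0),
}


def cost(knob: float, energy_j: float, time_s: float, max_power_w: float) -> float:
    """Combined cost: knob * ETA + (1 - knob) * MAXPOWER * TTA.

    knob is the user-defined weight (eta in the text): 1.0 optimizes
    energy only, 0.0 optimizes time only.
    """
    return knob * energy_j + (1.0 - knob) * max_power_w * time_s


def best_power_limit(profile: Profile, knob: float, max_power_w: float) -> int:
    """Pick the power limit that minimizes the cost per training sample.

    For a fixed batch size, the number of samples needed to reach the
    target accuracy does not depend on the power limit, so minimizing
    the per-sample cost also minimizes the total cost.
    """
    def cost_per_sample(power_limit: int) -> float:
        avg_power_w, throughput = profile[power_limit]
        return (knob * avg_power_w + (1.0 - knob) * max_power_w) / throughput

    return min(profile, key=cost_per_sample)


if __name__ == "__main__":
    for knob in (0.0, 0.5, 1.0):
        print(knob, best_power_limit(EXAMPLE_PROFILE, knob, max_power_w=300.0))
```

With these made-up numbers, η = 0 selects 300 W (fastest), η = 0.5 selects 250 W, and η = 1 selects 200 W (lowest energy per sample), which shows how the weight moves the choice along the energy-time tradeoff.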
Experimental Results
Research Questions
- RQ1: How do batch size and GPU power limit jointly influence energy consumption and training time for DNNs?
- RQ2: Can an online, workload-adaptive optimizer minimize a combined energy-time cost without offline profiling?
- RQ3: How does Zeus handle stochastic training, data drift, and concurrent recurrent jobs in production clusters?
Key Findings
- Zeus reduces energy consumption by 15.3%–75.8% across diverse workloads compared to a baseline that uses the maximum batch size and the maximum GPU power limit.
- Training time is reduced by 60.6% relative to baseline configurations.
- Zeus adapts quickly to data drift and supports multi-GPU settings.
- The optimizer converges to near-optimal configurations with online profiling and Thompson Sampling.
- Just-in-time profiling yields negligible overhead while avoiding expensive offline measurements.
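The Thompson Sampling behavior behind the convergence finding above can be illustrated with a toy simulation. This is a simplified sketch, not Zeus's implementation: it assumes a flat prior, models the belief about each batch size's mean cost as a Gaussian centered on the sample mean with the standard error as its spread, and keeps only a window of recent costs as a stand-in for drift handling. The class name, the `simulate` helper, and the cost values are all hypothetical.

```python
import random
import statistics
from collections import Counter, deque


class GaussianThompsonSampler:
    """Toy Gaussian Thompson Sampling over batch sizes (lower cost is better)."""

    def __init__(self, batch_sizes, window=10, seed=0):
        # Keep only the most recent `window` costs per arm so that old
        # observations are forgotten if the cost distribution drifts.
        self.costs = {b: deque(maxlen=window) for b in batch_sizes}
        self.rng = random.Random(seed)

    def predict(self):
        # Try every batch size twice before sampling, so that a sample
        # standard deviation is defined for each arm.
        for b, seen in self.costs.items():
            if len(seen) < 2:
                return b
        # Draw one plausible mean cost per arm and pick the smallest draw.
        return min(self.costs, key=self._sample_mean_cost)

    def _sample_mean_cost(self, b):
        seen = list(self.costs[b])
        mean = statistics.fmean(seen)
        # Standard error of the mean: the belief narrows as evidence grows.
        std_err = statistics.stdev(seen) / len(seen) ** 0.5
        return self.rng.gauss(mean, std_err)

    def observe(self, b, cost):
        self.costs[b].append(cost)


def simulate(rounds=60, seed=1):
    """Run recurring 'jobs' against made-up noisy costs and count the picks."""
    # Hypothetical true mean costs per batch size, plus Gaussian noise.
    true_cost = {32: 120.0, 64: 100.0, 128: 140.0}
    env = random.Random(seed)
    sampler = GaussianThompsonSampler(true_cost)
    picks = Counter()
    for _ in range(rounds):
        b = sampler.predict()
        sampler.observe(b, env.gauss(true_cost[b], 2.0))
        picks[b] += 1
    return picks


if __name__ == "__main__":
    print(simulate())
```

In this toy setup the sampler should concentrate on the batch size with the lowest mean cost after the initial exploration rounds. The sketch omits the pruning and concurrency handling described in the method section.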
This summary was generated by AI and reviewed by human editors.