QUICK REVIEW

[Paper Review] The rising costs of training frontier AI models

Ben Cottier, Robi Rahman|arXiv (Cornell University)|May 31, 2024

Machine Learning and Data Classification | Cited by 18

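One-Sentence Summary

This paper builds a detailed cost model for training frontier AI models and, using three estimation approaches, finds that amortized training cost has grown roughly 2.4x per year since 2016, reaching tens of millions of dollars per model and potentially exceeding $1 billion by 2027; R&D staff are a significant component of total cost.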

ABSTRACT

The costs of training frontier AI models have grown dramatically in recent years, but there is limited public data on the magnitude and growth of these expenses. This paper develops a detailed cost model to address this gap, estimating training costs using three approaches that account for hardware, energy, cloud rental, and staff expenses. The analysis reveals that the amortized cost to train the most compute-intensive models has grown precipitously at a rate of 2.4x per year since 2016 (90% CI: 2.0x to 2.9x). For key frontier models, such as GPT-4 and Gemini, the most significant expenses are AI accelerator chips and staff costs, each costing tens of millions of dollars. Other notable costs include server components (15-22%), cluster-level interconnect (9-13%), and energy consumption (2-6%). If the trend of growing development costs continues, the largest training runs will cost more than a billion dollars by 2027, meaning that only the most well-funded organizations will be able to finance frontier AI models.

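Motivation and Objectives

Quantify the upward trend in frontier AI training costs across multiple cost components.
Break costs down into hardware, energy, cloud rental, and R&D staff to identify the main drivers.
Provide three estimation approaches to cross-check the cost growth and assess how robust each approach is.
Inform discussion of the accessibility and governance of frontier AI development.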

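Proposed Method

Three cost estimation approaches: (i) amortized hardware capital expenditure plus the energy cost of the final training run, (ii) cost based on cloud rental prices, and (iii) total model development cost for selected models, including R&D staff and experiments.
Hardware is split into AI accelerators, servers, networking, and energy; amortized cost is computed by combining depreciation with training chip-hours.
Energy cost is modeled from chip TDP, the ratio of average power draw to TDP, and data center PUE, using year-specific energy prices and vendor data.
Validation compares the amortized hardware + energy estimate against the cloud-based estimate; sensitivity analyses cover the depreciation assumptions and whether TPUs are included.
R&D cost is assessed across GPT-3, OPT-175B, GPT-4, and Gemini Ultra, including equity compensation and staffing composition.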
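
As a rough illustration of approach (i), the sketch below multiplies training chip-hours by a depreciated hourly hardware cost, adds a hardware overhead, and computes energy from TDP, the power-to-TDP ratio, and PUE. It is not the paper's implementation: the straight-line depreciation, the `TrainingRun` fields, and every number in `example` are assumptions chosen only to make the arithmetic concrete (the 0.23 overhead echoes the 23% figure mentioned in the Figure 1 caption).

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class TrainingRun:
    """Hypothetical inputs for one final training run (all values assumed)."""
    chip_hours: float       # total accelerator-hours used by the run
    chip_price_usd: float   # purchase price per accelerator
    lifetime_years: float   # assumed straight-line depreciation period
    overhead: float         # extra hardware cost as a fraction of chip cost
    tdp_kw: float           # accelerator thermal design power, in kW
    power_to_tdp: float     # average power draw divided by TDP
    pue: float              # data center power usage effectiveness
    usd_per_kwh: float      # energy price


def amortized_hardware_cost(run: TrainingRun) -> float:
    """Chip-hours times a depreciated hourly chip cost, plus overhead."""
    hourly_cost = run.chip_price_usd / (run.lifetime_years * HOURS_PER_YEAR)
    return run.chip_hours * hourly_cost * (1 + run.overhead)


def energy_cost(run: TrainingRun) -> float:
    """Energy drawn by the chips, scaled by PUE, priced per kWh."""
    kwh = run.chip_hours * run.tdp_kw * run.power_to_tdp * run.pue
    return kwh * run.usd_per_kwh


# Illustrative values only; none of these come from the paper's dataset.
example = TrainingRun(
    chip_hours=10_000_000,
    chip_price_usd=20_000,
    lifetime_years=4,
    overhead=0.23,
    tdp_kw=0.4,
    power_to_tdp=0.75,
    pue=1.1,
    usd_per_kwh=0.07,
)

if __name__ == "__main__":
    hw = amortized_hardware_cost(example)
    en = energy_cost(example)
    print(f"hardware: ${hw:,.0f}  energy: ${en:,.0f}  total: ${hw + en:,.0f}")
```

With these made-up inputs, energy is a small fraction of the total, which matches the qualitative pattern the summary reports. A different depreciation model would change the hardware term but not the structure of the calculation.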

Figure 1: Amortized hardware cost plus energy cost for the final training run of frontier models. The selected models are among the top 10 most compute-intensive for their time. Amortized hardware costs are the product of training chip-hours and a depreciated hardware cost, with 23% overhead added f

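Experimental Results

Research Questions

RQ1: How fast has the amortized training cost of frontier models grown from 2016 to the present?
RQ2: What shares of frontier model cost go to hardware, energy, and R&D staff, and how do they vary over the course of development?
RQ3: How do the different cost estimation approaches differ in magnitude and trend?
RQ4: What does rising frontier cost imply for the accessibility and governance of AI development?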

Key Findings

Approach	N× increase per year	OOMs/year	Doubling Time (months)	R-squared	N
Amortized hardware CapEx + energy	2.4 [2.0, 3.1]	0.39 [0.29, 0.48]	9 [7, 12]	0.61	45
Amortized hardware CapEx + energy (excluding TPUs)	2.9 [2.3, 3.8]	0.47 [0.35, 0.58]	8 [6, 10]	0.77	23
Cloud rental	2.6 [2.1, 3.2]	0.41 [0.32, 0.51]	9 [7, 11]	0.68	40
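
The three growth columns in the table are different views of the same rate: an annual growth factor g corresponds to log10(g) orders of magnitude per year and a doubling time of 12·ln 2 / ln g months. The sketch below shows that conversion; the function names are illustrative. Small differences from the table are to be expected if each column was rounded separately from the underlying fit.

```python
import math


def ooms_per_year(growth_factor: float) -> float:
    """Orders of magnitude gained per year for a given annual growth factor."""
    return math.log10(growth_factor)


def doubling_time_months(growth_factor: float) -> float:
    """Months needed for cost to double at a given annual growth factor."""
    return 12 * math.log(2) / math.log(growth_factor)


if __name__ == "__main__":
    # Point estimates of the annual growth factor taken from the table.
    for label, g in [("hardware + energy", 2.4),
                     ("hardware + energy, no TPUs", 2.9),
                     ("cloud rental", 2.6)]:
        print(f"{label}: {ooms_per_year(g):.2f} OOMs/year, "
              f"doubling every {doubling_time_months(g):.1f} months")
```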

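Since 2016, the amortized training cost of frontier models has grown 2.4× per year (95% CI: 2.0× to 3.1×).
The cloud-based cost estimate shows a similar growth rate of 2.6× per year (95% CI: 2.1× to 3.2×).
For the selected notable models (GPT-3, OPT-175B, GPT-4, Gemini Ultra), R&D staff cost including equity accounts for 29%–49% of total amortized development cost, hardware for 47%–65%, and energy for 2%–6% (excluding equity: R&D staff 22%–33%, hardware 60%–74%, energy 2%–7%).
The most expensive publicly announced training runs have estimated final-run costs in the tens of millions of dollars (GPT-4: about $40 million; Gemini Ultra: about $30 million); extrapolating the current growth rate, the cost could exceed $1 billion by 2027.
Acquisition cost (buying the hardware up front) is one to two orders of magnitude higher than the amortized cost, which underscores the capital barrier to entry.
On average, about 44% of the amortized hardware + energy cost goes to AI accelerator chips, 29% to servers, and 17% to cluster-level interconnect; the energy share remains small but is growing.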
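
The 2027 extrapolation can be sanity-checked with simple compound growth. The sketch below starts from the roughly $40 million GPT-4 estimate and applies the 2.4× per year rate and its interval bounds. `START_YEAR = 2023` is an assumed anchor for illustration, not a value taken from the paper, and the paper's own projection may be based on a fitted trend line rather than a single model.

```python
import math

START_YEAR = 2023          # assumed anchor year for the GPT-4 estimate
START_COST_USD = 40e6      # approximate GPT-4 final-run cost from the summary
THRESHOLD_USD = 1e9        # the $1 billion mark


def years_until(threshold: float, start_cost: float, growth: float) -> float:
    """Years of compound growth needed for start_cost to reach threshold."""
    return math.log(threshold / start_cost) / math.log(growth)


def projected_cost(start_cost: float, growth: float, years: float) -> float:
    """Cost after compounding at `growth` per year for `years` years."""
    return start_cost * growth ** years


if __name__ == "__main__":
    # Central estimate and the interval bounds reported for the growth rate.
    for g in (2.0, 2.4, 3.1):
        t = years_until(THRESHOLD_USD, START_COST_USD, g)
        print(f"growth {g}x/year: reaches $1B after {t:.1f} years "
              f"(about {START_YEAR + t:.1f} under the assumed anchor)")
```

At the central rate the threshold is reached in a little under four years, so the result is sensitive to both the anchor year and the growth rate chosen.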

Figure 2: (Reproduction of Figure 1 for convenience.) Amortized hardware cost plus energy cost for the final training run of frontier models. The selected models are among the top 10 most compute-intensive for their time. Amortized hardware costs are the product of training chip-hours and a deprecia

This review was generated by AI and checked by a human editor.