QUICK REVIEW

[Paper Review] Algorithmic progress in language models

Anson Ho, Tamay Besiroglu|arXiv (Cornell University)|Mar 9, 2024

Natural Language Processing Techniques | Cited by 11

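One-sentence summary

This paper quantifies how algorithmic improvements in language model pre-training have reduced the compute required over time, finding a median doubling time for effective compute of about 8 months and that compute scaling has driven most of the gains in recent years. It also estimates the compute-equivalent gain of the Transformer and traces how the relative roles of algorithms and hardware scaling have evolved.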

ABSTRACT

We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore's Law. We estimate augmented scaling laws, which enable us to quantify algorithmic progress and determine the relative contributions of scaling models versus innovations in training algorithms. Despite the rapid pace of algorithmic progress and the development of new architectures such as the transformer, our analysis reveals that the increase in compute made an even larger contribution to overall performance improvements over this time period. Though limited by noisy benchmark data, our analysis quantifies the rapid progress in language modeling, shedding light on the relative contributions from compute and algorithms.
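
To make the abstract's comparison with hardware gains concrete, the following minimal sketch compounds an 8-month halving time against a hardware doubling period. The 24-month figure for Moore's Law is an assumption made here (a common reading, not a number taken from the paper), and the 48-month span is arbitrary.

```python
def growth_factor(months_elapsed, doubling_months):
    """Multiplicative gain after `months_elapsed` at a fixed doubling period."""
    return 2.0 ** (months_elapsed / doubling_months)

ALGO_HALVING_MONTHS = 8      # halving time of required compute, from the abstract
MOORE_DOUBLING_MONTHS = 24   # assumption: hardware doubles roughly every two years

span = 48  # months; arbitrary illustration window

# Halving the required compute every 8 months is equivalent to doubling
# effective compute every 8 months.
algo = growth_factor(span, ALGO_HALVING_MONTHS)        # 2**6 = 64
hardware = growth_factor(span, MOORE_DOUBLING_MONTHS)  # 2**2 = 4

print(f"algorithms: {algo:.0f}x, hardware: {hardware:.0f}x over {span} months")
```

Under these assumptions, four years of algorithmic progress compounds to a 64× reduction in required compute, against a 4× hardware gain over the same span.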

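Research motivation and goals

Measure the rate of algorithmic progress in language model pre-training, using a dataset of more than 200 evaluations spanning 2012–2023.
Decompose performance improvements into contributions from algorithmic improvements, model scale, and data scale.
Estimate doubling times for effective compute, data, and parameter efficiency, and compare Transformer with non-Transformer architectures.
Assess how the Transformer architecture changed compute efficiency and overall progress.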

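Proposed method

Fit an augmented scaling law that relates performance (perplexity) to model size N and dataset size D, includes an irreducible loss E, and lets the effective quantities N_eff and D_eff grow exponentially over time.
Define effective data D_eff = D exp(beta' (Y - Y0)) and effective model size N_eff = N exp(alpha' (Y - Y0)), where Y is the year and Y0 a reference year, and substitute them into L = E + A/N_eff^alpha_param + B/D_eff^beta_data.
Estimate about 90 model specifications and compare them by leave-one-out cross-validation to select the preferred specification (model 7 by the authors' criteria).
Use a Shapley-value-style decomposition to attribute progress to scaling of data and parameters versus algorithmic improvements.
Assess the compute-equivalent gain of the Transformer by introducing a parameter gamma_T and computing the resulting reduction in reducible loss.
Compute the doubling times T_D = (beta_data/beta_year) ln 2, T_N = (alpha_param/alpha_year) ln 2, and T_C = (1/T_N + 1/T_D)^-1 to quantify the rate of progress. These expressions agree with the definitions of D_eff and N_eff if beta_year = beta_data * beta' and alpha_year = alpha_param * alpha', in which case T_D = ln 2 / beta' and T_N = ln 2 / alpha'.
Run robustness checks, including alternative specifications, controls for autocorrelation, and cross-validation across datasets (WT103, WT2, PTB).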
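
The augmented scaling law and the doubling-time formulas listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, every numeric coefficient is a placeholder rather than an estimate from the paper, and the relations alpha_year = alpha_param * alpha' and beta_year = beta_data * beta' are inferred from the formulas rather than quoted.

```python
import math

def effective_size(raw, rate, year, year0):
    """Effective quantity: raw * exp(rate * (year - year0))."""
    return raw * math.exp(rate * (year - year0))

def augmented_loss(N, D, year, *, E, A, B, alpha_param, beta_data,
                   alpha_prime, beta_prime, year0):
    """L = E + A / N_eff**alpha_param + B / D_eff**beta_data."""
    N_eff = effective_size(N, alpha_prime, year, year0)
    D_eff = effective_size(D, beta_prime, year, year0)
    return E + A / N_eff ** alpha_param + B / D_eff ** beta_data

def doubling_times(alpha_param, beta_data, alpha_year, beta_year):
    """Return (T_N, T_D, T_C) in the time unit of the yearly rates."""
    T_N = alpha_param / alpha_year * math.log(2)
    T_D = beta_data / beta_year * math.log(2)
    T_C = 1.0 / (1.0 / T_N + 1.0 / T_D)
    return T_N, T_D, T_C

# Placeholder coefficients for illustration only (NOT the paper's estimates).
params = dict(E=1.7, A=400.0, B=410.0, alpha_param=0.34, beta_data=0.28,
              alpha_prime=0.6, beta_prime=0.4, year0=2014)

# Same raw N and D, evaluated at two different years.
loss_2014 = augmented_loss(1e9, 1e10, 2014, **params)
loss_2020 = augmented_loss(1e9, 1e10, 2020, **params)

# Assumed link between the per-year coefficients and alpha', beta'.
alpha_year = params["alpha_param"] * params["alpha_prime"]
beta_year = params["beta_data"] * params["beta_prime"]
T_N, T_D, T_C = doubling_times(params["alpha_param"], params["beta_data"],
                               alpha_year, beta_year)

print(f"loss at fixed N, D: {loss_2014:.3f} (2014) -> {loss_2020:.3f} (2020)")
print(f"T_N = {12 * T_N:.1f} mo, T_D = {12 * T_D:.1f} mo, T_C = {12 * T_C:.1f} mo")
```

With the assumed link between the coefficients, T_C reduces to ln 2 / (alpha' + beta'): the growth rates of effective parameters and effective data add, which is why the doubling times combine harmonically.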

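Experimental results

Research questions

RQ1: What share of the improvement in language model performance comes from algorithmic progress, as opposed to scaling of compute, data, and parameters?
RQ2: Measured against a fixed performance target, how quickly does algorithmic progress reduce the compute required for language model pre-training?
RQ3: How much did the Transformer contribute to compute efficiency relative to earlier architectures?
RQ4: How do model architecture, data quality, and training techniques affect the progress observed over time?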

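Key findings

The median doubling time of effective compute is 8.4 months (95% CI: 4.5–14.3 months).
A naive extrapolation suggests that, if algorithmic progress continued at the observed rate, the algorithmic gains since 2014 are equivalent to roughly 22,000× more compute, though this extrapolation should be treated with caution.
At frontier compute budgets, the Transformer yields a compute-equivalent gain with a median estimate of 7.2× (95% CI: 3.3× to 45.7×), indicating a substantial efficiency improvement from the architecture.
Across pairs of successive models, the importance of compute scaling relative to algorithmic progress has increased, consistent with the growing focus on large-scale LLMs since 2019.
The Shapley analysis indicates that, since 2014, compute scaling has contributed more to performance gains than algorithmic progress, although the Transformer and algorithmic progress still play important roles.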
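
The Shapley-style attribution mentioned in the findings can be illustrated with a two-factor toy example. This is a minimal sketch, not the paper's procedure: the toy loss function, its coefficients, the two model configurations, and the factor names "scale" and "algorithms" are all hypothetical, chosen only to show how a loss reduction is split between factors.

```python
import math
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_loss(N, D, year):
    """Toy augmented scaling law with placeholder coefficients (hypothetical)."""
    E, A, B = 1.7, 400.0, 410.0
    a, b = 0.34, 0.28
    rate, year0 = 0.5, 2014
    gain = math.exp(rate * (year - year0))  # algorithmic multiplier on N and D
    return E + A / (N * gain) ** a + B / (D * gain) ** b

OLD = dict(N=1e8, D=1e9, year=2014)    # hypothetical earlier model
NEW = dict(N=1e10, D=1e11, year=2022)  # hypothetical later model

def loss_reduction(coalition):
    """Loss reduction when only the factors in `coalition` move from OLD to NEW."""
    src = NEW if "scale" in coalition else OLD
    year = NEW["year"] if "algorithms" in coalition else OLD["year"]
    baseline = toy_loss(OLD["N"], OLD["D"], OLD["year"])
    return baseline - toy_loss(src["N"], src["D"], year)

shares = shapley_values(["scale", "algorithms"], loss_reduction)
total = loss_reduction({"scale", "algorithms"})
for name, share in shares.items():
    print(f"{name}: {share / total:.1%} of the loss reduction")
```

Averaging over both orderings matters because the factors interact: the marginal benefit of scaling is smaller once algorithmic gains are already applied, and vice versa. The Shapley values always sum to the total loss reduction.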


This review was generated by AI and reviewed by human editors.