QUICK REVIEW

[Paper Review] Scaling Laws Do Not Scale

Fernando Díaz, Michael Madaio|arXiv (Cornell University)|Jul 5, 2023

Opinion Dynamics and Social Influence | Cited by 12

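One-Sentence Summary

This paper argues that scaling laws, which relate model performance to dataset size or parameter count, are fragile when models are evaluated against diverse human populations, owing to metric fragility, subgroup divergence, and sociotechnical dynamics. It proposes adding evaluation data size as a third axis in scaling-law analyses to capture changes in population composition, and warns that larger datasets will not necessarily improve performance for every community.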

ABSTRACT

Recent work has advocated for training AI models on ever-larger datasets, arguing that as the size of a dataset increases, the performance of a model trained on that dataset will correspondingly increase (referred to as "scaling laws"). In this paper, we draw on literature from the social sciences and machine learning to critically interrogate these claims. We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output. As the size of datasets used to train large AI models grows and AI systems impact ever larger groups of people, the number of distinct communities represented in training or evaluation datasets grows. It is thus even more likely that communities represented in datasets may have values or preferences not reflected in (or at odds with) the metrics used to evaluate model performance in scaling laws. Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations -- threatening the validity of claims that model performance is improving at scale. We end the paper with implications for AI development: that the motivation for scraping ever-larger datasets may be based on fundamentally flawed assumptions about model performance. That is, models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models. We suggest opportunities for the field to rethink norms and values in AI development, resisting claims for universality of large models, fostering more local, small-scale designs, and other ways to resist the impetus towards scale in AI.

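Research Motivation and Goals

Question whether scaling laws can reliably predict performance for diverse communities as datasets grow.
Emphasize that evaluation metrics are proxies for latent constructs and may be contested or unstable across populations.
Point out that growing the evaluation set changes its composition, introducing subgroups with different metric preferences.
Propose adding an evaluation data size axis to scaling laws to reflect the changing composition of the population.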

Proposed Method

Review evaluation-metric theory and measurement modeling to define constructs and their proxies (μ*, μ).
Analyze how scaling laws use training data size to infer performance μ*(U, π(D)) through the proxy μ(U, π(D)).
Discuss metric incompatibility, nonstationarity, staging, subtasks, and measurement capacity in the context of scaling laws.
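Argue that larger evaluation datasets increase subgroup diversity, which can undermine the validity of a universal metric.
Propose adding evaluation data size as a third axis in scaling-law analyses to capture changes in population composition.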
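
The idea of treating evaluation data size as a third axis can be illustrated with a toy sketch. The code below is not taken from the paper: the saturating power-law form, the two subgroups "A" and "B", the composition rule, and every number are assumptions chosen only to show how an aggregate metric can depend on who is in the evaluation set as well as on training data size.

```python
# Toy illustration only: every functional form and number here is hypothetical.

# Group A gains a lot from more training data; group B gains slowly and saturates lower.
GROUPS = {
    "A": {"ceiling": 0.95, "rate": 0.30},
    "B": {"ceiling": 0.60, "rate": 0.05},
}


def subgroup_score(train_size, ceiling, rate):
    """Assumed saturating power-law curve for one subgroup's metric."""
    return ceiling * (1.0 - train_size ** (-rate))


def subgroup_weights(eval_size):
    """Assumed composition rule: larger evaluation sets contain a larger share of group B."""
    share_b = min(0.5, eval_size / 20_000)
    return {"A": 1.0 - share_b, "B": share_b}


def aggregate_score(train_size, eval_size):
    """Population-weighted metric as a function of training size AND evaluation size."""
    weights = subgroup_weights(eval_size)
    return sum(
        weights[name] * subgroup_score(train_size, **params)
        for name, params in GROUPS.items()
    )


if __name__ == "__main__":
    for eval_size in (1_000, 10_000):
        for train_size in (1e3, 1e6):
            score = aggregate_score(train_size, eval_size)
            print(f"eval={eval_size:>6} train={int(train_size):>8} score={score:.3f}")
```

Under these assumptions, the aggregate score rises with training size for a fixed evaluation set, but at a fixed training size it falls as the evaluation set grows and includes more of group B. A curve fitted on the small evaluation set would therefore overstate performance for the larger, more varied population.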

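Experimental Results

Research Questions

RQ1: Do evaluation metrics faithfully reflect the latent performance construct across diverse populations?
RQ2: How does increasing the size of the evaluation dataset affect subgroup composition and the validity of scaling laws?
RQ3: Can a single universal metric adequately capture model quality for every community affected by large-scale AI systems?
RQ4: Should scaling-law analyses include an evaluation data size axis to account for sociotechnical change over time?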

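Key Findings

Evaluation metrics are unstable proxies for the latent construct μ* and may be inconsistent across subgroups.
As evaluation data grows, the number of subgroups it represents tends to increase, which makes the metric harder to interpret.
Different communities may hold incompatible notions of "good" performance, so metrics diverge from the outcomes users actually value.
Metrics can be nonstationary, staged across tasks, and strongly shaped by sociotechnical context, which weakens any claim to a universal scaling law.
When models are deployed to a globally diverse user base, larger training datasets may not deliver universal performance gains.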
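
The finding about incompatible notions of "good" performance can be made concrete with a small sketch. The example is hypothetical: the community names, the model names, and the scores are invented for illustration and do not come from the paper. It shows a pooled average selecting one model even though one community, scored by its own metric, prefers the other.

```python
# Hypothetical scores: each community rates the same two models with its own metric.
SCORES = {
    "community_1": {"model_small": 0.70, "model_large": 0.85},
    "community_2": {"model_small": 0.80, "model_large": 0.70},
}


def preferred_model(scores):
    """Return the model with the highest score under one community's metric."""
    return max(scores, key=scores.get)


def pooled_mean(scores_by_community, model):
    """Average one model's score over all communities, as a single pooled metric would."""
    values = [scores[model] for scores in scores_by_community.values()]
    return sum(values) / len(values)


def communities_agree(scores_by_community):
    """True only if every community prefers the same model."""
    choices = {preferred_model(scores) for scores in scores_by_community.values()}
    return len(choices) == 1


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(name, "prefers", preferred_model(scores))
    for model in ("model_small", "model_large"):
        print(model, "pooled mean:", round(pooled_mean(SCORES, model), 3))
    print("communities agree:", communities_agree(SCORES))
```

In this made-up case the pooled metric favors the larger model, which hides the fact that the second community is better served by the smaller one.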


This review was generated by AI and checked by a human editor.