QUICK REVIEW

[Paper Review] Scaling Laws Do Not Scale

Fernando Díaz, Michael Madaio|arXiv (Cornell University)|Jul 5, 2023

Opinion Dynamics and Social Influence|Cited by 12

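One-line summary

Scaling laws, which relate model performance to dataset size or parameter count, are fragile when evaluated across different human populations because of brittle evaluation metrics, divergence between subpopulations, and sociotechnical dynamics. The paper proposes extending scaling laws with evaluation data size as a third axis to capture shifts in population composition, and warns that larger datasets do not necessarily improve performance for every community.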

ABSTRACT

Recent work has advocated for training AI models on ever-larger datasets, arguing that as the size of a dataset increases, the performance of a model trained on that dataset will correspondingly increase (referred to as "scaling laws"). In this paper, we draw on literature from the social sciences and machine learning to critically interrogate these claims. We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output. As the size of datasets used to train large AI models grows and AI systems impact ever larger groups of people, the number of distinct communities represented in training or evaluation datasets grows. It is thus even more likely that communities represented in datasets may have values or preferences not reflected in (or at odds with) the metrics used to evaluate model performance in scaling laws. Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations -- threatening the validity of claims that model performance is improving at scale. We end the paper with implications for AI development: that the motivation for scraping ever-larger datasets may be based on fundamentally flawed assumptions about model performance. That is, models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models. We suggest opportunities for the field to rethink norms and values in AI development, resisting claims for universality of large models, fostering more local, small-scale designs, and other ways to resist the impetus towards scale in AI.

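Motivation and objectives

Asks whether scaling laws reliably predict performance across diverse communities as datasets grow.
Emphasizes that evaluation metrics are proxies for latent constructs, and that they can be contested across groups and unstable.
Argues that increasing the size of the evaluation set changes its composition, introducing subpopulations with different metric preferences.
Proposes adding evaluation data size as a third axis of scaling laws to reflect changing population composition.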

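Proposed approach

A review of evaluation-metric theory and measurement modeling, used to define μ* and μ.
An analysis of how scaling laws use training data size to estimate performance μ*(U, π(D)) via the proxy metric μ(U, π(D)).
A discussion of metric incompatibility, nonstationarity, staging, subtasks, and metric power in the context of scaling laws.
The argument that larger evaluation datasets increase subpopulation diversity and can break the validity of a universal metric.
A proposal to add evaluation data size as a third axis in scaling-law analysis to capture changes in population composition.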
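
The third-axis idea can be sketched with a toy model. The snippet below is an illustration only and is not taken from the paper: the power-law form, the two groups, the `min_eval_size` threshold at which a group enters the evaluation set, and every number are assumptions made up for the example. It shows how the measured score can depend on evaluation size as well as training size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    name: str
    min_eval_size: int  # assumed: group is sampled once the eval set is this large
    share: float        # assumed relative weight once the group is present
    ceiling: float      # score the group approaches as training data grows
    gap: float          # power-law coefficient; negative means the score declines
    alpha: float        # power-law exponent


def group_score(g: Group, n_train: int) -> float:
    """Per-group score under an assumed power law: ceiling - gap * n_train**(-alpha)."""
    return g.ceiling - g.gap * n_train ** (-g.alpha)


def measured_score(groups, n_train: int, n_eval: int) -> float:
    """Share-weighted mean score over the groups present at evaluation size n_eval."""
    present = [g for g in groups if g.min_eval_size <= n_eval]
    total_share = sum(g.share for g in present)
    return sum(g.share / total_share * group_score(g, n_train) for g in present)


# Hypothetical populations: all values are invented for illustration.
GROUPS = [
    Group("majority", 0, 0.8, 0.9, 0.5, 0.3),
    Group("minority", 10_000, 0.2, 0.4, -0.4, 0.3),
]

for n_eval in (1_000, 100_000):
    for n_train in (10**3, 10**6):
        print(n_eval, n_train, round(measured_score(GROUPS, n_train, n_eval), 3))
```

In this toy setup the aggregate score rises with training size at both evaluation sizes, yet the curve sits lower once the larger evaluation set brings in the second group, whose own score falls as training data grows. A two-axis plot of score against training size would hide both effects.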

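Experimental results

Research questions

RQ1: Do evaluation metrics faithfully reflect the latent performance construct across diverse populations?
RQ2: How does increasing evaluation dataset size affect subpopulation composition and the validity of scaling laws?
RQ3: Can a single universal metric adequately capture all communities affected by large-scale AI systems?
RQ4: Should scaling-law analyses include an evaluation data size axis to account for sociotechnical change over time?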

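Key findings

Evaluation metrics are unstable proxies for the latent construct μ* and may not agree consistently across subpopulations.
As evaluation data size grows, the number of subpopulations represented tends to grow as well, which makes the metric harder to interpret.
Different communities may hold incompatible notions of "good" performance, so metrics can diverge from one another and from the outcomes users actually value.
Metrics can be nonstationary and staged across tasks, and are strongly shaped by sociotechnical context, which undermines universal scaling laws.
Large training datasets may not deliver universal performance gains when models are deployed to a globally diverse user base.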
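
One way to make the proxy-versus-construct finding concrete is to check how often a proxy metric orders models the same way as each community's own judgment. The sketch below is a hypothetical illustration, not the paper's method: `pairwise_agreement`, the community names, and all scores are invented, and the latent judgments μ* are treated as if they were observable, which in practice they are not.

```python
from itertools import combinations


def pairwise_agreement(proxy, latent):
    """Fraction of model pairs that proxy and latent scores order the same way."""
    pairs = list(combinations(range(len(proxy)), 2))
    same = sum((proxy[i] - proxy[j]) * (latent[i] - latent[j]) > 0 for i, j in pairs)
    return same / len(pairs)


# Hypothetical proxy metric for four models trained on increasingly large datasets.
PROXY = [0.60, 0.70, 0.80, 0.90]

# Hypothetical per-community judgments of the same four models (stand-ins for mu*).
LATENT = {
    "community_a": [0.55, 0.68, 0.79, 0.91],  # tracks the proxy
    "community_b": [0.70, 0.66, 0.61, 0.58],  # moves against the proxy
}

for name, scores in LATENT.items():
    print(name, pairwise_agreement(PROXY, scores))
```

With these made-up values the proxy agrees with one community on every model pair and with the other on none, so a scaling curve drawn from the proxy alone would describe only the first community's experience.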

This review was generated by AI and checked by a human editor.