QUICK REVIEW

[Paper Review] Unified Scaling Laws for Routed Language Models

Aidan Clark, Diego de las Casas|arXiv (Cornell University)|Feb 2, 2022

Topic Modeling|Citations: 23

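One-Line Summary

The paper derives scaling laws for Routing Networks, architectures that decouple parameter count from computational cost. It analyzes three routing techniques and introduces an Effective Parameter Count for comparing routed and dense models. It shows that routing improves performance across model sizes and offers guidance on when and how to use routing.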

ABSTRACT

The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.

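Motivation and Objectives

Investigate how routing architectures scale in language models across many orders of magnitude.
Characterize how performance scales with the number of experts and with dense model size.
Generalize the scaling laws to different routing techniques and to compute-related variables.
Introduce the concept of an Effective Parameter Count to unify dense and routed models.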

Proposed Method

Study three routing techniques: Sinkhorn-BASE sparse MoE (s-base), input-based deterministic hash routing (hash), and routing via reinforcement learning (rl-r).
Propose scaling laws in which the logarithm of the loss is bilinear in the logarithm of model size N and the logarithm of a saturating function of the expert count E.
Generalize the laws from (N, E) to inference compute F and parameter count P by re-expressing them in the variables F and B = P/F.
Fit the proposed laws to empirical data from models up to 200B parameters across multiple E values.
Introduce an E-saturation transformation to bound scaling in E and enable cross-architecture comparison.
Demonstrate an Effective Parameter Count that maps routed models to dense models with equivalent performance.
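
The bilinear law and the saturating transform listed above can be sketched as a small least-squares fit. This is a minimal illustration, not the authors' code. The helper names (`saturate`, `design_matrix`, `fit_bilinear`, `predict_loss`), the base-10 logarithms, the defaults `e_start=1` and `e_max=512`, and the coefficient values are all assumptions chosen for the example, and the specific saturating form should be checked against the paper. The data is synthetic: it is generated from the placeholder coefficients so that the fit can be verified by recovering them.

```python
import numpy as np

def saturate(E, e_start=1.0, e_max=512.0):
    """Assumed saturating transform of the expert count E.

    Equals e_start at E = 1 and approaches e_max as E grows.
    The exact form and constants are illustrative assumptions.
    """
    E = np.asarray(E, dtype=float)
    offset = 1.0 / (1.0 / e_start - 1.0 / e_max)
    return 1.0 / (1.0 / (E - 1.0 + offset) + 1.0 / e_max)

def design_matrix(N, E, **kw):
    """Columns: log N, log E_hat, log N * log E_hat, and a constant."""
    log_n = np.log10(np.asarray(N, dtype=float))
    log_e = np.log10(saturate(E, **kw))
    return np.column_stack([log_n, log_e, log_n * log_e, np.ones_like(log_n)])

def fit_bilinear(N, E, loss, **kw):
    """Least-squares fit of log L = a log N + b log E_hat + c log N log E_hat + d."""
    X = design_matrix(N, E, **kw)
    coeffs, *_ = np.linalg.lstsq(X, np.log10(loss), rcond=None)
    return coeffs  # array [a, b, c, d]

def predict_loss(coeffs, N, E, **kw):
    """Loss predicted by the bilinear law for the given coefficients."""
    return 10.0 ** (design_matrix(N, E, **kw) @ coeffs)

# Placeholder coefficients (NOT the paper's fitted values).
true_coeffs = np.array([-0.08, -0.11, 0.01, 1.0])

# Synthetic grid of dense sizes and expert counts.
Ns, Es = np.meshgrid([1.5e7, 1.3e8, 1.3e9], [1, 4, 16, 64, 256])
Ns, Es = Ns.ravel(), Es.ravel()
losses = predict_loss(true_coeffs, Ns, Es)

# Fitting the noiseless synthetic losses recovers the placeholder coefficients.
fitted = fit_bilinear(Ns, Es, losses)
```

With real measurements, `losses` would be replaced by validation losses from trained models. The interaction coefficient `c` is what lets the benefit of adding experts change with model size.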

Experimental Results

Research Questions

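RQ1: How do routing architectures scale in language models as the number of experts and the dense parameter count are varied?
RQ2: Do different routing techniques follow a common scaling law, and how do their coefficients compare?
RQ3: Can a unified law be expressed in terms of inference compute F and parameter count P so that it generalizes across architectures?
RQ4: What is a meaningful metric for comparing routed and dense models in terms of scale (Effective Parameter Count)?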

Key Findings

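Routing improves performance for every technique tested, across all model sizes and variants considered.
RL-based routing (rl-r) is about as effective as state-of-the-art techniques, despite past concerns.
The scaling laws accurately describe routing networks, either as a bilinear form in (N, E) or in kappa-transformed variables, and they extend to the (F, B) representation.
An Effective Parameter Count (epc) maps routed models to dense models, unifying their performance under a single power law.
s-base consistently scales better than rl-r and hash, with gains shrinking as N grows.
Above a threshold N_cutoff, routing yields no further performance improvement. The threshold depends on E_max and on dataset size, and epc_max increases with token exposure.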
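
The last three findings can be illustrated by working directly from a bilinear law of the form log L = a log N + b log E_hat + c log N log E_hat + d. The sketch below is a derivation under stated assumptions, not the paper's implementation: it assumes the dense baseline corresponds to E_hat = 1, uses base-10 logarithms, and uses placeholder coefficients. The function names `effective_param_count` and `n_cutoff` are hypothetical. Setting the dense prediction a log N_bar + d equal to the routed prediction and solving for N_bar gives the effective size. The sensitivity to E_hat, b + c log N, vanishes at log N = -b/c, which gives the cutoff.

```python
import math

def effective_param_count(n, e_hat, a, b, c):
    """Dense size N_bar whose predicted loss equals that of a routed model.

    Solves a*log N_bar = a*log n + b*log e_hat + c*log n*log e_hat,
    assuming the dense baseline has e_hat = 1 (so log e_hat = 0).
    """
    log_n = math.log10(n)
    log_e = math.log10(e_hat)
    log_n_bar = log_n + (b + c * log_n) * log_e / a
    return 10.0 ** log_n_bar

def n_cutoff(b, c):
    """Model size at which the bilinear law stops rewarding more experts.

    Requires c != 0, with b and c of opposite sign for a cutoff above N = 1.
    """
    return 10.0 ** (-b / c)

# Placeholder coefficients (NOT the paper's fitted values).
a, b, c = -0.08, -0.11, 0.01

dense_equiv = effective_param_count(1e8, 64.0, a, b, c)  # larger than 1e8
cutoff = n_cutoff(b, c)                                  # about 1e11 here
at_cutoff = effective_param_count(cutoff, 64.0, a, b, c) # about equal to cutoff
```

Under these placeholder values a routed model behaves like a larger dense model below the cutoff, and the advantage shrinks to nothing as N approaches it, which matches the qualitative pattern reported in the findings.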


This review was generated by AI and checked by a human editor.