QUICK REVIEW

[Paper Review] The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Rui Wei, Rui Du|arXiv (Cornell University)|Mar 24, 2026

Topic Modeling | Citations: 0

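One-line summary

This paper introduces an early-exit adaptability score and a benchmark to assess how well modern large language models lend themselves to layer-wise early-exit decoding. It shows that early-exit gains diminish in newer models, and it analyzes the factors that affect early-exit potential across model families and workloads.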

ABSTRACT

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model's intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.

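Motivation and objectives

Assess whether modern LLMs retain an intrinsic suitability for layer-wise early-exit decoding.
Quantify how much speedup early exit can provide without sacrificing output quality.
Identify the design, training, and workload factors that affect early-exit effectiveness.
Provide a framework for estimating the theoretical maximum speedup before implementing an early-exit method.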

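Proposed method

Define an early-exit adaptability score (EAS) that combines the skip ratio with layer-to-final-layer similarity.
Propose an oracle benchmark that evaluates early exit under optimal exit decisions to estimate an upper bound on the achievable speedup.
Compute the layer-to-final similarity at each exit layer from hidden states, logits, and top-K token overlap.
Evaluate a diverse set of open-weight LLMs spanning architectures (Dense, MoE, SSM) and model generations.
Analyze how model scale, architecture, training, and workload affect early-exit potential.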
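
The layer-to-final similarity signals named in the list above (hidden states, logits, top-K token overlap) can be illustrated with a small sketch. This review does not give the paper's exact formulas, so the cosine measure, the overlap definition, and the helper names `cosine_similarity`, `top_k_overlap`, and `layer_to_final_similarity` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (e.g. hidden states or logits)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Treat a zero vector as having no similarity (an assumption of this sketch).
    return float(a @ b / denom) if denom > 0 else 0.0


def top_k_overlap(logits_a, logits_b, k=5):
    """Fraction of the top-k token ids shared by two logit vectors."""
    top_a = set(np.argsort(logits_a)[-k:].tolist())
    top_b = set(np.argsort(logits_b)[-k:].tolist())
    return len(top_a & top_b) / k


def layer_to_final_similarity(per_layer_logits, k=5):
    """Top-k overlap of every layer's logits with the final layer's logits."""
    final = per_layer_logits[-1]
    return [top_k_overlap(layer, final, k) for layer in per_layer_logits]


# Toy example: two "layers" over a 4-token vocabulary (made-up numbers).
early = [0.1, 0.9, 0.5, 0.2]
final = [0.9, 0.8, 0.1, 0.2]
similarities = layer_to_final_similarity([early, final], k=2)
```

In this toy case the early layer shares one of its two top tokens with the final layer. A layer whose similarity to the final layer is already high is a plausible exit point, which is the intuition behind combining similarity with a skip ratio.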

Figure 1: Layer-wise early-exit decoding in LLMs.

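Experimental results

Research questions

RQ1: Are modern decoder-only LLMs inherently suited to layer-wise early exit, and can inter-layer similarity predict end-to-end accuracy under early exit?
RQ2: Which factors, such as scale, architecture, training, and workload, affect a model's ability to support early exit?
RQ3: What is the maximum speedup obtainable from early exit with current models and workloads?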
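
To make the maximum-speedup question concrete, here is a minimal sketch of an oracle-style upper bound. It rests on assumptions not stated in this review: the oracle exits at the earliest layer whose top-1 token matches the final layer's, every layer costs the same, and exit-check overhead is ignored. The function names `oracle_exit_layer`, `skip_ratio`, and `ideal_speedup` are hypothetical.

```python
def oracle_exit_layer(per_layer_top1):
    """Earliest 0-based layer whose top-1 token equals the final layer's token."""
    final_token = per_layer_top1[-1]
    for layer, token in enumerate(per_layer_top1):
        if token == final_token:
            return layer
    return len(per_layer_top1) - 1  # not reached: the last layer always matches


def skip_ratio(exit_layers, num_layers):
    """Fraction of layer evaluations skipped across all generated tokens."""
    skipped = sum(num_layers - 1 - e for e in exit_layers)
    return skipped / (len(exit_layers) * num_layers)


def ideal_speedup(ratio):
    """Speedup if compute scales with executed layers (assumes equal layer cost)."""
    return 1.0 / (1.0 - ratio)


# Toy example: top-1 token id at each of 4 layers, for 3 generated tokens.
predictions = [
    [7, 3, 3, 3],  # matches the final token from layer 1: 2 layers skipped
    [5, 5, 9, 9],  # matches from layer 2: 1 layer skipped
    [1, 2, 3, 4],  # matches only at the last layer: nothing skipped
]
exits = [oracle_exit_layer(p) for p in predictions]
ratio = skip_ratio(exits, num_layers=4)
speedup = ideal_speedup(ratio)
```

Here 3 of 12 layer evaluations are skipped, so the ratio is 0.25 and the idealized speedup is about 1.33x. Real systems would fall below such a bound because exit decisions, KV-cache handling for skipped layers, and the output projection at the exit layer all add cost.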

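Key findings

Early-exit effectiveness tends to decline across newer model generations, suggesting that modern LLMs have less layer redundancy.
Dense transformers generally have higher early-exit potential than Mixture-of-Experts and State Space Models.
Larger models (especially those with more than 20B parameters) tend to have higher early-exit potential.
Continued pretraining and post-training tuning tend to reduce early-exit suitability.
Early-exit patterns are model-specific, and the influence of the workload is weak.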

Figure 2: The trend of relative early-exit scores (§3.3) in recent LLMs and models specifically tuned for early-exit, compared to Llama2-7B. We explain the model selection details in Appendix B.

This review was generated by AI and checked by a human editor.