QUICK REVIEW

[Paper Review] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Christopher R. Snell, Jae-Hoon Lee | arXiv (Cornell University) | Aug 6, 2024

Magnetic confinement fusion research | Citations: 17

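One-Line Summary

This paper analyzes how to allocate test-time compute optimally for LLMs. It shows that a compute-optimal strategy outperforms a best-of-N baseline and that, in a FLOPs-matched setting, effective use of test-time compute can beat a much larger model.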

ABSTRACT

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.

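Motivation and Objectives

Motivate the use of additional test-time compute to improve LLM outputs on difficult prompts.
Unify proposal-distribution refinement and verifier-based search as mechanisms for test-time compute.
Introduce a compute-optimal scaling strategy that allocates compute adaptively per prompt.
Evaluate how test-time compute compares with pretraining scale under FLOPs-matched conditions.
Show whether a smaller model equipped with test-time compute can outperform a larger model without additional pretraining.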

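Proposed Method

A model-agnostic formalism that treats test-time compute as an adjustment of the output distribution conditioned on the prompt.
Compare two main mechanisms: (i) revising the proposal distribution through sequential or parallel generation, and (ii) searching against a process-based verifier (process reward model, PRM).
Train the PRM without human labels, using per-step correctness estimates obtained from Monte Carlo rollouts of the base model.
Evaluate three search methods against the PRM: best-of-N weighted, beam search, and lookahead search.
Define a compute-optimal strategy that selects the hyperparameters that maximize accuracy on a given prompt under a fixed compute budget.
Classify prompts into five difficulty levels, using either model-predicted or oracle difficulty, to guide per-difficulty compute allocation.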
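Three of the items above (best-of-N weighted selection, five-level difficulty binning, and per-level strategy selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the input formats, the convention that bin 0 is the easiest level, and the use of held-out accuracy tables to pick a strategy are all assumptions made for illustration.

```python
from collections import defaultdict


def best_of_n_weighted(samples):
    """Pick a final answer from N sampled solutions.

    samples: list of (final_answer, verifier_score) pairs (hypothetical format).
    Scores of samples that share the same final answer are summed, and the
    answer with the largest total is returned.
    """
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)


def difficulty_bins(pass_rates, n_bins=5):
    """Assign each prompt to a quantile bin by estimated pass rate.

    Assumption: bin 0 holds the prompts with the highest pass rate (easiest),
    bin n_bins - 1 the lowest (hardest).
    """
    order = sorted(range(len(pass_rates)), key=lambda i: -pass_rates[i])
    bins = [0] * len(pass_rates)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(pass_rates)
    return bins


def compute_optimal_strategy(val_accuracy):
    """Choose the best strategy per difficulty bin at one fixed budget.

    val_accuracy: dict mapping (bin_id, strategy_name) -> accuracy measured
    on held-out prompts (hypothetical numbers supplied by the caller).
    Returns a dict mapping bin_id -> name of the most accurate strategy.
    """
    best = {}
    for (bin_id, strategy), acc in val_accuracy.items():
        if bin_id not in best or acc > best[bin_id][1]:
            best[bin_id] = (strategy, acc)
    return {bin_id: strategy for bin_id, (strategy, _) in best.items()}
```

In this sketch a single high-scoring sample can lose to an answer that several lower-scoring samples agree on, which is the difference between weighted selection and simply taking the top-scored sample. At test time, a prompt's estimated bin would be looked up in the table returned by `compute_optimal_strategy` to decide which strategy to run.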

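Experimental Results

Research Questions

RQ1: Can test-time compute be allocated optimally per prompt to maximize accuracy within a budget?
RQ2: How do different test-time strategies (revisions vs. PRM-based search) scale with prompt difficulty and compute budget?
RQ3: Does compute-optimal test-time compute outperform a best-of-N baseline, and by how much?
RQ4: In a FLOPs-matched setting, can a smaller model with test-time compute outperform a much larger model?
RQ5: What are the practical benefits and limitations of applying difficulty-conditioned compute allocation to test-time strategies?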

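Key Findings

Compute-optimal scaling can outperform best-of-N across both revisions and PRM search while using roughly 4x less test-time compute.
PRM-based search shows difficulty-dependent effectiveness: beam search works well on harder prompts and at lower budgets, whereas best-of-N does better on easy prompts when the budget is large.
On easy to medium-difficulty prompts, test-time compute can, under certain conditions, outperform a 14x larger model in a FLOPs-matched setting.
Revision-based proposals improve with longer revision chains, indicating that the model learns from its mistakes in context.
Difficulty estimation enables adaptive allocation that approaches or matches the optimal strategy across prompt types.
Search methods show diminishing marginal returns as the budget grows, owing to overfitting to the verifier's signal.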
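The FLOPs-matched comparison can be made concrete with a short worked derivation. It assumes the common approximations of $6ND_{\text{pretrain}}$ FLOPs for pretraining and $2ND_{\text{inference}}$ FLOPs for inference, for a model with $N$ parameters trained on $D_{\text{pretrain}}$ tokens and serving $D_{\text{inference}}$ tokens. These approximations, and the symbols $M$ (parameter multiplier of the larger model) and $k$ (inference-compute multiplier granted to the smaller model), are introduced here for illustration and do not appear in this review.

```latex
\begin{align}
6 N D_{\text{pretrain}} + 2 N D_{\text{inference}}\, k
  &= 6 (M N) D_{\text{pretrain}} + 2 (M N) D_{\text{inference}} \\
2 N D_{\text{inference}}\, k
  &= 6 N D_{\text{pretrain}} (M - 1) + 2 M N D_{\text{inference}} \\
k &= M + 3\, \frac{D_{\text{pretrain}}}{D_{\text{inference}}}\, (M - 1)
\end{align}
```

Under these assumptions, $M = 14$ gives $k = 14 + 39\, D_{\text{pretrain}} / D_{\text{inference}}$. The extra test-time budget available to the smaller model therefore grows with the ratio of pretraining tokens to inference tokens, so the comparison depends on how much inference load is expected.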

This review was generated by AI and checked by a human editor.