QUICK REVIEW

[Paper Review] Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure

Shuhui Qu|arXiv (Cornell University)|Feb 3, 2026

Natural Language Processing Techniques | Citations: 0

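One-Line Summary

Summary: This paper presents a state-level selective verification framework that gates, scores, and adaptively allocates verifier calls at intermediate steps to improve accuracy under a fixed verification budget, and it outperforms solution-level baselines on MATH.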

ABSTRACT

Test-time computation has become a primary driver of progress in large language model (LLM) reasoning, but it is increasingly bottlenecked by expensive verification. In many reasoning systems, a large fraction of verifier calls are spent on redundant or unpromising intermediate hypotheses. We study reasoning under a \emph{verification-cost-limited} setting and ask how verification effort should be allocated across intermediate states. We propose a state-level selective verification framework that combines (i) deterministic feasibility gating over a structured move interface, (ii) pre-verification ranking using a hybrid of learned state-distance and residual scoring, and (iii) adaptive allocation of verifier calls based on local uncertainty. Unlike solution-level best-of-$N$ or uniform intermediate verification, our method distributes verification where it is most informative. On the \textsc{MATH} benchmark, our approach achieves higher accuracy than best-of-$N$, majority voting, and beam search while using 44\% fewer verifier calls.

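Motivation and Objectives

Motivate reasoning in a verification-cost-limited setting, where verifier calls are the dominant cost.
Develop a three-stage gated competition pipeline that filters, scores, and allocates verification at intermediate states.
Train a lightweight residual scorer from verifier-labeled candidate lists to prioritize verification.
Demonstrate an improved accuracy-cost trade-off on MATH relative to Best-of-N, majority voting, and beam search.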
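
To make the scorer-training objective concrete, here is a minimal sketch of a within-state ranking loss over a verifier-labeled candidate list. The pairwise logistic form and the function names are assumptions for illustration; the review does not specify the exact loss the paper uses.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def within_state_ranking_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean pairwise logistic loss for the candidates of ONE state.

    scores: scorer outputs for each candidate move (higher = more promising).
    labels: verifier labels, 1 = accepted, 0 = rejected.
    Pairs are formed only inside the state, so the loss teaches relative
    ordering among sibling moves (assumed form, not the paper's definition).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # no informative pair in this state
    total = sum(softplus(sn - sp) for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))


# An accepted move scored above a rejected one gives a small loss;
# the reverse ordering gives a large one.
good_order = within_state_ranking_loss([2.0, -1.0], [1, 0])
bad_order = within_state_ranking_loss([-1.0, 2.0], [1, 0])
```

Restricting pairs to a single state keeps the scorer focused on the decision the pipeline actually makes: which sibling moves deserve a verifier call.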

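Proposed Method

Introduce a deterministic feasibility gate over a structured move interface that eliminates invalid moves without any verifier call.
Develop a hybrid pre-verification scorer that combines a learned structural distance with a residual trained from verifier labels.
Implement state-conditioned verification allocation, using a local uncertainty proxy to decide how many moves to verify at each state.
Train the residual scorer with a within-state ranking loss over verifier-labeled candidate lists, optionally incorporating a trajectory-based cost-to-go signal.
Evaluate on MATH under a fixed verifier-call budget, comparing against Best-of-N, majority voting, and beam search.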
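
The three stages (gate, score, allocate) can be sketched for a single state as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the `Move` fields, the linear score combination, the softmax-entropy uncertainty proxy, and the `k_min`/`k_max` bounds are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Move:
    name: str
    feasible: bool    # outcome of a cheap deterministic check (assumed precomputed)
    distance: float   # learned state-distance estimate, lower is better (hypothetical)
    residual: float   # learned residual score, higher is better (hypothetical)


def gate(moves: Sequence[Move]) -> List[Move]:
    """Stage 1: discard infeasible moves without calling the verifier."""
    return [m for m in moves if m.feasible]


def hybrid_score(m: Move, alpha: float = 1.0) -> float:
    """Stage 2: combine distance and residual (linear form is an assumption)."""
    return -m.distance + alpha * m.residual


def normalized_entropy(scores: Sequence[float]) -> float:
    """Entropy of softmax(scores) scaled to [0, 1]; used as an uncertainty proxy."""
    if len(scores) < 2:
        return 0.0
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(scores))


def allocate_k(scores: Sequence[float], k_min: int = 1, k_max: int = 4) -> int:
    """Stage 3: verify more moves when the scores do not separate them."""
    if not scores:
        return 0
    k = k_min + round(normalized_entropy(scores) * (k_max - k_min))
    return min(k, len(scores))


def select_and_verify(
    moves: Sequence[Move], verifier: Callable[[Move], bool], budget: int
) -> Tuple[List[Move], int]:
    """Run all three stages for one state; return accepted moves and calls used."""
    ranked = sorted(gate(moves), key=hybrid_score, reverse=True)
    k = min(allocate_k([hybrid_score(m) for m in ranked]), budget)
    accepted = [m for m in ranked[:k] if verifier(m)]
    return accepted, k
```

With one clearly dominant move the proxy is near zero and only a single verifier call is spent; when the scores tie, the allocation rises toward `k_max`, capped by the remaining budget.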

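Experimental Results

Research Questions

RQ1: Can allocating verifier calls at intermediate states improve accuracy under a fixed verification budget compared with solution-level strategies?
RQ2: Do deterministic feasibility gating and allocation based on state-local uncertainty yield a better accuracy-efficiency trade-off on multi-step symbolic reasoning tasks?
RQ3: How effective is a learned pre-verification residual scorer at ranking feasible moves for top-k verification under budget constraints?
RQ4: How much do backbone model strength and the proposed allocation strategy each affect the accuracy-cost frontier?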
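
One simple way to quantify RQ3 is a hit-rate at k: the fraction of states in which at least one verifier-accepted move appears among the scorer's top-k candidates. The metric and the function name below are assumptions for illustration; the review does not say which ranking metric the paper reports.

```python
from typing import Sequence


def hit_at_k(labels_per_state: Sequence[Sequence[int]], k: int) -> float:
    """Fraction of states with an accepted move in the top-k.

    labels_per_state: for each state, verifier labels (1 = accepted,
    0 = rejected) listed in the scorer's ranked order, best first.
    States with no candidates are skipped.
    """
    states = [labels for labels in labels_per_state if labels]
    if not states:
        return 0.0
    hits = sum(1 for labels in states if any(labels[:k]))
    return hits / len(states)


# Three states: an accepted move is ranked 2nd, 1st, and never, respectively.
ranked_labels = [[0, 1, 0], [1, 0], [0, 0, 0]]
top1 = hit_at_k(ranked_labels, k=1)
top2 = hit_at_k(ranked_labels, k=2)
```

A scorer with a high hit-rate at small k lets the allocation stage spend few verifier calls per state without missing the useful move.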

Key Findings

Method	Verifier calls ↓	Acc (%) ↑
0-shot CoT	-	30.6
Best-of-N (N=64)	64	42.4
Majority Vote (N=64)	64	44.6
Beam Search (b=4, N=64)	64	51.8
Ours (gates + hybrid + state-k)	44.8	55.2

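On MATH, the method reaches 55.2% accuracy with 44.8 verifier calls, outperforming Best-of-N (42.4%), majority voting (44.6%), and beam search (51.8%) at the same nominal budget.
Gating alone reduces verifier calls while improving accuracy. Adding D_type scoring improves accuracy further while using fewer calls.
Adaptive state-conditioned verification allocation yields the largest gain, reaching 55.2% accuracy with 44.8 verifier calls, which shows the value of allocating budget according to local uncertainty.
Without exceeding the budget, intermediate-state allocation consistently gives higher accuracy than solution-level baselines, and scaling the backbone improves performance further.
Upgrading the backbone (e.g., Llama 3.2 3B) improves accuracy at every budget, partially complements the allocation mechanism, and approaches the larger-model baseline at high budgets.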
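
The headline numbers can be checked directly from the table. The short sketch below recomputes the relative call saving and the accuracy gaps from the listed rows; the helper names are hypothetical, and the data is copied from the table as shown.

```python
from typing import Dict, Optional, Tuple

# (verifier calls, accuracy %) per method, copied from the results table.
results: Dict[str, Tuple[Optional[float], float]] = {
    "0-shot CoT": (None, 30.6),  # no verifier calls reported
    "Best-of-N (N=64)": (64.0, 42.4),
    "Majority Vote (N=64)": (64.0, 44.6),
    "Beam Search (b=4, N=64)": (64.0, 51.8),
    "Ours (gates + hybrid + state-k)": (44.8, 55.2),
}


def relative_call_saving(ours_calls: float, baseline_calls: float) -> float:
    """Fraction of verifier calls saved relative to a baseline."""
    return 1.0 - ours_calls / baseline_calls


def accuracy_gaps(table: Dict[str, Tuple[Optional[float], float]], ours: str) -> Dict[str, float]:
    """Accuracy difference (percentage points) between `ours` and every other row."""
    ours_acc = table[ours][1]
    return {name: ours_acc - acc for name, (_, acc) in table.items() if name != ours}


ours_key = "Ours (gates + hybrid + state-k)"
saving = relative_call_saving(results[ours_key][0], 64.0)
gaps = accuracy_gaps(results, ours_key)

print(f"call saving vs 64-call baselines: {saving:.1%}")
for name, gap in gaps.items():
    print(f"{name}: +{gap:.1f} points")
```

From these rows, 44.8 calls against 64 is a saving of about 30%, with gains of 12.8, 10.6, and 3.4 points over Best-of-N, majority voting, and beam search. That table-derived 30% differs from the 44% quoted in the abstract, which may be measured against a reference point not shown in this table.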

Figure 2: Budget-matched comparison across inference strategies. Accuracy on MATH-500 versus number of generations per problem $N$ (x-axis). We report majority voting, solution-level Best-of-$N$ (weighted), beam search ($b{=}4$), and our intermediate-state allocation method.


This review was generated by AI and checked by a human editor.