QUICK REVIEW

[Paper Review] interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors

Vishak K Bhat, Prateek Chanda|arXiv (Cornell University)|Feb 5, 2026

Formal Methods in Verification | Citations: 0

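One-Sentence Summary

This paper proposes interwhen, a test-time verification framework that guides reasoning traces using verifiable states elicited through meta-prompting. It supports steering through either self-verification or external verification, improving accuracy and efficiency without compromising correctness.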

ABSTRACT

Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification increasingly important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories at substantially higher compute cost. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate verifiable properties. Our method addresses two key challenges. First, extracting intermediate solutions from a reasoning trace typically requires prompt engineering or external task decomposition into fixed steps, which can constrain the model's reasoning strategy. Instead, we periodically poll the reasoning trace and fork inference to recover intermediate solutions without imposing any predefined structure. Second, frequent verifier calls can increase latency; we address this by running verifiers asynchronously and interrupting the main trajectory only when an error is detected, leaving generation unaffected otherwise. This design improves both reliability and efficiency, and naturally supports early stopping based on consistency over recent intermediate solutions. Across benchmarks in code generation and arithmetic, logical and spatial reasoning, interwhen improves accuracy by up to 15 percentage points over standard chain-of-thought execution while staying within 1.5x of token compute cost. Moreover, on every dataset, interwhen achieves a Pareto-optimal operating point between accuracy and efficiency compared to existing test-time verification methods. Code is available at https://github.com/microsoft/interwhen.
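
The asynchronous monitoring described in the abstract can be illustrated with a toy sketch. The code below is not taken from the interwhen repository: `slow_verify`, `generate_with_monitor`, the integer "states", and the sleep timings are all hypothetical stand-ins chosen for illustration. It shows only the control pattern, in which verifier calls run as background tasks and generation is interrupted only once one of them reports an error.

```python
import asyncio
from typing import List, Tuple


async def slow_verify(state: int) -> bool:
    """Hypothetical verifier: slow to answer, and rejects state 3."""
    await asyncio.sleep(0.01)  # stands in for verifier latency
    return state != 3


async def generate_with_monitor(n_steps: int = 8) -> Tuple[List[int], List[int]]:
    """Emit states one by one while verifiers run in the background."""
    error = asyncio.Event()  # set by any verifier task that finds an error
    failed: List[int] = []

    async def check(state: int) -> None:
        if not await slow_verify(state):
            failed.append(state)
            error.set()

    emitted: List[int] = []
    tasks = []
    for step in range(n_steps):
        if error.is_set():
            break  # interrupt generation only after an error was reported
        emitted.append(step)
        tasks.append(asyncio.create_task(check(step)))  # do not wait for the result
        await asyncio.sleep(0.004)  # stands in for generating the next chunk

    await asyncio.gather(*tasks)  # let outstanding verifier calls finish
    return emitted, failed


emitted, failed = asyncio.run(generate_with_monitor())
print("emitted:", emitted, "failed:", failed)
```

Because verification lags behind generation, a few states after the erroneous one may already have been emitted when the interruption happens. In this sketch the loop simply stops; a real system would presumably feed the verifier's finding back into the trajectory.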

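Motivation and Objectives

Motivates the importance of verifying language model outputs beyond the final answer in high-stakes domains such as law, finance, and real-world settings.
Proposes a general framework that verifies and steers partial reasoning traces at test time, without decomposing the problem-solving process outside the model.
Introduces meta-prompting to extract verifiable intermediate states, enabling verification by the model itself or by an external verifier.
Shows that intervening on partial traces can improve efficiency (early stopping) or accuracy (test-time scaling) across multiple datasets.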

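Proposed Method

Defines verifiable states as intermediate reasoning states and uses meta-prompting to structure the output so that these states are easy to detect and extract.
Provides three core operations for implementing any verifier-based or self-verification strategy: extract_state, verify, and intervene.
Presents a sequential verifier algorithm that maintains a single trace and appends verifier feedback inline whenever a state fails verification, enabling adaptive correction.
Case studies: internal verification for early stopping (k-Stable Answer), and external verification for test-time scaling using structured prompts and verifiers.
Shows that soundness is guaranteed by design when an external verifier is used.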
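
The three operations and the single-trace loop listed above can be sketched roughly as follows. Only the operation names `extract_state`, `verify`, and `intervene` come from the review; their signatures, the `STATE:` tag convention, the `Verdict` type, and the toy model and verifier are assumptions made for illustration and may differ from the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def extract_state(trace: str) -> Optional[str]:
    """Assumed convention: the latest line tagged 'STATE:' is the current state."""
    for line in reversed(trace.splitlines()):
        if line.startswith("STATE:"):
            return line[len("STATE:"):].strip()
    return None


def intervene(trace: str, feedback: str) -> str:
    """Append verifier feedback inline so the same trace keeps going."""
    return trace + f"\n[verifier] {feedback}\n"


def run_sequential(
    generate_step: Callable[[str], Optional[str]],
    verify: Callable[[str], Verdict],
    max_steps: int = 20,
) -> str:
    trace, last_checked = "", None
    for _ in range(max_steps):
        chunk = generate_step(trace)
        if chunk is None:  # the model has finished
            break
        trace += chunk
        state = extract_state(trace)
        if state is None or state == last_checked:
            continue  # nothing new to verify
        last_checked = state
        verdict = verify(state)
        if not verdict.ok:
            trace = intervene(trace, verdict.feedback)
    return trace


# Toy stand-ins (hypothetical): a "model" that first errs, then corrects itself.
def toy_model(trace: str) -> Optional[str]:
    if "STATE:" not in trace:
        return "STATE: 2 + 2 = 5\n"
    if "[verifier]" in trace and "2 + 2 = 4" not in trace:
        return "STATE: 2 + 2 = 4\n"
    return None


def toy_verify(state: str) -> Verdict:
    lhs, rhs = state.split("=")
    a, b = lhs.split("+")
    ok = int(a) + int(b) == int(rhs)
    return Verdict(ok, "" if ok else f"'{state}' is arithmetically wrong; recompute.")


final_trace = run_sequential(toy_model, toy_verify)
print(final_trace)
```

In this toy run the first state fails verification, the feedback is appended to the same trace, and the next state passes. No second trajectory is ever created, which is the property the sequential verifier is described as preserving.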

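Experimental Results

Research Questions

RQ1: How can verifiable steps be identified in an LM's streamed output without externally decomposing the problem?
RQ2: How should intermediate states be verified, and how should the LM's reasoning be steered when a verifier flags a problem?
RQ3: Can intervening on partial traces improve efficiency (early stopping) and/or accuracy (test-time scaling) across tasks and domains?

Key Findings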

Dataset	Method	Accuracy %	Tokens %
Maze	EAT	88.53	100.00
Maze	DEER	88.53	99.39
Maze	interwhen (k-Stable)	88.53	67.76
Maze	baseline	88.53	100.00
SpatialMap	EAT	74.93	99.66
SpatialMap	DEER	75.00	93.58
SpatialMap	interwhen (k-Stable)	74.93	95.31
SpatialMap	baseline	74.93	100.00
GameOf24	EAT	95.01	100.00
GameOf24	DEER	95.01	96.18
GameOf24	interwhen (k-Stable)	95.45	68.35
GameOf24	baseline	95.01	100.00

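With self-verification, interwhen achieves state-of-the-art efficiency for early stopping of reasoning models without sacrificing accuracy.
With external verifiers, interwhen improves accuracy over test-time scaling baselines by up to 10 points while guaranteeing 100% soundness and being at least 4x more efficient.
On Maze, SpatialMap, and GameOf24, k-Stable (internal verification) maintains accuracy while reducing token usage: substantially on Maze and GameOf24 (67.76% and 68.35% in the table above) and modestly on SpatialMap (95.31%).
On Maze and SpatialMap, interwhen with external verifiers outperforms Tree-of-Thought-style methods in accuracy while maintaining soundness. On GameOf24, improvements are also observed at comparable token efficiency.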
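
The k-Stable early-stopping idea mentioned in the findings can be sketched as a simple streak check. This is one plausible reading, assuming that generation stops once the last k intermediate answers recovered from the trace agree; the function name `k_stable_stop`, the default `k=3`, and the rule that a missing answer resets the streak are assumptions, and the paper's exact criterion may differ.

```python
from typing import Iterable, Optional, Tuple


def k_stable_stop(
    answers: Iterable[Optional[str]], k: int = 3
) -> Tuple[Optional[str], int]:
    """Return (answer, polls_used).

    Stops at the first poll where the last k polled answers are identical.
    If no such poll exists, returns the most recent answer (or None) and the
    total number of polls consumed.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    streak_value: Optional[str] = None
    streak_len = 0
    polls = 0
    for polls, answer in enumerate(answers, start=1):
        if answer is None:
            # Assumption: a poll with no recoverable answer resets the streak.
            streak_value, streak_len = None, 0
            continue
        if answer == streak_value:
            streak_len += 1
        else:
            streak_value, streak_len = answer, 1
        if streak_len >= k:
            return streak_value, polls  # stable: stop generating here
    return streak_value, polls


# Hypothetical sequence of intermediate answers polled from one trace.
polled = [None, "12", "24", "24", "24", "24", "24"]
print(k_stable_stop(polled, k=3))  # → ('24', 5)
```

In this example the answer stabilizes at the fifth poll, so the last two polls, and the tokens generated to reach them, would be saved.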

This review was generated by AI and checked by a human editor.