QUICK REVIEW

[Paper Review] interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors

Vishak K Bhat, Prateek Chanda|arXiv (Cornell University)|Feb 5, 2026

Formal Methods in Verification | Cited by 0

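One-sentence summary

This paper proposes interwhen, a test-time verification framework that uses meta-prompting to expose verifiable states and steer the reasoning trajectory; guidance from self-verification or from external verifiers improves accuracy and efficiency while preserving correctness.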

ABSTRACT

Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification increasingly important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories at substantially higher compute cost. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate verifiable properties. Our method addresses two key challenges. First, extracting intermediate solutions from a reasoning trace typically requires prompt engineering or external task decomposition into fixed steps, which can constrain the model's reasoning strategy. Instead, we periodically poll the reasoning trace and fork inference to recover intermediate solutions without imposing any predefined structure. Second, frequent verifier calls can increase latency; we address this by running verifiers asynchronously and interrupting the main trajectory only when an error is detected, leaving generation unaffected otherwise. This design improves both reliability and efficiency, and naturally supports early stopping based on consistency over recent intermediate solutions. Across benchmarks in code generation and arithmetic, logical and spatial reasoning, interwhen improves accuracy by up to 15 percentage points over standard chain-of-thought execution while staying within 1.5x of token compute cost. Moreover, on every dataset, interwhen achieves a Pareto-optimal operating point between accuracy and efficiency compared to existing test-time verification methods. Code is available at https://github.com/microsoft/interwhen.
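
The asynchronous design described in the abstract (verifiers run in the background and the main trajectory is interrupted only when an error is found) can be illustrated with a small sketch. This is not the authors' implementation: `verify_async`, `generate_with_monitor`, the pre-split list of chunks, and the `[verifier]` feedback tag are all hypothetical stand-ins, and a real system would stream tokens from a model rather than iterate over a fixed list.

```python
import asyncio
from typing import List, Optional


async def verify_async(state: str) -> Optional[str]:
    """Hypothetical verifier: returns an error message, or None when the state passes."""
    await asyncio.sleep(0)  # stand-in for real verifier latency
    return f"rejected '{state}'" if "bad" in state else None


async def generate_with_monitor(chunks: List[str]) -> List[str]:
    """Emit chunks while verifiers run in the background; interrupt only on errors."""
    trace: List[str] = []
    pending: List[asyncio.Task] = []
    for chunk in chunks:  # each chunk stands in for one polled intermediate solution
        trace.append(chunk)
        pending.append(asyncio.create_task(verify_async(chunk)))
        await asyncio.sleep(0)  # yield control so background verifiers can progress
        unfinished = []
        for task in pending:
            if not task.done():
                unfinished.append(task)  # still checking: generation is not blocked
                continue
            error = task.result()
            if error is not None:
                trace.append(f"[verifier] {error}")  # the only point of interruption
        pending = unfinished
    # Drain verifiers that were still running when generation ended.
    for error in await asyncio.gather(*pending):
        if error is not None:
            trace.append(f"[verifier] {error}")
    return trace


trace = asyncio.run(generate_with_monitor(["step 1", "bad step 2", "step 3"]))
print(trace)
```

Because the generation loop never awaits a verifier directly, passing states add no waiting time in this sketch; feedback for a failing state simply arrives a little after the state itself was emitted.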

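Research motivation and goals

In high-stakes domains (legal, financial, physical-world), verification must cover the language model's output as a whole, not just the final answer.
Propose a general framework that verifies and steers partial reasoning trajectories at test time, without an external decomposition that separates problem solving from the model.
Introduce meta-prompting to extract verifiable intermediate states, which can then be checked by self-verification or by external verifiers.
Show across multiple datasets that intervening on partial trajectories improves either efficiency (early stopping) or accuracy (test-time scaling).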

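Proposed method

Define verifiable states as intermediate reasoning states, and use meta-prompting to structure the output so that these states can be extracted.
Three core operations (extract_state, verify, intervene) are used to implement any verifier-based or self-verification strategy.
A sequential verifier algorithm maintains a single trajectory and, when a state fails verification, appends the verifier's feedback inline, enabling adaptive error correction.
Case studies: internal verification for early stopping (k-Stable Answer) and external verification for test-time scaling, each paired with structured prompts and verifiers.
By design, correctness is guaranteed when an external verifier is used.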
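
The sequential verifier loop built from the three operations above might look roughly like the following. Only the operation names `extract_state`, `verify`, and `intervene` come from the source; their signatures, the `STATE:` tag, the `Verdict` type, and the toy model and arithmetic checker are assumptions made for illustration. In particular, the string scan in `extract_state` is a stand-in for meta-prompted extraction.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def extract_state(chunk: str) -> Optional[str]:
    """Assumed convention: a verifiable state is a line tagged 'STATE:'."""
    for line in reversed(chunk.splitlines()):
        if line.startswith("STATE:"):
            return line[len("STATE:"):].strip()
    return None


def intervene(trace: str, verdict: Verdict) -> str:
    """Append verifier feedback inline so later generation can react to it."""
    return trace + f"[verifier] {verdict.feedback}\n"


def run_with_monitor(
    generate_chunk: Callable[[str], Optional[str]],
    verify: Callable[[str], Verdict],
    max_steps: int = 8,
) -> str:
    """Keep a single trajectory; add feedback only when a state fails verification."""
    trace = ""
    for _ in range(max_steps):
        chunk = generate_chunk(trace)
        if chunk is None:  # the model has finished
            break
        trace += chunk + "\n"
        state = extract_state(chunk)
        if state is not None:
            verdict = verify(state)
            if not verdict.ok:
                trace = intervene(trace, verdict)
    return trace


def toy_model(trace: str) -> Optional[str]:
    """Stand-in for an LLM: makes one arithmetic slip, then corrects it after feedback."""
    if "[verifier]" in trace:
        return None if "STATE: 2 + 3 = 5" in trace else "STATE: 2 + 3 = 5"
    return "STATE: 2 + 3 = 6"


def check_sum(state: str) -> Verdict:
    """Toy external verifier for states of the form 'a + b = c'."""
    lhs, rhs = state.split("=")
    a, b = (int(x) for x in lhs.split("+"))
    if a + b == int(rhs):
        return Verdict(True)
    return Verdict(False, f"{lhs.strip()} should equal {a + b}, not {rhs.strip()}")


trace = run_with_monitor(toy_model, check_sum)
print(trace)
```

In this toy run the first state fails, one feedback line is appended, and the next state passes, so the trajectory is corrected in place without branching.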

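Experimental results

Research questions

RQ1: How can verifiable steps be identified in a language model's output stream without decomposing the problem externally?
RQ2: How can intermediate states be verified, and the language model's reasoning steered, when a verifier flags a problem?
RQ3: Across tasks and domains, does intervening on partial trajectories improve efficiency (early stopping) and/or accuracy (test-time scaling)?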

Key findings

Dataset	Method	Accuracy (%)	Token share (%)
Maze	EAT	88.53	100.00
Maze	DEER	88.53	99.39
Maze	interwhen (k-Stable)	88.53	67.76
Maze	baseline	88.53	100.00
SpatialMap	EAT	74.93	99.66
SpatialMap	DEER	75.00	93.58
SpatialMap	interwhen (k-Stable)	74.93	95.31
SpatialMap	baseline	74.93	100.00
GameOf24	EAT	95.01	100.00
GameOf24	DEER	95.01	96.18
GameOf24	interwhen (k-Stable)	95.45	68.35
GameOf24	baseline	95.01	100.00

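With self-verification, interwhen matches state-of-the-art efficiency for early stopping of reasoning models, with no loss in accuracy.
With external verifiers, interwhen improves accuracy over test-time scaling baselines by up to [value missing in source] percentage points while ensuring 100% correctness, and is at least 4x more efficient.
On Maze, SpatialMap, and GameOf24, k-Stable (internal verification) reduces token usage to 67.76%, 95.31%, and 68.35% of the baseline respectively, with accuracy matching or slightly exceeding the baseline.
On Maze and SpatialMap, interwhen with external verifiers outperforms Tree-of-Thought variants in accuracy while preserving correctness; on GameOf24, it shows gains at comparable token efficiency.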
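
The k-Stable early-stopping rule behind the findings above is described only as stopping based on consistency over recent intermediate solutions. One plausible reading, sketched here as an assumption rather than the paper's exact criterion, is to stop as soon as the last k polled answers agree. The function name `k_stable_stop`, the default `k=3`, and the use of `None` for a poll that yields no answer are all hypothetical.

```python
from typing import Iterable, Optional, Tuple


def k_stable_stop(answers: Iterable[Optional[str]], k: int = 3) -> Tuple[Optional[str], int]:
    """Return (answer, polls_used), stopping at the first poll where the last k answers agree.

    If the answers never stabilize, the last polled answer is returned.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    last: Optional[str] = None
    streak = 0
    polls = 0
    for ans in answers:
        polls += 1
        if ans is not None and ans == last:
            streak += 1  # same answer as the previous poll
        else:
            last = ans
            streak = 1 if ans is not None else 0  # a missing answer resets the streak
        if streak >= k:
            return last, polls  # stable: stop generating here
    return last, polls


# Intermediate answers recovered at successive polls of one reasoning trace.
polled = ["7", "12", "12", "12", "12", "12"]
answer, polls_used = k_stable_stop(polled, k=3)
print(answer, polls_used)
```

Here generation would stop after the fourth poll instead of the sixth, which is the kind of saving the token share column reflects; a larger k trades some of that saving for more confidence that the answer has settled.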


This review was generated by AI and reviewed by human editors.