QUICK REVIEW

[Paper Review] Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction

Chen Zhang, Kepu Zhang|arXiv (Cornell University)|Jan 7, 2026

Information Retrieval and Search Behavior|Cited by 0

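One-Sentence Summary

This paper proposes SandwichR, an Answer-Reasoning-Answer framework for query correction that emits a fast initial correction trained to stay consistent with the reasoning that follows, reaching state-of-the-art accuracy at substantially lower latency.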

ABSTRACT

Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.

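Motivation and Objectives

Resolve the trade-off between accuracy and latency in real-time query correction.
Propose an architecture that returns a fast initial correction up front while still benefiting from the reasoning that follows.
Develop a consistency-aware reinforcement learning strategy that aligns the initial correction with the final correction.
Build a high-quality, domain-diverse query correction dataset for benchmarking.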

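Proposed Method

Output format: an Answer-Reasoning-Answer sequence consisting of an initial correction, a reasoning trace, and a final correction.
Two-stage training: (i) supervised fine-tuning (SFT) on reasoning traces and corrections generated by GPT-4o, which teaches the model the SandwichR format; (ii) consistency-aware reinforcement learning (RL) with a margin-based rejection sampling strategy.
Reward design: combines accuracy (F0.5), a format penalty, and a consistency penalty that enforces C_init = C_final.
Policy optimization: GRPO, with a rejection sampling scheme that selects borderline samples where reasoning improves accuracy.
Data construction: (noisy, clean) query pairs built by injecting wrong, missing, or misordered words into real-world queries.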
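
The reward described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `<init>`, `<think>`, and `<final>` tags, the penalty weights, the word-level edit extraction via `difflib`, and the choice to score only the final correction are all assumptions made here, since the summary does not specify them.

```python
import re
from difflib import SequenceMatcher

# Hypothetical delimiters; the summary does not say how the three parts are marked.
PATTERN = re.compile(
    r"^<init>(.*?)</init>\s*<think>(.*?)</think>\s*<final>(.*?)</final>$",
    re.DOTALL,
)


def word_edits(source, target):
    """Return the set of word-level edits that turn `source` into `target`."""
    src, tgt = source.split(), target.split()
    ops = SequenceMatcher(a=src, b=tgt, autojunk=False).get_opcodes()
    return {
        (tag, i1, i2, tuple(tgt[j1:j2]))
        for tag, i1, i2, j1, j2 in ops
        if tag != "equal"
    }


def f_beta(noisy, hypothesis, gold, beta=0.5):
    """F-beta over edits; beta=0.5 weights precision more heavily than recall."""
    hyp, ref = word_edits(noisy, hypothesis), word_edits(noisy, gold)
    if not hyp and not ref:
        return 1.0  # nothing needed fixing and nothing was changed
    tp = len(hyp & ref)
    precision = tp / len(hyp) if hyp else 1.0
    recall = tp / len(ref) if ref else 1.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def reward(noisy, gold, output, format_penalty=1.0, consistency_penalty=0.5):
    """Accuracy of the final answer, minus penalties (weights are assumed)."""
    match = PATTERN.match(output.strip())
    if match is None:
        return -format_penalty  # malformed Answer-Reasoning-Answer output
    c_init, _reasoning, c_final = (g.strip() for g in match.groups())
    score = f_beta(noisy, c_final, gold)
    if c_init != c_final:
        score -= consistency_penalty  # pushes the policy toward C_init = C_final
    return score


noisy, gold = "iphnoe 15 case", "iphone 15 case"
good = "<init>iphone 15 case</init><think>typo</think><final>iphone 15 case</final>"
drift = "<init>iphnoe 15 case</init><think>typo</think><final>iphone 15 case</final>"
print(reward(noisy, gold, good))    # 1.0
print(reward(noisy, gold, drift))   # 0.5
print(reward(noisy, gold, "oops"))  # -1.0
```

In a GRPO-style setup, a scalar like this would be computed for each sampled completion in a group and then normalized within the group; that step is omitted here.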

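Experimental Results

Research Questions

RQ1: Can an Answer-Reasoning-Answer framework deliver low-latency correction without giving up the information that reasoning provides?
RQ2: How can the fast initial correction be aligned with the downstream reasoning so that it captures the benefits of chain-of-thought reasoning?
RQ3: Which training strategies (SFT + RL) and sampling techniques best distill reasoning knowledge into the initial answer?
RQ4: Across diverse domains (e.g., e-commerce, video, medical), how does SandwichR compare with Ans-Rea, Rea-Ans, and traditional models in accuracy and latency?
RQ5: Is there a practical dataset for benchmarking complex query correction that reflects real-world noise?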
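
On RQ5, the dataset is described as (noisy, clean) pairs obtained by injecting wrong, missing, or misordered words into real queries. A rough sketch of that idea is shown below. The function names, the distractor vocabulary, and the one-error-per-query policy are hypothetical choices made for illustration; the actual construction pipeline is not detailed in this summary.

```python
import random

KINDS = ("wrong", "missing", "misordered")


def inject_noise(clean, kind, rng, distractors):
    """Corrupt one word of `clean` according to `kind` (illustrative only)."""
    words = clean.split()
    if len(words) < 2:
        raise ValueError("need at least two words to inject noise")
    i = rng.randrange(len(words))
    if kind == "wrong":
        # Replace a word with a different one drawn from an assumed distractor list.
        options = [w for w in distractors if w != words[i]]
        words[i] = rng.choice(options)
    elif kind == "missing":
        del words[i]
    elif kind == "misordered":
        # Swap the chosen word with a neighbour.
        j = i + 1 if i + 1 < len(words) else i - 1
        words[i], words[j] = words[j], words[i]
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return " ".join(words)


def build_pairs(clean_queries, distractors, seed=0):
    """Return a list of (noisy, clean) pairs with one random error each."""
    rng = random.Random(seed)
    return [
        (inject_noise(q, rng.choice(KINDS), rng, distractors), q)
        for q in clean_queries
    ]


pairs = build_pairs(
    ["wireless noise cancelling headphones", "cheap flights to tokyo"],
    distractors=["wired", "speaker", "hotels"],
)
for noisy, clean in pairs:
    print(f"{noisy!r} -> {clean!r}")
```

Seeding the generator keeps the benchmark reproducible, which matters when the same pairs are reused to compare output formats.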

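Key Findings

SandwichR reaches state-of-the-art correction accuracy, comparable to standard chain-of-thought methods.
Under practical latency constraints, SandwichR reduces latency by 40-70% relative to reasoning-first baselines while maintaining high accuracy.
RL with the consistency reward and margin-based rejection sampling outperforms the SFT baseline across multiple domains (e-commerce, video, medical).
SandwichR consistently outperforms the Ans-Rea and Rea-Ans formats across multiple datasets and error types.
Under a limited token budget, SandwichR maintains higher accuracy at lower latency than the competing formats.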
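
The latency and token-budget findings rest on the initial answer being available before any reasoning tokens are decoded. The sketch below shows one way a serving layer might exploit that, under stated assumptions: the `<init>` tags are hypothetical, `tokens` stands in for any streaming decoder, and stopping as soon as the initial answer closes is an illustrative policy rather than the paper's documented deployment.

```python
from typing import Iterable, Optional, Tuple

OPEN, CLOSE = "<init>", "</init>"  # hypothetical delimiters


def first_answer(tokens: Iterable[str], budget: int) -> Tuple[Optional[str], int]:
    """Read a token stream until the initial answer closes or the budget runs out.

    Returns (answer or None, number of tokens consumed toward the budget).
    """
    buffer = ""
    used = 0
    for tok in tokens:
        if used >= budget:
            break
        buffer += tok
        used += 1
        end = buffer.find(CLOSE)
        if end != -1:
            start = buffer.find(OPEN)
            if start == -1 or start > end:
                return None, used  # malformed output
            return buffer[start + len(OPEN):end].strip(), used
    return None, used


stream = ["<init>", "iphone", " 15", " case", "</init>",
          "<think>", "the", " first", " word", " is", " a", " typo", "</think>",
          "<final>", "iphone", " 15", " case", "</final>"]

print(first_answer(iter(stream), budget=8))  # ('iphone 15 case', 5)
print(first_answer(iter(stream), budget=3))  # (None, 3)
```

A reasoning-first format would need to decode the whole reasoning trace before the same answer appears, which is where the token budget starts to bind.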

This review was generated by AI and reviewed by a human editor.