QUICK REVIEW

[Paper Review] Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction

Chen Zhang, Kepu Zhang|arXiv (Cornell University)|Jan 7, 2026

Information Retrieval and Search Behavior | Citations: 0

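One-Line Summary

This paper introduces SandwichR, an Answer-Reasoning-Answer framework for query correction. It produces a fast initial correction that is aligned with post-hoc reasoning, substantially reducing latency while achieving state-of-the-art accuracy.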

ABSTRACT

Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.

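Motivation and Objectives

Address the accuracy-latency trade-off in real-time query correction.
Propose an architecture that emits a fast initial correction up front while still leveraging the reasoning that follows.
Develop a consistency-aware reinforcement learning strategy that aligns the initial and final corrections.
Construct a high-quality, domain-diverse query correction dataset for benchmarking.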

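Proposed Method

Output format: an initial correction, a reasoning trace, and a final correction, emitted as an Answer-Reasoning-Answer sequence.
Two-stage training: (i) supervised fine-tuning (SFT) on reasoning and corrections generated by GPT-4o, to acquire the SandwichR capability; (ii) consistency-aware reinforcement learning (RL) with a margin-based rejection sampling strategy.
Reward design: combines accuracy (F0.5) with a format penalty and a consistency penalty, enforcing C_init = C_final.
Policy optimization: uses GRPO, with a rejection sampling scheme that selects borderline samples where reasoning improves accuracy.
Data construction: injects incorrect, missing, or out-of-order words into real-world query data to create (noisy, clean) pairs.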
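
To make the reward design concrete, here is a minimal Python sketch of a reward that combines F0.5 with a format penalty and a consistency penalty. Everything specific in it is an assumption for illustration and is not taken from the paper: the `<answer>`/`<think>` tag layout, the token-level edit extraction via `difflib`, the choice to score the final correction, and the penalty magnitudes.

```python
import re
from difflib import SequenceMatcher

# Hypothetical tag layout: the review does not state the real delimiters.
PATTERN = re.compile(
    r"^\s*<answer>(?P<init>.*?)</answer>\s*"
    r"<think>(?P<reasoning>.*?)</think>\s*"
    r"<answer>(?P<final>.*?)</answer>\s*$",
    re.DOTALL,
)


def parse_output(text):
    """Split an Answer-Reasoning-Answer output; return None if malformed."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return m["init"].strip(), m["reasoning"].strip(), m["final"].strip()


def edits(source, target):
    """Token-level edits turning source into target (assumed edit definition)."""
    src, tgt = source.split(), target.split()
    ops = SequenceMatcher(a=src, b=tgt, autojunk=False).get_opcodes()
    return {
        (tag, i1, i2, tuple(tgt[j1:j2]))
        for tag, i1, i2, j1, j2 in ops
        if tag != "equal"
    }


def f_beta(noisy, predicted, gold, beta=0.5):
    """F-beta over edit sets; beta=0.5 weights precision above recall."""
    pred_edits, gold_edits = edits(noisy, predicted), edits(noisy, gold)
    if not pred_edits and not gold_edits:
        return 1.0  # nothing to fix and nothing changed
    tp = len(pred_edits & gold_edits)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred_edits), tp / len(gold_edits)
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def reward(noisy, output, gold, format_penalty=1.0, consistency_penalty=0.5):
    """Accuracy minus penalties; the penalty values are illustrative only."""
    parsed = parse_output(output)
    if parsed is None:
        return -format_penalty  # malformed output gets the format penalty
    init, _, final = parsed
    score = f_beta(noisy, final, gold)  # assumption: score the final answer
    if init != final:
        score -= consistency_penalty  # push toward C_init = C_final
    return score
```

For example, `reward("iphnoe 15 case", "<answer>iphone 15 case</answer><think>typo</think><answer>iphone 15 case</answer>", "iphone 15 case")` scores a fully correct and consistent output, while an output whose initial answer leaves the typo in place loses the consistency penalty even though its final answer is right.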

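Experimental Results

Research Questions

RQ1: Can the Answer-Reasoning-Answer framework deliver low-latency corrections while still using reasoning information?
RQ2: Can the fast initial correction be aligned with the downstream reasoning so that it mimics the benefits of CoT?
RQ3: Can SFT + RL together with the sampling method distill reasoning into the initial answer most effectively?
RQ4: How does SandwichR compare with Ans-Rea, Rea-Ans, and prior models in accuracy and latency across diverse domains?
RQ5: Is there a practical dataset for benchmarking complex query correction that reflects real-world noise?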

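Key Findings

SandwichR achieves state-of-the-art correction accuracy comparable to the standard Chain-of-Thought approach.
Under practical latency constraints, SandwichR reduces inference latency by 40-70% relative to reasoning-first baselines while maintaining high accuracy.
RL with the consistency reward and margin-based rejection sampling improves on the SFT baseline across multiple domains (e-commerce, video, medical).
SandwichR consistently outperforms Ans-Rea and Rea-Ans across datasets and error types.
Even under a constrained token budget, SandwichR maintains higher accuracy and lower latency than competing formats.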
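
The latency benefit rests on the initial correction being usable before any reasoning tokens are decoded. The Python sketch below shows one way a serving layer might stop reading a token stream as soon as the first answer closes. The `<answer>` tags and the `first_answer` helper are hypothetical and are not described in the paper.

```python
from typing import Iterable, Optional, Tuple

OPEN_TAG = "<answer>"    # hypothetical delimiters, not from the paper
CLOSE_TAG = "</answer>"


def first_answer(tokens: Iterable[str]) -> Tuple[Optional[str], int]:
    """Read tokens only until the initial answer is complete.

    Returns the initial correction (or None if none was found) and the
    number of tokens consumed from the stream.
    """
    buffer = ""
    consumed = 0
    for tok in tokens:
        buffer += tok
        consumed += 1
        end = buffer.find(CLOSE_TAG)
        if end != -1:
            start = buffer.find(OPEN_TAG)
            if start == -1 or start > end:
                return None, consumed  # closing tag without an opening tag
            return buffer[start + len(OPEN_TAG):end].strip(), consumed
    return None, consumed  # stream ended before an answer closed
```

With a stream such as `["<answer>", "iphone", " 15", " case", "</answer>", "<think>", ...]`, the helper returns after five tokens and leaves the reasoning and final answer unread, which is where an answer-first format saves time compared with a reasoning-first one.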


This review was generated by AI and checked by a human editor.