[Paper Explained] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
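Introduction
The paper proposes Proactive Interactive Reasoning (PIR), a paradigm in which a reasoning LLM proactively asks the user for clarification and interleaves those questions with its reasoning, improving both accuracy and efficiency.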
Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1
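Motivation and Goals
- Identify the blind self-thinking problem that current reasoning LLMs exhibit when prompts are incomplete or ambiguous.
- Develop PIR so that a model can proactively seek clarification during reasoning and align with user intent.
- Build an uncertainty-aware data augmentation pipeline and a reinforcement learning framework to optimize interactive behavior.
- Demonstrate the effectiveness of PIR on mathematical reasoning, code generation, and document editing.
- Evaluate generalization to factual knowledge, question answering, and missing-premise scenarios.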
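Proposed Method
- Uncertainty-aware interactive data augmentation: high-uncertainty reasoning steps are converted into think-and-ask trajectories by inserting clarifying questions and simulated user answers.
- Supervised fine-tuning on the augmented think-then-ask sequences, teaching the model to switch between reasoning, asking, and integrating feedback.
- US-GRPO: a Group Relative Policy Optimization framework with a dynamic user simulator, used to optimize proactive questioning under a composite reward.
- A composite reward that combines task success (extrinsic reward) with interaction-quality metrics (intrinsic reward) to balance accuracy, efficiency, and helpful clarification.
- KL-regularized policy updates via GRPO, which stabilize learning without training a separate value function.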
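The last three items can be illustrated with a minimal sketch. It is not the paper's implementation: the `Rollout` fields, the weights `w_help`, `w_turn`, `w_len`, and the `max_tokens` cap are all hypothetical choices made for illustration, since this summary does not give the exact reward terms. The group-relative advantage follows the standard GRPO idea of normalizing each sampled trajectory's reward against the other trajectories drawn for the same prompt, which is what removes the need for a learned value function.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled trajectory for a prompt. All fields are illustrative."""
    task_success: float      # extrinsic signal, e.g. 1.0 if the final answer is correct
    helpful_questions: int   # clarifying questions judged useful (hypothetical metric)
    total_questions: int     # all clarifying questions asked
    reasoning_tokens: int    # length of the reasoning trace


def composite_reward(r, w_help=0.3, w_turn=0.1, w_len=0.1, max_tokens=4096):
    """Extrinsic task reward plus an intrinsic interaction-quality term.

    The weights and the three intrinsic terms are assumptions, not values
    taken from the paper.
    """
    helpful_ratio = r.helpful_questions / r.total_questions if r.total_questions else 0.0
    wasted_turns = r.total_questions - r.helpful_questions
    length_penalty = min(r.reasoning_tokens / max_tokens, 1.0)
    intrinsic = w_help * helpful_ratio - w_turn * wasted_turns - w_len * length_penalty
    return r.task_success + intrinsic


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


if __name__ == "__main__":
    group = [
        Rollout(task_success=1.0, helpful_questions=1, total_questions=1, reasoning_tokens=1024),
        Rollout(task_success=0.0, helpful_questions=0, total_questions=3, reasoning_tokens=4096),
        Rollout(task_success=1.0, helpful_questions=0, total_questions=0, reasoning_tokens=2048),
    ]
    rewards = [composite_reward(r) for r in group]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.3f} advantage={a:+.3f}")
```

In this toy setup, a rollout that asks one useful question and answers correctly receives a larger advantage than one that answers correctly without asking, which in turn beats one that asks three unhelpful questions and fails. A full GRPO update would additionally weight token log-probabilities by these advantages and add the KL penalty against a reference policy; that part is omitted here.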

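Experimental Results
Research Questions
- RQ1: Can LLMs detect missing premises or intent gaps during reasoning and proactively ask clarifying questions?
- RQ2: Does proactive interactive reasoning improve accuracy, efficiency, and robustness across tasks with different uncertainty structures?
- RQ3: How do user-simulator quality and reward design affect learning and generalization?
Key Findings
- Compared with strong baselines across tasks, PIR improves accuracy by up to 32.70%, pass rate by up to 22.90%, and BLEU by up to 41.36.
- PIR reduces reasoning computation by roughly 2k tokens per task and halves the number of unnecessary interaction turns.
- US-GRPO with a dynamic user simulator is essential for learning an effective questioning policy and for keeping reasoning stable during interaction.
- PIR generalizes to non-interactive benchmarks and remains robust on factual knowledge, question answering, and missing-premise scenarios.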

This explainer was generated by AI and reviewed by human editors.