QUICK REVIEW

[Paper Review] When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

Xinyu Zhou, Chang Jin|arXiv (Cornell University)|Feb 4, 2026
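One-Sentence Summary

This paper studies how to train large language models to abstain in temporal question answering, using a CoT-SFT cold start followed by reinforcement learning (RL) with abstention-aware rewards; RL improves Exact Match on TimeQA and the true-positive rate on unanswerable questions, whereas SFT can induce overconfidence.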

ABSTRACT

Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.

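Motivation and Goals

  • Investigate how information types and training methods affect LLM temporal reasoning with abstention.
  • Assess whether reinforcement learning (RL) improves on supervised fine-tuning for temporal reasoning with abstention.
  • Examine how implicit versus explicit reasoning cues affect abstention performance in temporal QA.
  • Provide a pipeline that couples Chain-of-Thought (CoT) supervision with RL guided by abstention-aware rewards.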

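Proposed Method

  • Define temporal QA with abstention and compare implicit signals (original context, time-filtered sub-context, knowledge graphs) with explicit (CoT) reasoning signals.
  • Propose a GRPO-based RL objective that optimizes abstention and reasoning under KL-regularized policy updates.
  • Build a CoT-SFT cold start by fine-tuning on high-quality CoT data, then apply RL fine-tuning with a reward that combines output format, answer accuracy, and abstention signals.
  • Design time-relevant sub-context extraction and knowledge-graph extraction to supply the model with implicit reasoning cues.
  • Evaluate multiple model sizes and configurations (SFT vs. RL) on TimeQA Easy/Hard and on non-temporal out-of-distribution (OOD) datasets.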
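To make the reward design above more concrete, here is a minimal sketch of a reward that combines format, answer accuracy, and abstention terms, together with the group-relative advantage normalization used in GRPO. The `<answer>` tag format, the `unanswerable` label, the weights, and both function names are illustrative assumptions, not details taken from the paper.

```python
import re
from statistics import mean, pstdev

ABSTAIN = "unanswerable"  # hypothetical abstention label (assumption)


def normalize(text):
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().split())


def abstention_reward(completion, gold, w_format=0.2, w_answer=1.0, w_abstain=1.0):
    """Toy reward with format, accuracy, and abstention terms.

    `gold` is the reference answer, or None when the question is unanswerable.
    All weights and the <answer> tag convention are assumptions.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # unparsable output earns no reward at all
    pred = normalize(match.group(1))
    reward = w_format  # credit for following the output format
    if gold is None:
        # Unanswerable question: reward only a correct abstention.
        if pred == ABSTAIN:
            reward += w_abstain
    elif pred == normalize(gold):
        # Answerable question: reward only an exact answer match.
        reward += w_answer
    return reward


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one unanswerable question.
group = [
    "<think>No date in range.</think><answer>Unanswerable</answer>",
    "<think>Probably 1998.</think><answer>1998</answer>",
    "<answer>2001</answer>",
    "I think the answer is 1998.",
]
rewards = [abstention_reward(c, gold=None) for c in group]
advantages = group_advantages(rewards)
```

In this sketch, an unnecessary abstention on an answerable question earns only the format credit; whether the actual reward penalizes such cases more strongly is not stated in this summary.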

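Experimental Results

Research Questions

  • RQ1: Does RL tuning with abstention-aware rewards outperform supervised methods on temporal QA?
  • RQ2: How do different information types (original context, time-filtered sub-context, knowledge graphs) affect abstention and temporal reasoning?
  • RQ3: Does explicit CoT supervision offer an advantage over implicit cues for abstention in temporal QA?
  • RQ4: What trade-offs arise between overall accuracy and abstention ability under different training regimes?
  • RQ5: How well does abstention ability transfer to non-temporal, out-of-distribution QA tasks?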

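Key Findings

  • RL yields strong gains on reasoning: a model initialized from Qwen2.5-1.5B-Instruct and trained with RL surpasses GPT-4o in Exact Match by 3.46% on TimeQA-Easy and 5.80% on TimeQA-Hard.
  • RL improves the True Positive rate on unanswerable questions by 20% over a pure SFT variant.
  • SFT tends to induce overconfidence and harm reliability; RL improves prediction accuracy but exhibits similar risks.
  • Compared with explicit CoT supervision, implicit reasoning cues (original context, temporal sub-context, knowledge graphs) provide limited benefit for reasoning with abstention.
  • Smaller models can reach competitive results with a CoT-SFT cold start, whereas larger models show diminishing returns without RL; CoT-SFT is essential for achieving effective RL gains.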
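The two headline metrics above can be illustrated with a short sketch. It assumes that Exact Match is computed over all questions, with unanswerable questions scored against an `unanswerable` label, and that the True Positive rate is the share of unanswerable questions on which the model abstains; the paper's exact definitions may differ, and the function name and data are made up for illustration.

```python
ABSTAIN = "unanswerable"  # hypothetical abstention label (assumption)


def normalize(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def evaluate(preds, golds):
    """Return (exact_match, unanswerable_tp_rate) for parallel lists.

    A gold value of None marks an unanswerable question.
    """
    hits = 0
    unanswerable = 0
    abstained_correctly = 0
    for pred, gold in zip(preds, golds):
        target = ABSTAIN if gold is None else gold
        correct = normalize(pred) == normalize(target)
        hits += correct
        if gold is None:
            unanswerable += 1
            abstained_correctly += correct
    em = hits / len(golds) if golds else 0.0
    tp_rate = abstained_correctly / unanswerable if unanswerable else 0.0
    return em, tp_rate


# Made-up predictions: one correct answer, one correct abstention,
# one wrong answer, and one unnecessary abstention.
preds = ["1998", "Unanswerable", "Paris", "unanswerable"]
golds = ["1998", None, "London", "2001"]
em, tp_rate = evaluate(preds, golds)
print(f"EM={em:.2f}, unanswerable TP rate={tp_rate:.2f}")
```

Reporting the two numbers separately matters because a model can raise its True Positive rate simply by abstaining more often, which would lower Exact Match on answerable questions.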


This review was generated by AI and checked by human editors.