QUICK REVIEW

[Paper Review] When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

Xinyu Zhou, Chang Jin|arXiv (Cornell University)|2026. 02. 04.

Topic Modeling | Citations: 0

One-Line Summary

This paper studies how to train LLMs to abstain in temporal QA by following a CoT-SFT cold start with RL guided by abstention-aware rewards. It shows that RL improves Exact Match on TimeQA and raises the true-positive rate on unanswerable questions, whereas SFT can induce overconfidence.

ABSTRACT

Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.

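Motivation and Objectives

Study how information types and training methods affect temporal reasoning and abstention in LLMs.
Evaluate whether reinforcement learning (RL) strengthens abstention-aware temporal reasoning beyond supervised fine-tuning (SFT).
Examine how implicit versus explicit reasoning signals affect abstention performance in temporal QA.
Provide a pipeline that couples Chain-of-Thought (CoT) supervision with abstention-aware RL rewards.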

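Proposed Method

Define temporal QA with abstention, and compare implicit reasoning signals (original context, time-filtered context, knowledge graphs) against explicit ones (CoT supervision).
Present a GRPO-based reinforcement learning objective, with KL-regularized policy updates, that optimizes abstention and reasoning jointly.
Build a CoT-SFT cold start from high-quality CoT data, then fine-tune with RL using a combined reward that merges answer correctness with an abstention signal.
Design temporal sub-context extraction and knowledge-graph extraction to supply the model with implicit reasoning signals.
Evaluate on TimeQA Easy/Hard and on non-temporal out-of-distribution (OOD) datasets, across model sizes and training configurations (SFT vs. RL).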
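
The combined reward and the group-based update described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstention marker `ABSTAIN`, the reward magnitudes, and the helper names are all hypothetical, since this review does not state the paper's exact reward values. The advantage computation follows the commonly described GRPO recipe of normalizing each sampled response's reward against the mean and standard deviation of its group; the KL-regularization term is omitted here.

```python
from statistics import mean, pstdev
from typing import List, Optional

ABSTAIN = "unanswerable"  # hypothetical abstention marker, not taken from the paper


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so string comparison is lenient."""
    return " ".join(text.lower().strip().split())


def abstention_aware_reward(
    prediction: str,
    gold: Optional[str],
    r_correct: float = 1.0,          # assumed value
    r_wrong: float = 0.0,            # assumed value
    r_false_abstain: float = -0.5,   # assumed value
    r_missed_abstain: float = -1.0,  # assumed value
) -> float:
    """Score one response; gold is None when the question is unanswerable."""
    abstained = normalize(prediction) == ABSTAIN
    if gold is None:
        # Unanswerable question: abstaining is the correct behavior.
        return r_correct if abstained else r_missed_abstain
    if abstained:
        # Answerable question, but the model refused to answer.
        return r_false_abstain
    return r_correct if normalize(prediction) == normalize(gold) else r_wrong


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses to one unanswerable question.
rewards = [abstention_aware_reward(p, None) for p in ["Unanswerable", "1999"]]
advantages = group_relative_advantages(rewards)
```

In this toy group, the abstaining response receives a positive advantage and the confident wrong answer a negative one, which is the direction of pressure an abstention-aware reward is meant to apply.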

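Experimental Results

Research Questions

RQ1: Does RL tuning with abstention-aware rewards outperform supervised approaches on temporal QA tasks?
RQ2: How do different information types (original context, time-filtered sub-context, knowledge graphs) affect abstention and temporal reasoning?
RQ3: What advantage does explicit CoT supervision offer over implicit signals for abstention in temporal QA?
RQ4: What is the trade-off between overall accuracy and abstention ability across training regimes?
RQ5: How well does abstention ability transfer to non-temporal, out-of-distribution QA tasks?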

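Key Findings

RL yields strong gains on reasoning: a model initialized from Qwen2.5-1.5B-Instruct and trained with RL surpasses GPT-4o in Exact Match by 3.46% on TimeQA-Easy and 5.80% on TimeQA-Hard.
RL training improves the true-positive rate on unanswerable questions by 20% over a pure SFT variant.
SFT tends to induce overconfidence and harm reliability; RL improves prediction accuracy but still exhibits similar risks.
Implicit reasoning signals (original context, temporal sub-context, knowledge graphs) provide limited benefit for reasoning with abstention compared with explicit CoT supervision.
A CoT-SFT cold start lets even smaller models reach competitive results, whereas larger models show diminishing returns without RL; CoT-SFT is essential for enabling effective RL gains.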
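
The two metrics behind these findings can be illustrated with a small sketch. It assumes that Exact Match is a normalized string comparison on answerable questions and that the true-positive rate is the share of unanswerable questions on which the model abstains; the paper's exact definitions may differ, and the `ABSTAIN` marker and function names are hypothetical.

```python
from typing import List, Optional, Tuple

ABSTAIN = "unanswerable"  # hypothetical abstention marker, not taken from the paper


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().strip().split())


def evaluate(examples: List[Tuple[str, Optional[str]]]) -> Tuple[float, float]:
    """Return (exact_match, abstention_tpr) for (prediction, gold) pairs.

    gold is None for unanswerable questions (an assumption of this sketch).
    """
    answerable = [(p, g) for p, g in examples if g is not None]
    unanswerable = [p for p, g in examples if g is None]

    # Exact Match: fraction of answerable questions answered verbatim.
    em = (
        sum(normalize(p) == normalize(g) for p, g in answerable) / len(answerable)
        if answerable
        else 0.0
    )
    # True-positive rate: fraction of unanswerable questions where the model abstained.
    tpr = (
        sum(normalize(p) == ABSTAIN for p in unanswerable) / len(unanswerable)
        if unanswerable
        else 0.0
    )
    return em, tpr


examples = [
    ("1999", "1999"),        # correct answer
    ("2001", "1999"),        # wrong answer
    ("Unanswerable", None),  # correct abstention
    ("2005", None),          # missed abstention
]
em, tpr = evaluate(examples)
```

Reporting the two numbers separately makes the accuracy versus abstention trade-off raised in RQ4 visible, since a model can raise one while lowering the other.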


This review was generated by AI and reviewed by a human editor.