QUICK REVIEW

[Paper Review] How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

Zhengbao Jiang, Jun Araki|arXiv (Cornell University)|2020. 12. 02.

Topic Modeling|References: 61|Citations: 40

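One-line summary

This paper examines whether the confidence that language models assign to their answers on QA tasks matches the actual likelihood of those answers being correct, and proposes fine-tuning and post-hoc calibration methods that improve calibration without hurting accuracy.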

ABSTRACT

Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property of a probabilistic model's predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.

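Research motivation and goals

Evaluate whether recent LMs used for QA (T5, BART, GPT-2) produce calibrated confidence estimates.
Develop and evaluate methods that improve calibration through fine-tuning and post-hoc adjustment.
Analyze the strengths and limitations of the calibration approaches and offer insights for future improvement.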

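Proposed method

Treat each QA dataset as a sequence-to-sequence task with input X and candidate outputs Y, and compute P_LM(Y|X) along with probabilities normalized over the candidate set.
Propose two fine-tuning objectives over the candidate set, one softmax-based and one margin-based, to align candidate probabilities with correctness.
Investigate post-hoc calibration: recalibrate confidence with temperature-based scaling and with a decision tree that uses input features.
Introduce LM-specific techniques: paraphrase candidate outputs via round-trip translation to reduce wording bias, and augment inputs with retrieved context.
Evaluate input augmentation using Wikipedia excerpts.
Run ablations to study how model size, the number of paraphrases, and the choice of dataset affect calibration.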
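
As a rough illustration of the candidate normalization, the two fine-tuning objectives, and temperature scaling listed above, here is a minimal Python sketch. It assumes the per-candidate scores are sequence log-probabilities log P_LM(Y|X) and that the margin is applied to those log-probabilities. The function names, the margin value, and the example scores are all hypothetical, and the authors' released code may differ on each of these points.

```python
import math


def normalize(log_probs, temperature=1.0):
    """Softmax over candidate log-probabilities, optionally temperature-scaled.

    temperature=1.0 gives P_LM(Y|X) renormalized over the candidate set;
    temperature>1.0 flattens the distribution (lower top confidence).
    """
    scaled = [lp / temperature for lp in log_probs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(log_probs, gold):
    """Negative log of the normalized probability of the gold candidate."""
    return -math.log(normalize(log_probs)[gold])


def margin_loss(log_probs, gold, margin=0.5):
    """Hinge penalty for each wrong candidate scored within `margin` of gold.

    Assumption: the margin is measured in log-probability space.
    """
    return sum(
        max(0.0, margin + lp - log_probs[gold])
        for i, lp in enumerate(log_probs)
        if i != gold
    )


# Hypothetical log P_LM(Y|X) for three answer candidates.
example_scores = [-1.0, -2.0, -3.0]
raw_conf = normalize(example_scores)
cooled_conf = normalize(example_scores, temperature=2.0)
```

Temperature scaling leaves the ranking of candidates, and therefore accuracy, unchanged; only the confidence attached to the top answer moves, which is why it can be tuned on held-out data after training.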

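Experimental results

Research questions

RQ1: Can LM-based QA models be calibrated so that their confidence matches the likelihood of correctness across diverse QA tasks?
RQ2: Which fine-tuning or post-hoc calibration strategy best improves calibration without hurting accuracy?
RQ3: How do input modifications (paraphrasing, retrieved context) affect calibration performance?
RQ4: How does model size affect calibration quality across datasets?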

Key findings

Baseline LMs (T5, UnifiedQA) show strong accuracy but poor calibration (ECE > 0.2 on MC-test).
Fine-tuning and post-hoc calibration methods improve ECE while maintaining or improving accuracy on multi-choice QA datasets.
The best performing setup (Combo: margin-based fine-tuning plus temperature scaling, paraphrasing, and input augmentation) reduces ECE from 0.095 to 0.044 on MC-test (53% relative reduction).
Paraphrasing candidate answers and providing retrieved contextual evidence significantly boosts calibration, especially for shorter questions.
The calibration methods are complementary, so combining them helps; larger models generally show both higher accuracy and better calibration, though domain-shift effects exist.
On extractive QA, calibration improvements are smaller, likely due to harder candidate span generation; higher entropy in confidence distributions may contribute.
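
For readers unfamiliar with the metric behind these numbers, expected calibration error (ECE) is commonly computed by bucketing predictions by confidence and averaging the gap between accuracy and mean confidence in each bucket, weighted by bucket size. The sketch below assumes equal-width buckets and 10 of them; both choices, the function name, and the toy inputs are assumptions rather than details taken from the review.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 would index past the end, so clamp to the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - mean_conf)
    return ece


# Toy case: four predictions at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)

# Relative ECE reduction implied by the 0.095 and 0.044 figures quoted above.
relative_reduction = (0.095 - 0.044) / 0.095
```

In the toy case the model claims 90% confidence but is right 75% of the time, so the gap in that single bin is the whole ECE. The last line recomputes the relative reduction from the two ECE values in the findings, which comes out slightly above one half.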

This review was generated by AI and reviewed by a human editor.