QUICK REVIEW

[Paper Review] How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

Zhengbao Jiang, Jun Araki|arXiv (Cornell University)|Dec 2, 2020

Topic Modeling|References: 61|Citations: 40

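One-Line Summary

This paper examines how well the confidence scores of language models on QA tasks reflect their actual likelihood of being correct, and proposes fine-tuning and post-hoc methods that improve calibration without sacrificing accuracy.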

ABSTRACT

Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property of a probabilistic model's predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.

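Motivation and Objectives

Evaluate whether state-of-the-art QA language models (T5, BART, GPT-2) produce calibrated confidence estimates.
Develop and evaluate methods that improve calibration through fine-tuning and post-hoc correction.
Analyze the strengths and limitations of the calibration methods and offer pointers for future improvement.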

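Proposed Methods

Treat each QA dataset as a sequence-to-sequence task with input X and candidate outputs Y, and compute P_LM(Y|X) together with the probability normalized over the whole candidate set.
Propose two fine-tuning objectives over the candidate set, one softmax-based and one margin-based, that align candidate probabilities with correctness.
Examine post-hoc calibration: temperature scaling, and a feature-based decision tree that recalibrates confidence from input features.
Introduce LM-specific methods: paraphrasing candidate outputs via round-trip translation to reduce wording bias, and input augmentation that brings in additional context.
Evaluate retrieval augmentation, using Wikipedia excerpts to augment the input.
Use ablations to study how model size, the number of paraphrases, and the choice of dataset affect calibration.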
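
The candidate normalization, the two fine-tuning objectives, and temperature scaling listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the hinge form of the margin loss is one common formulation that may differ from the paper's exact objective, and the log-probabilities are made-up numbers rather than real model outputs.

```python
import math


def normalize_candidates(log_probs, temperature=1.0):
    """Softmax over per-candidate sequence log-probabilities log P_LM(Y|X).

    temperature > 1 flattens the distribution (temperature scaling);
    temperature = 1 gives the plain normalized probability.
    """
    scaled = [lp / temperature for lp in log_probs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(log_probs, gold_index):
    """Negative log of the normalized probability of the correct candidate."""
    return -math.log(normalize_candidates(log_probs)[gold_index])


def margin_loss(log_probs, gold_index, margin=1.0):
    """Hinge loss (assumed form): every wrong candidate should score at
    least `margin` below the correct one in log-probability."""
    gold = log_probs[gold_index]
    return sum(
        max(0.0, margin + lp - gold)
        for i, lp in enumerate(log_probs)
        if i != gold_index
    )


# Hypothetical log P_LM(Y|X) for three candidate answers to one question.
candidate_log_probs = [-2.0, -4.0, -5.0]
confidence = max(normalize_candidates(candidate_log_probs))
softened = max(normalize_candidates(candidate_log_probs, temperature=2.0))
```

In this sketch the confidence attached to a prediction is the largest normalized probability, and `softened` is lower than `confidence` because the higher temperature spreads mass across candidates. In practice the temperature would be chosen on held-out data rather than fixed by hand.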

Experimental Results

Research Questions

RQ1: Can LM-based QA models be calibrated so that their confidence aligns with the likelihood of correctness across diverse QA tasks?
RQ2: What fine-tuning or post-hoc strategies best improve calibration without sacrificing accuracy?
RQ3: How do input variations (paraphrasing, retrieved context) affect calibration performance?
RQ4: How does model size affect calibration quality across datasets?

Key Findings

Baseline LMs (T5, UnifiedQA) show strong accuracy but poor calibration (ECE > 0.2 on MC-test).
Fine-tuning and post-hoc calibration methods lower ECE while maintaining or improving accuracy on multiple-choice QA datasets.
The best-performing setup (Combo: margin-based fine-tuning plus temperature scaling, paraphrasing, and input augmentation) reduces ECE from 0.095 to 0.044 on MC-test (roughly a 54% relative reduction).
Paraphrasing candidate answers and providing retrieved contextual evidence both improve calibration, especially for shorter questions.
The calibration methods are complementary; larger models generally show both higher accuracy and better calibration, though domain-shift effects exist.
On extractive QA, calibration improvements are smaller, likely due to harder candidate span generation; higher entropy in confidence distributions may contribute.
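
The ECE figures quoted in these findings can be made concrete with a small sketch. It assumes the usual binned definition of expected calibration error (equal-width confidence bins, weighted gap between accuracy and mean confidence); the bin count of 10, the bin-edge convention, and the function names are assumptions for illustration, not details confirmed by this review.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|.

    confidences: predicted probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / n * abs(accuracy - mean_conf)
    return ece


def relative_reduction(before, after):
    """Fraction by which a metric dropped, e.g. ECE before and after calibration."""
    return (before - after) / before


# Toy example: a model that says 0.9 four times but is right only once.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
# The 0.095 -> 0.044 change reported above for MC-test.
combo_gain = relative_reduction(0.095, 0.044)
```

Here `overconfident` is about 0.65, the gap between 90% stated confidence and 25% observed accuracy, and `combo_gain` is about 0.537.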

This review was generated by AI and checked by a human editor.