QUICK REVIEW

[Paper Review] Conformal Prediction with Large Language Models for Multi-Choice Question Answering

Bhawesh Kumar, Charlie Lu | arXiv (Cornell University) | May 28, 2023

Topic Modeling | Citations: 14

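One-Line Summary

This paper applies conformal prediction to multiple-choice question answering (MCQA) with LLaMA-13B, shows that it yields coverage guarantees and uncertainty estimates useful for selective classification, and examines the exchangeability assumption across subjects.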

ABSTRACT

As large language models continue to be widely developed, robust uncertainty quantification techniques will become crucial for their safe deployment in high-stakes scenarios. In this work, we explore how conformal prediction can be used to provide uncertainty quantification in language models for the specific task of multiple-choice question-answering. We find that the uncertainty estimates from conformal prediction are tightly correlated with prediction accuracy. This observation can be useful for downstream applications such as selective classification and filtering out low-quality predictions. We also investigate the exchangeability assumption required by conformal prediction to out-of-subject questions, which may be a more realistic scenario for many practical applications. Our work contributes towards more trustworthy and reliable usage of large language models in safety-critical situations, where robust guarantees of error rate are required.

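Research Motivation and Goals

Motivate robust uncertainty quantification for LLMs in high-stakes MCQA tasks.
Apply conformal prediction to MCQA outputs to produce prediction sets with guaranteed coverage.
Evaluate how conformal uncertainty correlates with accuracy and whether it enables selective classification.
Assess how the exchangeability assumption affects coverage when the calibration data differs from the evaluation data.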

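Proposed Method

Frame MCQA as a supervised classification problem over four answer choices (A-D) and compute the logit of each choice with LLaMA-13B.
Convert the logits to softmax probabilities and generate ten prompts per subject, yielding multiple probability outputs per question.
Apply conformal prediction with the LAC (least ambiguous set-valued classifiers) score to calibrate a threshold q_alpha that matches the target coverage.
Construct the prediction set C(X) = {y : S(X,y) ≤ q_alpha}, which guarantees the user-specified coverage under exchangeability.
Run experiments with random calibration/evaluation splits on 16 subjects, grouped into business, medicine, and computer science.
Compare conformal prediction with naive top-k prediction and analyze the relationship between set size and accuracy.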
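
The calibration and set-construction steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code. It assumes the LAC score takes its standard form S(X, y) = 1 - (softmax probability of y), uses the usual finite-sample quantile level ceil((n + 1)(1 - alpha)) / n, and substitutes synthetic four-way logits for real LLaMA-13B outputs. The function names `softmax`, `lac_threshold`, and `prediction_sets` are hypothetical.

```python
import math
import numpy as np

def softmax(logits):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def lac_threshold(cal_probs, cal_labels, alpha):
    # Assumed LAC score: 1 - probability assigned to the true answer.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank: the k-th smallest calibration score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every choice.
        return 1.0
    return float(np.sort(scores)[k - 1])

def prediction_sets(probs, q_alpha):
    # Boolean matrix: entry (i, y) is True when choice y is in C(X_i).
    return (1.0 - probs) <= q_alpha

# Synthetic stand-in for model logits over choices A-D (an assumption).
rng = np.random.default_rng(0)

def fake_logits(n):
    labels = rng.integers(0, 4, size=n)
    logits = rng.normal(size=(n, 4))
    logits[np.arange(n), labels] += 1.5  # make the true choice more likely
    return logits, labels

cal_logits, cal_labels = fake_logits(1000)
test_logits, test_labels = fake_logits(2000)

alpha = 0.1
q_alpha = lac_threshold(softmax(cal_logits), cal_labels, alpha)
sets = prediction_sets(softmax(test_logits), q_alpha)

# Empirical coverage: fraction of test questions whose true answer is in the set.
coverage = sets[np.arange(len(test_labels)), test_labels].mean()
print(f"q_alpha={q_alpha:.3f} coverage={coverage:.3f} "
      f"mean set size={sets.sum(axis=1).mean():.2f}")
```

Because the synthetic calibration and test data come from the same distribution, they are exchangeable, so the empirical coverage should land near 1 - alpha. Drawing the two splits from different distributions would be one way to mimic the cross-subject setting the paper examines.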

Figure 1 : LLaMA MCQA accuracy is similar for GPT-4 generated questions and real MMLU questions across subjects. For most MMLU subjects, prediction accuracy using one-shot GPT-4 generated questions is similar to when actual MMLU questions are used in one-shot prompts. Results are averaged over ten r

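Experimental Results

Research Questions

RQ1: Does conformal prediction provide valid coverage guarantees for MCQA tasks with LLMs?
RQ2: How does conformal uncertainty (prediction set size) relate to actual accuracy across different subjects?
RQ3: Can filtering out high-uncertainty predictions support selective classification?
RQ4: How do violations of exchangeability between the calibration and evaluation data affect the coverage guarantee?
RQ5: How well calibrated are the naive softmax outputs in LLM MCQA before conformal calibration is applied?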

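Key Findings

Conformal prediction achieves the desired coverage on all subjects (e.g., 90% at alpha = 0.1).
Prediction set size is negatively correlated with top-1 accuracy, so uncertain cases can be filtered out for selective classification.
Prediction sets produced by conformal prediction adapt their size to the input and maintain coverage more reliably than fixed-size top-k sets.
Calibrating on one subject and evaluating on another can reduce coverage when the subjects belong to different domains, which highlights the limits of exchangeability.
Naive softmax outputs are reasonably calibrated on average but are under- or overconfident in the tails, which justifies the conformal calibration step.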
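
The link between set size and accuracy suggests a simple selective-classification rule: answer only when the prediction set is small. The sketch below is illustrative and not taken from the paper. The helper names `accuracy_by_set_size` and `selective_accuracy`, the `max_size` parameter, and the toy arrays are all assumptions made for the example.

```python
import numpy as np

def accuracy_by_set_size(probs, labels, sets):
    # Top-1 accuracy grouped by the size of each conformal prediction set.
    sizes = sets.sum(axis=1)
    correct = probs.argmax(axis=1) == labels
    return {int(s): float(correct[sizes == s].mean()) for s in np.unique(sizes)}

def selective_accuracy(probs, labels, sets, max_size=1):
    # Keep only questions whose set has at most `max_size` choices, then
    # report the retained fraction and the top-1 accuracy on that subset.
    keep = sets.sum(axis=1) <= max_size
    correct = probs.argmax(axis=1) == labels
    retained = float(keep.mean())
    acc = float(correct[keep].mean()) if keep.any() else float("nan")
    return retained, acc

# Toy softmax outputs over choices A-D with hand-written prediction sets.
probs = np.array([[0.90, 0.05, 0.03, 0.02],
                  [0.40, 0.35, 0.15, 0.10],
                  [0.30, 0.30, 0.20, 0.20],
                  [0.85, 0.05, 0.05, 0.05]])
labels = np.array([0, 1, 0, 0])
sets = np.array([[True, False, False, False],
                 [True, True, False, False],
                 [True, True, False, False],
                 [True, False, False, False]])

print(accuracy_by_set_size(probs, labels, sets))
print(selective_accuracy(probs, labels, sets, max_size=1))
```

In this toy data the singleton sets are all answered correctly while the two-element sets are right only half the time, which mirrors the negative correlation reported above. Real results depend on the model and the calibration split.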

Figure 2 : The accuracy distribution across subjects for ten prompts. We plot the distribution of accuracy for ten different one-shot prompts.


This review was generated by AI and reviewed by a human editor.