QUICK REVIEW

[Paper Review] Conformal Prediction with Large Language Models for Multi-Choice Question Answering

Bhawesh Kumar, Charlie Lu|arXiv (Cornell University)|May 28, 2023

Topic Modeling | Cited by 14

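One-Sentence Summary

This paper applies conformal prediction to multiple-choice question answering (MCQA) with LLaMA-13B, showing that it delivers coverage guarantees and uncertainty estimates useful for selective classification, and examines how well the exchangeability assumption holds across subjects.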

ABSTRACT

As large language models continue to be widely developed, robust uncertainty quantification techniques will become crucial for their safe deployment in high-stakes scenarios. In this work, we explore how conformal prediction can be used to provide uncertainty quantification in language models for the specific task of multiple-choice question-answering. We find that the uncertainty estimates from conformal prediction are tightly correlated with prediction accuracy. This observation can be useful for downstream applications such as selective classification and filtering out low-quality predictions. We also investigate the exchangeability assumption required by conformal prediction to out-of-subject questions, which may be a more realistic scenario for many practical applications. Our work contributes towards more trustworthy and reliable usage of large language models in safety-critical situations, where robust guarantees of error rate are required.

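Motivation and Objectives

Motivate robust uncertainty quantification for LLMs on high-stakes MCQA tasks.
Adapt conformal prediction to produce prediction sets with coverage guarantees for MCQA outputs.
Evaluate how conformal uncertainty correlates with accuracy and whether it can enable selective classification.
Assess how a mismatch between calibration and evaluation data affects the coverage guarantee.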

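Proposed Method

Frame MCQA as a supervised classification task over four options (A-D) and compute the logit of each option with LLaMA-13B.
Convert the logits to softmax probabilities and generate ten prompts per subject, yielding multiple sets of probability outputs for each question.
Apply conformal prediction with the least ambiguous set-valued classifiers (LAC) score to calibrate the threshold q_alpha for the target coverage.
Construct the prediction set C(X) = {y : S(X,y) ≤ q_alpha}, which guarantees the user-specified coverage under exchangeability.
Use random calibration/evaluation splits on 16 subjects, grouped into business, medicine, and computer science.
Compare conformal prediction with naive top-k prediction and analyze the relationship between set size and accuracy.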
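
The calibration and set-construction steps above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the LAC score of an option is one minus its softmax probability (the usual definition), it uses synthetic logits in place of LLaMA-13B outputs, and all function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over option logits (shape: n_questions x n_options)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def lac_scores(probs, labels):
    """Assumed LAC score of the true option: 1 - softmax probability of the answer."""
    return 1.0 - probs[np.arange(len(labels)), labels]

def calibrate_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the score used as q_alpha
    if k > n:
        return np.inf  # too few calibration points: every option gets included
    return np.sort(cal_scores)[k - 1]

def prediction_sets(probs, q_alpha):
    """Boolean mask: option y is in C(X) iff S(X, y) = 1 - p_y <= q_alpha."""
    return (1.0 - probs) <= q_alpha

# Synthetic stand-in for model outputs: four options per question (A-D).
rng = np.random.default_rng(0)
n_cal, n_test, alpha = 500, 500, 0.1
labels = rng.integers(0, 4, size=n_cal + n_test)
logits = rng.normal(size=(n_cal + n_test, 4))
logits[np.arange(len(labels)), labels] += 1.5  # make the true option more likely
probs = softmax(logits)

# Calibrate on the first split, then build sets on the held-out split.
q_alpha = calibrate_threshold(lac_scores(probs[:n_cal], labels[:n_cal]), alpha)
sets = prediction_sets(probs[n_cal:], q_alpha)
coverage = sets[np.arange(n_test), labels[n_cal:]].mean()
print(f"q_alpha={q_alpha:.3f}, empirical coverage={coverage:.3f}, "
      f"mean set size={sets.sum(axis=1).mean():.2f}")
```

Because both splits here are drawn from the same distribution, the empirical coverage should land close to 1 - alpha. Calibrating on one distribution and evaluating on another is the situation in which the guarantee can fail.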

Figure 1 : LLaMA MCQA accuracy is similar for GPT-4 generated questions and real MMLU questions across subjects. For most MMLU subjects, prediction accuracy using one-shot GPT-4 generated questions is similar to when actual MMLU questions are used in one-shot prompts. Results are averaged over ten r

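Experimental Results

Research Questions

RQ1: Does conformal prediction provide valid coverage guarantees on MCQA tasks with LLMs?
RQ2: Across subjects, how does conformal uncertainty (prediction set size) relate to actual accuracy?
RQ3: Can conformal prediction support selective classification by filtering out high-uncertainty predictions?
RQ4: What happens to the coverage guarantee when exchangeability between calibration and evaluation data is violated?
RQ5: How well calibrated are the naive softmax outputs of LLM MCQA before conformal calibration is applied?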
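
For reference, the coverage property behind RQ1 and RQ4 is usually stated as below. This is the standard split conformal formulation rather than a formula quoted from the paper. The calibration set size n and the indexing of the calibration pairs are notational assumptions.

```latex
% Threshold: the ceil((n+1)(1-alpha))-th smallest calibration score
q_\alpha = S_{(\lceil (n+1)(1-\alpha) \rceil)},
\qquad S_i = S(X_i, Y_i), \quad i = 1, \dots, n

% Prediction set for a new question
C(X_{\text{test}}) = \{\, y : S(X_{\text{test}}, y) \le q_\alpha \,\}

% Marginal coverage, valid when (X_1, Y_1), \dots, (X_n, Y_n), (X_{\text{test}}, Y_{\text{test}}) are exchangeable
P\bigl( Y_{\text{test}} \in C(X_{\text{test}}) \bigr) \ge 1 - \alpha
```

The inequality depends on the exchangeability condition in the last comment, which is why calibrating on one subject and evaluating on another (RQ4) can break it.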

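Key Findings

Conformal prediction achieves the desired coverage across all subjects (e.g., 90% at α=0.1).
Prediction set size is negatively correlated with top-1 accuracy, which enables selective classification by filtering out uncertain cases.
Conformal prediction sets adapt their size to the input and maintain coverage more reliably than fixed-size top-k sets.
Calibrating on one subject and evaluating on another can reduce coverage when the subjects come from different domains, which highlights the limits of exchangeability.
Naive softmax calibration is fairly good on average but shows both overconfidence and underconfidence in the tails of the distribution, which justifies the conformal calibration step.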
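
The link between set size and selective classification can be illustrated with a small sketch. The helper names and the toy arrays below are hypothetical and are not results from the paper. The idea is to answer only when the conformal set is small and to abstain otherwise.

```python
import numpy as np

def accuracy_by_set_size(set_mask, probs, labels):
    """Top-1 accuracy grouped by the size of each conformal prediction set."""
    sizes = set_mask.sum(axis=1)
    correct = probs.argmax(axis=1) == labels
    return {int(s): float(correct[sizes == s].mean()) for s in np.unique(sizes)}

def selective_accuracy(set_mask, probs, labels, max_size=1):
    """Answer only when the set has at most max_size options; abstain otherwise.

    Returns (fraction of questions answered, top-1 accuracy on those questions).
    """
    keep = set_mask.sum(axis=1) <= max_size
    correct = probs.argmax(axis=1) == labels
    accuracy = float(correct[keep].mean()) if keep.any() else float("nan")
    return float(keep.mean()), accuracy

# Toy inputs (made up for illustration): four questions, options A-D.
probs = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.35, 0.15, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.35, 0.30, 0.25, 0.10],
])
labels = np.array([0, 1, 1, 2])
set_mask = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
], dtype=bool)

print(accuracy_by_set_size(set_mask, probs, labels))
print(selective_accuracy(set_mask, probs, labels, max_size=1))
```

Raising `max_size` answers more questions at the cost of including less certain ones, so the threshold is a trade-off between the fraction answered and accuracy.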

Figure 2 : The accuracy distribution across subjects for ten prompts. We plot the distribution of accuracy for ten different one-shot prompts.

This review was generated by AI and reviewed by a human editor.