QUICK REVIEW

[Paper Review] Benchmarking LLMs via Uncertainty Quantification

Fanghua Ye, Mingming Yang|arXiv (Cornell University)|Jan 23, 2024

Topic Modeling | Citations: 8

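One-sentence summary

This paper proposes a benchmarking framework for open-source LLMs that incorporates uncertainty via conformal prediction, and introduces UAcc, a new metric that combines uncertainty and accuracy.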

ABSTRACT

The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves nine LLMs (LLM series) spanning five representative natural language processing tasks. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs.

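Research motivation and objectives

Motivate the need to evaluate LLMs on uncertainty in addition to accuracy.
Propose a conformal-prediction-based uncertainty quantification method for LLMs.
Benchmark eight open-source LLMs on five NLP tasks, each recast as multiple-choice question answering (MCQA).
Introduce and validate an uncertainty-aware accuracy (UAcc) metric.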

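Proposed method

Convert five NLP tasks into multiple-choice questions and obtain a softmax score for each option from the LLMs.
Apply conformal prediction with two conformal score functions (LAC and APS) to generate prediction sets with a coverage guarantee.
Compare base pretrained and instruction-finetuned LLM variants under three prompting strategies: Base, Shared Instruction, and Task-specific Instruction.
Evaluate with Accuracy (Acc), Set Size (SS), and UAcc, defined as UAcc = (Acc / SS) * sqrt(|Y|), where |Y| is the number of answer options.
Investigate how model scale, instruction finetuning, and the proportion of calibration data affect uncertainty and performance.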
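
The conformal prediction step above can be sketched as follows. This is a minimal illustration and not the authors' code. It assumes the usual definitions: the LAC score of an option is one minus its softmax probability, and the non-randomized APS score is the summed probability of all options at least as likely as it. The function names, the synthetic data, the six-option setup, and the 50/50 calibration split are all hypothetical choices made for the example.

```python
import numpy as np


def lac_scores(probs):
    """LAC score of every option: 1 - softmax probability."""
    return 1.0 - probs


def aps_scores(probs):
    """APS score of every option: summed probability of all options
    that are at least as likely as it (the option itself included)."""
    # at_least[i, j, k] is True when option k is at least as likely as option j.
    at_least = probs[:, None, :] >= probs[:, :, None]
    return (at_least * probs[:, None, :]).sum(axis=-1)


def conformal_threshold(cal_scores, alpha):
    """k-th smallest calibration score with k = ceil((n + 1) * (1 - alpha))."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:  # too few calibration points: keep every option
        return np.inf
    return np.sort(cal_scores)[k - 1]


def prediction_sets(probs, threshold, score_fn):
    """Boolean mask of shape (n, K): True where the option stays in the set."""
    return score_fn(probs) <= threshold


def run_demo(score_fn, alpha=0.1, n=4000, num_options=6, seed=0):
    """Calibrate on half of a synthetic dataset, evaluate on the other half."""
    rng = np.random.default_rng(seed)
    logits = 2.0 * rng.normal(size=(n, num_options))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Synthetic labels drawn from the model's own distribution (toy assumption).
    draws = rng.random(n)[:, None]
    labels = np.minimum((probs.cumsum(axis=1) < draws).sum(axis=1), num_options - 1)

    half = n // 2
    # Calibration uses only the score of the true option of each instance.
    cal_scores = score_fn(probs[:half])[np.arange(half), labels[:half]]
    threshold = conformal_threshold(cal_scores, alpha)

    sets = prediction_sets(probs[half:], threshold, score_fn)
    coverage = sets[np.arange(n - half), labels[half:]].mean()
    return coverage, sets.sum(axis=1).mean()


if __name__ == "__main__":
    for name, fn in [("LAC", lac_scores), ("APS", aps_scores)]:
        coverage, set_size = run_demo(fn)
        print(f"{name}: coverage={coverage:.3f}, mean set size={set_size:.2f}")
```

The average number of options kept per question plays the role of SS: the larger the sets needed to reach the target coverage, the more uncertain the model. The threshold is taken as an order statistic of the sorted calibration scores so that the sketch does not depend on any particular quantile interpolation rule.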

Figure 1: An illustration of two LLMs accurately predicting the true answer (with option A possessing the highest probability), but showing different levels of uncertainty. Note that when both LLMs predict a wrong answer, they may also display different uncertainties.
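
The situation in Figure 1 can be made concrete with a toy example. The two probability vectors, the four-option setup, and the threshold value below are hypothetical, chosen only to show that two models can agree on the top answer while producing prediction sets of different sizes. In a real run the threshold would be estimated from calibration data.

```python
import numpy as np

OPTIONS = np.array(["A", "B", "C", "D"])

# Hypothetical softmax outputs for one question whose true answer is A.
confident_llm = np.array([0.85, 0.05, 0.05, 0.05])
hesitant_llm = np.array([0.40, 0.30, 0.20, 0.10])

THRESHOLD = 0.75  # hypothetical value, normally estimated from calibration data


def lac_prediction_set(probs, threshold):
    """Keep every option whose LAC score (1 - probability) is within the threshold."""
    return OPTIONS[(1.0 - probs) <= threshold].tolist()


for name, probs in [("confident", confident_llm), ("hesitant", hesitant_llm)]:
    top = OPTIONS[probs.argmax()]
    print(name, "top answer:", top, "set:", lac_prediction_set(probs, THRESHOLD))
```

Both models pick A, so accuracy alone cannot tell them apart, but the second model needs a larger set to stay within the same threshold.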

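Experimental results

Research questions

RQ1: How does uncertainty quantified with conformal prediction relate to conventional accuracy across different LLMs?
RQ2: In a practical benchmark, does a larger model size increase or decrease uncertainty?
RQ3: How does instruction finetuning affect accuracy, uncertainty, and the proposed UAcc metric?
RQ4: Can UAcc change the relative ranking of LLMs compared with accuracy alone?
RQ5: How does the proportion of calibration data affect uncertainty quantification?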

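Key findings

LLMs with higher accuracy may nonetheless exhibit higher uncertainty in practice.
Larger LLMs can show higher uncertainty than smaller ones on some tasks.
Instruction finetuning tends to increase uncertainty.
The UAcc metric can amplify or dampen relative improvements and may change model rankings.
In the authors' setup, the proportion of calibration data has little effect on coverage, SS, or UAcc.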
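
How UAcc can reorder models is easiest to see with numbers. The sketch below applies the definition UAcc = Acc / SS * sqrt(|Y|) to two made-up models. The accuracy and set-size values, the model names, and the choice of six answer options are hypothetical and are not results from the paper.

```python
import math


def uacc(acc, set_size, num_options):
    """Uncertainty-aware accuracy: Acc / SS * sqrt(|Y|)."""
    if set_size <= 0:
        raise ValueError("average set size must be positive")
    return acc / set_size * math.sqrt(num_options)


# Hypothetical (accuracy, average set size) pairs, not results from the paper.
models = {"model_x": (0.70, 3.0), "model_y": (0.65, 2.0)}
NUM_OPTIONS = 6  # assumed number of answer options per question

by_acc = sorted(models, key=lambda m: models[m][0], reverse=True)
by_uacc = sorted(models, key=lambda m: uacc(*models[m], NUM_OPTIONS), reverse=True)

print("ranked by Acc: ", by_acc)
print("ranked by UAcc:", by_uacc)
```

Here model_x wins on accuracy, but its larger average prediction set pulls its UAcc below that of model_y, so the ranking flips. Because of the sqrt(|Y|) factor, UAcc is not bounded by 1: an accuracy of 0.5 with an average set size of 1 over four options already gives 1.0.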

Figure 2: The overall process of applying conformal prediction for uncertainty quantification in LLMs. (a) Five distinct tasks are considered, and a dataset comprising 10,000 instances is prepared for each task. (b) Each data instance is transformed into a multiple-choice question, and eight LLMs (L


This review was generated by AI and checked by a human editor.