QUICK REVIEW

[Paper Review] Do Large Language Models Know What They Don't Know?

Zhangyue Yin, Qiushi Sun|arXiv (Cornell University)|May 29, 2023

Topic Modeling | Citations: 7

One-Sentence Summary

The paper introduces SelfAware, a dataset of unanswerable and answerable questions, and a text-similarity-based method for measuring LLMs' self-knowledge. It shows that instruction tuning, in-context learning, and larger model size all improve self-knowledge, but that even the best models still lag behind humans.

ABSTRACT

Large language models (LLMs) have a wealth of knowledge that allows them to excel in various Natural Language Processing (NLP) tasks. Current research focuses on enhancing their performance within their existing knowledge. Despite their vast knowledge, LLMs are still limited by the amount of information they can accommodate and comprehend. Therefore, the ability to understand their own limitations on the unknowns, referred to as self-knowledge, is of paramount importance. This study aims to evaluate LLMs' self-knowledge by assessing their ability to identify unanswerable or unknowable questions. We introduce an automated methodology to detect uncertainty in the responses of these models, providing a novel measure of their self-knowledge. We further introduce a unique dataset, SelfAware, consisting of unanswerable questions from five diverse categories and their answerable counterparts. Our extensive analysis, involving 20 LLMs including GPT-3, InstructGPT, and LLaMA, reveals an intrinsic capacity for self-knowledge within these models. Moreover, we demonstrate that in-context learning and instruction tuning can further enhance this self-knowledge. Despite this promising insight, our findings also highlight a considerable gap between the capabilities of these models and human proficiency in recognizing the limits of their knowledge.

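Motivation and Objectives

Evaluate LLMs' ability to recognize unanswerable or unknowable questions (self-knowledge).
Build a diverse dataset of unanswerable and answerable questions to test self-knowledge.
Propose an automated, text-similarity-based uncertainty detection method that quantifies self-knowledge as an F1 score.
Analyze how model size, instruction tuning, and input format affect self-knowledge across multiple models.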

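Proposed Method

Construct SelfAware from multiple sources: 1,032 unanswerable and 2,337 answerable questions, 3,369 in total.
Develop a similarity-based uncertainty detector that compares model responses against a reference set of sentences expressing uncertainty, using SimCSE; responses are split into semantic segments with a sliding window of length 5, and a similarity threshold of 0.75 is applied.
Compute the F1 score with unanswerable questions as the positive class and answerable questions as the negative class.
Evaluate 20 LLMs (including GPT-3, InstructGPT, LLaMA, Alpaca, Vicuna, and GPT-4) under the Direct, Instruction, and ICL input formats.
For comparison, establish a human self-knowledge benchmark by having two volunteers score 100 samples.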
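
The detection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the reference phrases are hypothetical, and a bag-of-words cosine similarity stands in for SimCSE embeddings so the example runs without downloading a model. Only the window length (5) and the threshold (0.75) come from the review.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.75  # similarity threshold stated in the review
WINDOW = 5        # sliding-window length stated in the review

# Hypothetical reference phrases expressing uncertainty (assumption:
# the paper's actual reference set is not reproduced here).
REFERENCE = [
    "the answer is unknown",
    "it is impossible to know",
    "there is no definitive answer",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    """Bag-of-words cosine similarity; a stand-in for SimCSE similarity."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def windows(tokens, size=WINDOW):
    """Return every contiguous window of `size` tokens (or the whole text if shorter)."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def is_uncertain(response):
    """Flag a response if any window is similar enough to any reference phrase."""
    refs = [tokenize(r) for r in REFERENCE]
    return any(
        cosine(window, ref) >= THRESHOLD
        for window in windows(tokenize(response))
        for ref in refs
    )

print(is_uncertain("Honestly, the answer is unknown to anyone."))
print(is_uncertain("Paris is the capital of France."))
```

Swapping `cosine` for a similarity computed over real sentence embeddings would bring the sketch closer to the described setup; the windowing and thresholding logic would stay the same.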

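Experimental Results

Research Questions

RQ1: Can LLMs reliably identify unanswerable or unknowable questions?
RQ2: How does model size affect self-knowledge across different architectures and training regimes?
RQ3: How do instruction tuning and input format (Direct, Instruction, ICL) affect self-knowledge?
RQ4: How close is the self-knowledge of current LLMs to human performance?
RQ5: What are the main categories of unanswerable questions, and how do they affect evaluation?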

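Key Findings

Self-knowledge (the ability to express uncertainty) improves as model size increases, regardless of input format.
Instruction tuning (e.g., InstructGPT) improves self-knowledge relative to the base GPT-3 models.
In-context learning and supplying instructions or examples substantially improve self-knowledge, especially for the davinci series.
GPT-4 showed the highest reported self-knowledge among the models at the time of evaluation, at 75.47%, while humans reach 84.93%.
A substantial gap remains between state-of-the-art LLMs and humans in the ability to recognize the limits of their own knowledge.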
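
To make the reported scores concrete, here is a small sketch of how an F1 score with unanswerable questions as the positive class can be computed. It assumes the percentages above are F1 scores of this kind, as the method description indicates. The gold labels and predictions are made-up toy data, and `f1_unanswerable` is a hypothetical helper name.

```python
def f1_unanswerable(gold, pred):
    """F1 with 'unanswerable' as the positive class.

    gold[i] is True if question i is unanswerable;
    pred[i] is True if the model's response was flagged as uncertain.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy data (not from the paper): 3 unanswerable and 2 answerable questions.
gold = [True, True, True, False, False]
pred = [True, True, False, True, False]
print(f"F1 = {f1_unanswerable(gold, pred):.4f}")

# Gap between the human and GPT-4 scores reported above, in percentage points.
gap = round(84.93 - 75.47, 2)
print(f"Human minus GPT-4: {gap} points")
```

On the toy data there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3.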

This review was generated by AI and checked by a human editor.