QUICK REVIEW

[Paper Review] A Survey on Evaluation of Large Language Models

Yupeng Chang, Xu Wang|arXiv (Cornell University)|Jul 6, 2023

Topic Modeling | Citations: 195

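One-Line Summary

This paper comprehensively surveys how large language models (LLMs) are evaluated along three dimensions (what to evaluate, where to evaluate, and how to evaluate) and highlights the relevant tasks, benchmarks, and open challenges.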

ABSTRACT

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, educations, natural and social sciences, agent applications, and other areas. Secondly, we answer the 'where' and 'how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

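Motivation and Objectives

Summarize existing evaluation tasks for LLMs across NLP, reasoning, ethics, education, science, and application domains.
Analyze the evaluation datasets and benchmarks used to assess LLM performance.
Discuss evaluation methods, including automatic and human evaluation, and clarify their strengths and limitations.
Highlight the key challenges and future directions for principled, robust, and holistic LLM evaluation.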

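Proposed Approach

Organizes LLM evaluation along three dimensions: what to evaluate, where to evaluate, and how to evaluate.
Reviews NLP tasks (core NLP, reasoning, multilingual, and factuality) and other domains (medicine, ethics, social science, and agent applications).
Catalogs the general and task-specific benchmarks and datasets used for LLM evaluation (GLUE, MMLU, BIG-bench, and others).
Examines automatic and human evaluation approaches and discusses their roles in LLM evaluation.
Discusses the key challenges of LLM evaluation and the open-source resources (an open repository) that support it.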
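
To make the automatic-evaluation side of the "how to evaluate" dimension concrete, the following is a minimal sketch of two reference-based metrics often used for question-answering style tasks: exact match and token-level F1. The function names, the whitespace-and-lowercase normalization, and the sample data are assumptions made for illustration; they are not taken from the paper or from any specific benchmark's official scorer.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple, assumed normalization)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Average both metrics over paired model outputs and gold answers."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    n = len(references)
    pairs = list(zip(predictions, references))
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "token_f1": sum(token_f1(p, r) for p, r in pairs) / n,
    }


if __name__ == "__main__":
    # Hypothetical model outputs and gold answers.
    print(evaluate(["Paris", "the red car"], ["paris", "a red car"]))
```

Metrics of this kind are cheap and reproducible, but they reward surface overlap only, which is one reason the review treats human evaluation as a necessary complement.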

Figure 3. The evaluation process of AI models.

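Results

Research Questions

RQ1: What evaluation tasks are used to assess LLMs, and what do they reveal about the models' strengths and weaknesses?
RQ2: Where are LLMs evaluated (which datasets and benchmarks), and which benchmarks capture their capabilities well?
RQ3: How are LLMs evaluated (automatic vs. human evaluation, protocol design), and what are the limitations of current evaluation methods?
RQ4: What are the key challenges and future directions in LLM evaluation?
RQ5: What insights can guide the development of more robust and trustworthy LLMs?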

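Key Findings

LLMs perform strongly on many NLP tasks but show weaknesses in specific areas of reasoning and semantic understanding.
Evaluation benchmarks vary in scope and may not fully capture emergent abilities or safety considerations.
Both automatic and human evaluation are essential, but each has constraints that affect reliability and interpretation.
A unified, principled evaluation framework is needed that covers general and domain-specific tasks, robustness, and trustworthiness.
Open-source materials and continued benchmark development are essential for collaborative progress in LLM evaluation.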
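
One common way to quantify the reliability concern on the human-evaluation side is inter-annotator agreement. The sketch below computes Cohen's kappa for two raters assigning categorical labels to the same model outputs. Using kappa here is an illustrative assumption rather than a procedure described in this review, and the labels and function name are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical quality judgments from two annotators on four model outputs.
    a = ["good", "good", "bad", "bad"]
    b = ["good", "bad", "bad", "bad"]
    print(cohens_kappa(a, b))
```

A kappa near 1 suggests the annotation guidelines are being applied consistently, while a value near 0 means the raters agree no more often than chance would predict, so scores built on those labels are hard to interpret.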

This review was generated by AI and checked by a human editor.