QUICK REVIEW

[Paper Review] Evaluating Large Language Models: A Comprehensive Survey

Zishan Guo, Renren Jin | arXiv (Cornell University) | Oct 30, 2023

Topic Modeling | Citations: 61

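One-Line Summary

A comprehensive survey that categorizes and summarizes LLM evaluation methods across knowledge and capability, alignment, safety, and domain-specific benchmarks, and discusses evaluation platforms and future directions.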

ABSTRACT

Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation and safety evaluation. In addition to the comprehensive review on the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interests in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers has been publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.

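Motivation and Objectives

Categorize LLM evaluation into knowledge and capability, alignment, and safety, as a guide toward rigorous evaluation.
Review existing benchmarks, datasets, and methods across all evaluation dimensions.
Summarize evaluation efforts in specialized domains (biology, education, law, computer science, finance).
Analyze evaluation platforms and organizations to support comprehensive and responsible LLM deployment.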

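Proposed Approach

Presents a taxonomy of LLM evaluation comprising five major domains and their subdomains.
Surveys datasets, benchmarks, and evaluation practices for knowledge, reasoning, tool learning, ethics, toxicity, accuracy, and robustness.
Discusses evaluations of specialized LLMs and how they apply to domain-specific tasks.
Compares prior surveys and explains this survey's distinct contribution: integrating capability and alignment evaluation.
Outlines future directions, including risk evaluation, agent evaluation, dynamic evaluation, and enhancement-oriented evaluation.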
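
As a rough illustration of the five-part taxonomy described above, the sketch below models each evaluation dimension as a small record with its sub-areas. The class name `EvalCategory`, the helper `find_category`, and the exact grouping of sub-areas under each dimension are assumptions made for this example, pieced together from the terms used in this review; they are not taken from the survey's own outline or code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalCategory:
    """One top-level evaluation dimension and the sub-areas grouped under it."""
    name: str
    sub_areas: List[str] = field(default_factory=list)


# Hypothetical grouping: the sub-area assignments are an assumption based on
# the wording of this review, not a copy of the survey's table of contents.
TAXONOMY = [
    EvalCategory("knowledge and capability",
                 ["question answering", "knowledge completion",
                  "reasoning", "tool learning"]),
    EvalCategory("alignment", ["ethics", "toxicity", "truthfulness"]),
    EvalCategory("safety", ["robustness", "risk evaluation"]),
    EvalCategory("specialized domains",
                 ["biology", "education", "law", "computer science", "finance"]),
    EvalCategory("evaluation platforms"),
]


def find_category(sub_area: str) -> Optional[str]:
    """Return the name of the dimension that lists sub_area, or None."""
    for category in TAXONOMY:
        if sub_area in category.sub_areas:
            return category.name
    return None
```

A lookup such as `find_category("tool learning")` then returns the dimension a benchmark topic falls under, which can be handy when tagging benchmarks by dimension.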

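Results

Research Questions

RQ1: What are the main dimensions and sub-dimensions of LLM evaluation (knowledge, alignment, safety, specialization), and how do they interrelate?
RQ2: What benchmarks, datasets, and methods are used to evaluate LLM capabilities, alignment with human values, and safety risks?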

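Key Findings

LLMs are evaluated across knowledge and capability, alignment, and safety, with specialized-domain evaluations extending the scope.
Diverse benchmarks and methods exist, covering question answering, knowledge completion, various reasoning types, and tool learning.
Evaluation platforms that jointly cover capability, alignment, safety, and applicability are being built.
Recent work highlights both the need for and the limitations of evaluating truthfulness, toxicity, robustness, and potential risks.
By integrating the capability and alignment perspectives into a holistic view, this survey complements and extends prior surveys.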

This review was generated by AI and checked by a human editor.