QUICK REVIEW

[Paper Review] Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

Zhouhong Gu, Xiaoxuan Zhu|arXiv (Cornell University)|Jun 9, 2023

Topic Modeling | Citations: 9

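One-Line Summary

Xiezhi is a comprehensive, automatically updated benchmark for evaluating holistic domain knowledge across 516 disciplines. It contains 249,587 questions plus specialized subsets for probing cross-domain ability, and is used to evaluate 47 LLMs.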

ABSTRACT

New Natural Language Processing (NLP) benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions across 516 diverse disciplines spanning 13 subjects, and is accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, both with 15k questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. We anticipate that Xiezhi will help analyze important strengths and shortcomings of LLMs. The benchmark is released at https://github.com/MikeGu721/XiezhiBenchmark.

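Motivation and Objectives

Highlight the need for newer, broader benchmarks that can distinguish LLM capabilities across many disciplines.
Propose a large-scale, automatically updated domain knowledge benchmark aligned with the Chinese discipline taxonomy.
Build the specialized subsets Xiezhi-Specialty and Xiezhi-Interdiscipline to reflect domain-specific and cross-domain reasoning.
Design a 50-option setting for each MCQ, with options ranked by generation probability, to expose true model capability.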
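To see why a 50-option setting suppresses lucky guesses, the chance baseline can be worked out directly. The sketch below compares 4 and 50 options under uniform random guessing. It also reports the expected reciprocal rank of the correct option under a random ranking; scoring by reciprocal rank is an assumption made here for illustration, since this review does not name the metric. The function names are hypothetical.

```python
from fractions import Fraction


def chance_top1(num_options: int) -> Fraction:
    """Probability that a uniform random guess picks the single correct option."""
    return Fraction(1, num_options)


def chance_reciprocal_rank(num_options: int) -> Fraction:
    """Expected 1/rank of the correct option under a uniformly random ranking.

    The correct option is equally likely to land at each rank 1..n,
    so the expectation is (1/n) * (1/1 + 1/2 + ... + 1/n).
    """
    total = sum(Fraction(1, rank) for rank in range(1, num_options + 1))
    return total / num_options


for n in (4, 50):
    print(n, float(chance_top1(n)), float(chance_reciprocal_rank(n)))
```

With 4 options a random guess is right 25% of the time; with 50 options that drops to 2%, so a score well above the floor is much harder to reach by luck.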

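Proposed Method

Assemble 249,587 MCQ items spanning 516 disciplines derived from 13 top-level categories.
Manually annotate 20k questions from Graduate Entrance Examinations with multi-label discipline tags to form Xiezhi-Meta.
Use an annotation model and a fine-tuned classifier to automatically generate and annotate 170k questions from various examinations, and annotate a further 80k from surveys.
For finer-grained evaluation, build Xiezhi-Specialty (questions tagged with 3 or fewer disciplines) and Xiezhi-Interdiscipline (4 or more disciplines).
Introduce 50-option MCQs and rank the options by generation probability, rather than asking the model to select one via instructions, to reduce the effect of random guessing.
Evaluate 47 LLMs, both open-source and API-based (ChatGPT, GPT-4), in Chinese and English under 0-shot, 1-shot, and 3-shot settings.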
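The last steps above (subset assignment by tag count, then ranking options by generation probability) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_log_prob` and `TOY_SCORES` are hypothetical stand-ins for a real model scorer, which would typically sum the token log-probabilities of each option given the question. Summarizing ranks with mean reciprocal rank is likewise an assumption, as this review does not name the metric.

```python
from typing import Callable, Sequence


def subset_name(labels: Sequence[str]) -> str:
    """Assign a question to a subset by its number of discipline tags.

    Thresholds follow the description above: 3 or fewer tags vs. 4 or more.
    """
    return "Xiezhi-Specialty" if len(labels) <= 3 else "Xiezhi-Interdiscipline"


def rank_of_answer(
    question: str,
    options: Sequence[str],
    answer_index: int,
    log_prob: Callable[[str, str], float],
) -> int:
    """Return the 1-based rank of the correct option by descending score."""
    scores = [log_prob(question, option) for option in options]
    # sorted() is stable, so tied options keep their original order.
    order = sorted(range(len(options)), key=lambda i: scores[i], reverse=True)
    return order.index(answer_index) + 1


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all questions (assumed metric)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical scorer: fixed toy log-probabilities instead of a real LLM.
TOY_SCORES = {"opt_a": -2.1, "opt_b": -0.4, "opt_c": -3.0, "opt_d": -1.2}


def toy_log_prob(question: str, option: str) -> float:
    return TOY_SCORES[option]


options = ["opt_a", "opt_b", "opt_c", "opt_d"]
ranks = [
    rank_of_answer("toy question 1", options, 1, toy_log_prob),  # top-scored option
    rank_of_answer("toy question 2", options, 3, toy_log_prob),  # second-best option
]
print(ranks, mean_reciprocal_rank(ranks))
```

Because the model never has to emit an option letter, this style of scoring also works for models that follow instructions poorly; the trade-off is one scoring pass per option, which grows with the 50-option setting.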

Figure 1: In Chinese mythology, the Xiezhi is a legendary creature known for its ability to discern right from wrong and uphold justice. Xiezhi Benchmark encompasses 13 distinct disciplinary categories, 118 sub-disciplines, and 385 further fine-grained disciplines, aiming to provide an extensive dom

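Experimental Results

Research Questions

RQ1: What are the coverage, freshness, and labeling quality of a holistic domain knowledge benchmark spanning 516 disciplines?
RQ2: How do modern LLMs perform across multiple domains when evaluated with 50-option MCQs and generation-probability ranking?
RQ3: Do the specialized datasets Xiezhi-Specialty and Xiezhi-Interdiscipline reveal distinct LLM strengths and limitations compared with the full benchmark?
RQ4: How do pretraining and fine-tuning affect domain knowledge performance, and how do model size and data balance influence the results?
RQ5: Can Xiezhi distinguish fine-grained capability differences among LLMs, from GPT-4 down to small-parameter models?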

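Key Findings

Top open-source LLMs fine-tuned on domain data outperform the average human in science, engineering, agronomy, and medicine, but lag behind in economics, jurisprudence, pedagogy, literature, history, and management.
GPT-4 and ChatGPT show strong few-shot improvements, whereas many smaller LLMs do not consistently benefit from demonstrations.
Model size alone does not guarantee better performance; the chosen architecture and the balance of the training data shape the results.
Specialized fine-tuning on the medical domain yields strong medical-domain performance but can come at the cost of general-domain understanding.
Xiezhi shows the highest performance variance among the compared benchmarks, indicating that it sharply distinguishes capability differences between LLMs.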
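The review does not say how the performance variance is computed. One plausible reading is the variance of per-model scores within each benchmark, where a larger spread means the benchmark separates models more clearly. The sketch below illustrates that reading only; the benchmark names, model names, and scores are made-up placeholders, not results from the paper.

```python
from statistics import pvariance
from typing import Mapping


def score_variance(scores_by_model: Mapping[str, float]) -> float:
    """Population variance of one benchmark's per-model scores."""
    return pvariance(list(scores_by_model.values()))


# Placeholder numbers for illustration only (NOT from the paper).
placeholder_scores = {
    "benchmark_a": {"model_x": 0.30, "model_y": 0.32, "model_z": 0.31},
    "benchmark_b": {"model_x": 0.10, "model_y": 0.45, "model_z": 0.80},
}

variances = {name: score_variance(s) for name, s in placeholder_scores.items()}
most_discriminative = max(variances, key=variances.get)
print(variances, most_discriminative)
```

Under this reading, a benchmark on which all models cluster together (like `benchmark_a` here) says little about which model is stronger.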

Figure 2: The figure on the right is the statistics of all questions collected by Xiezhi. The middle figure shows statistics for Xiezhi-Specialty and the left shows Xiezhi-Interdiscipline.


This review was generated by AI and checked by a human editor.