QUICK REVIEW

[Paper Review] CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li, Yixuan Zhang|arXiv (Cornell University)|Jun 15, 2023

Topic Modeling | Citations: 16

One-Line Summary

CMMLU is a comprehensive Chinese multitask benchmark spanning 67 topics, used here to evaluate more than 20 LLMs. Most models struggle to reach 60% accuracy, while GPT-4 averages about 71%.

ABSTRACT

As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.

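Motivation and Objectives

Evaluate the knowledge and reasoning abilities of large language models in Chinese across diverse fields.
Identify the factors that affect performance on CMMLU.
Compare multilingual, Chinese-oriented, and China-specific LLMs on a standardized Chinese benchmark.
Offer practical directions for improving the Chinese-language capabilities of LLMs.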

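Proposed Method

Uses CMMLU, which comprises 67 subjects and 11,528 questions, each with four answer choices.
Evaluates models in zero-shot and five-shot settings: open models by next-token prediction, closed models by regular-expression extraction of the answer from the generated text.
Compares a broad range of commercial and open-source models across sizes and training paradigms (foundation, SFT, RLHF).
Analyzes how chain-of-thought prompting, few-shot demonstrations, model size, negation handling, and sub-options affect performance.
Provides per-subject and per-category performance analysis to interpret strengths and weaknesses.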
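
The prompt settings and the two answer-selection strategies listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the English prompt template, the regex pattern, and the score values are all hypothetical.

```python
import re

CHOICES = ["A", "B", "C", "D"]  # every CMMLU question has four options


def format_item(question, options):
    """Render one four-option question, ending with an answer cue (assumed template)."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(question, options, demos=()):
    """Zero-shot when `demos` is empty; five-shot when it holds five solved examples.

    Each demo is a (question, options, answer_letter) tuple.
    """
    parts = [f"{format_item(q, opts)} {ans}" for q, opts, ans in demos]
    parts.append(format_item(question, options))
    return "\n\n".join(parts)


def pick_by_next_token(choice_scores):
    """Open-model style: return the option letter with the highest next-token score.

    `choice_scores` is a hypothetical dict of letter -> logit or log-probability.
    """
    return max(CHOICES, key=lambda letter: choice_scores[letter])


# Assumed pattern: the first standalone capital A-D in the generated text.
ANSWER_RE = re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])")


def pick_by_regex(response):
    """Closed-model style: extract the answer letter from free text, or None."""
    match = ANSWER_RE.search(response)
    return match.group(1) if match else None


def accuracy(predictions, gold):
    """Percentage of predictions that match the gold letters."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)
```

With this sketch, `pick_by_regex("答案是 C。")` returns `"C"`, while a longer response such as `"A good guess is B"` returns `"A"` because the pattern takes the first standalone letter. A simple extraction rule like this one is therefore fragile on verbose output.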

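Experimental Results

Research Questions

RQ1: How well does a broad set of LLMs perform on a Chinese multitask knowledge benchmark?
RQ2: How do factors such as chain-of-thought prompting, few-shot examples, and model size affect CMMLU results?
RQ3: Which subjects are hardest for current LLMs, and how do models fare on China-specific topics?
RQ4: Do negation and the sub-option question format significantly affect model accuracy on CMMLU?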

Key Findings

Model	State	STEM	Humanities	Social Science	Other	China-specific	Average
GPT4	Chat	65.23	72.11	72.06	74.79	66.12	70.95
ChatGPT	Chat	47.81	55.68	56.50	62.66	50.69	55.51
LLaMA2-70B*	Base	44.11	57.05	55.63	56.65	48.01	53.21
Falcon-40B	Base	33.33	43.46	44.28	44.75	39.46	41.45
LLaMA-65B	Base	34.47	40.24	41.55	42.88	37.00	39.80
LLaMA2-13B*	Base	33.04	39.73	38.45	42.54	35.67	38.24
BLOOMZ-7B	Chat	30.56	39.10	38.59	40.32	37.15	37.04
LLaMA-30B	Base	29.69	33.68	34.08	37.40	30.68	33.63
LLaMA2-7B*	Base	30.03	34.76	33.72	33.62	30.12	32.96
ZH$_{\text{LLaMA}}$-13B	Chat	27.12	33.18	34.87	35.10	32.97	32.63
BX$_{\text{LLaMA}}$-13B	Chat	27.50	32.47	32.33	35.77	31.64	31.90
LLaMA-13B	Base	29.21	30.96	31.74	33.07	30.86	31.24
Baichuan2-13B*	Base	48.36	67.44	66.40	65.94	63.48	61.92
Baichuan-13B*	Base	42.38	61.61	60.44	59.26	56.62	55.82
InternLM-20B*	Chat	42.70	60.51	58.00	57.62	54.72	54.52
Xverse-13B*	Chat	41.65	55.72	57.47	57.32	52.32	53.08
InternLM-7B*	Base	41.71	54.43	56.42	55.38	53.11	52.07
ChatGLM-6B	Chat	32.35	39.22	39.65	38.62	37.70	37.48
BatGPT-15B	Chat	41.68	50.14	50.78	48.68	46.93	47.88

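GPT-4 achieves the highest average accuracy among the evaluated models at 70.95%, while many open multilingual models cluster at roughly 30-55% depending on the category.
Most models fall short of the 60% pass mark used in Chinese exams, which indicates substantial room for improvement.
Performance is uneven across subjects: scores in the humanities and social sciences are generally higher than in STEM and China-specific topics.
Chain-of-thought prompting rarely improves overall CMMLU performance, and for some models it even interferes with regular-expression answer extraction.
Few-shot learning helps foundation models but may not help SFT/RLHF models consistently. Larger model size brings gains in some families (e.g., LLaMA2), but those gains have limits.
China-specific and STEM subjects are especially difficult, and sub-option questions lower accuracy for many models.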
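
Two of the findings above can be checked directly against the results table. The sketch below copies the STEM, Humanities, and Average columns from that table; the variable names are illustrative, and the 60% pass mark and four-option random baseline are taken from the text.

```python
# (model, STEM, Humanities, Average), copied from the results table.
ROWS = [
    ("GPT4", 65.23, 72.11, 70.95),
    ("ChatGPT", 47.81, 55.68, 55.51),
    ("LLaMA2-70B", 44.11, 57.05, 53.21),
    ("Falcon-40B", 33.33, 43.46, 41.45),
    ("LLaMA-65B", 34.47, 40.24, 39.80),
    ("LLaMA2-13B", 33.04, 39.73, 38.24),
    ("BLOOMZ-7B", 30.56, 39.10, 37.04),
    ("LLaMA-30B", 29.69, 33.68, 33.63),
    ("LLaMA2-7B", 30.03, 34.76, 32.96),
    ("ZH_LLaMA-13B", 27.12, 33.18, 32.63),
    ("BX_LLaMA-13B", 27.50, 32.47, 31.90),
    ("LLaMA-13B", 29.21, 30.96, 31.24),
    ("Baichuan2-13B", 48.36, 67.44, 61.92),
    ("Baichuan-13B", 42.38, 61.61, 55.82),
    ("InternLM-20B", 42.70, 60.51, 54.52),
    ("Xverse-13B", 41.65, 55.72, 53.08),
    ("InternLM-7B", 41.71, 54.43, 52.07),
    ("ChatGLM-6B", 32.35, 39.22, 37.48),
    ("BatGPT-15B", 41.68, 50.14, 47.88),
]

PASS_MARK = 60.0               # pass mark cited in the findings
RANDOM_BASELINE = 100.0 / 4    # chance accuracy with four options

# Models whose average falls below the pass mark.
below_pass = [name for name, _, _, avg in ROWS if avg < PASS_MARK]

# Models that score higher on Humanities than on STEM.
humanities_over_stem = [name for name, stem, hum, _ in ROWS if hum > stem]

# Best model by average accuracy.
best_model = max(ROWS, key=lambda row: row[3])[0]

print(f"{len(below_pass)} of {len(ROWS)} models are below {PASS_MARK}%")
print(f"{len(humanities_over_stem)} of {len(ROWS)} score higher on Humanities than STEM")
print(f"Best average: {best_model}; random baseline: {RANDOM_BASELINE}%")
```

On the listed rows, 17 of the 19 models average below 60%, and every model scores higher on Humanities than on STEM.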

This review was generated by AI and checked by a human editor.