QUICK REVIEW

[Paper Review] Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Junling Liu, Peilin Zhou|arXiv (Cornell University)|Jun 5, 2023

Topic Modeling|Cited by 32

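One-Line Summary

This paper introduces CMExam, a dataset of 60K+ Chinese medical exam questions with solution explanations, and benchmarks a range of LLMs on answer prediction and answer reasoning. GPT-4 achieves the highest zero-shot accuracy among the evaluated models but still falls short of human performance.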

ABSTRACT

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.

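Motivation and Objectives

Highlight the need for a standardized, large-scale Chinese medical QA benchmark.
Build CMExam from authentic CNMLE (Chinese National Medical Licensing Examination) questions to enable objective evaluation.
Provide rich question-wise annotations for probing model reasoning and knowledge coverage.
Demonstrate GPT-assisted annotation as a way to scale up labeling under expert verification.
Provide baseline comparisons of general-domain and medical-domain LLMs on both the prediction and reasoning tasks.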

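Proposed Method

Construct CMExam from CNMLE questions, excluding non-text items.
Provide five additional annotations: disease groups (ICD-11), clinical departments (DMIDTC), medical disciplines, areas of competency, and question difficulty (based on human performance).
Bootstrap the annotations with GPT-4, followed by human verification.
Evaluate LLMs on two tasks: answer prediction (multiple choice) and answer reasoning (open-ended explanation).
Fine-tune open models on CMExam using P-tuning V2 (ChatGLM-6B) and LoRA (LLaMA/Alpaca/Vicuna/Huatuo/MedAlpaca).
Evaluate prediction with accuracy and weighted F1, and explanations with BLEU and ROUGE.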
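
The two prediction metrics named above can be sketched in plain Python. This is a minimal illustration, not the paper's evaluation code: the function names and the toy gold/predicted labels are assumptions, and each answer is treated as a single class label.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of questions whose predicted option equals the gold option."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_f1(gold, pred):
    """Per-class F1, averaged with weights proportional to each class's gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n * f1
    return total / len(gold)

# Toy labels for illustration only (not taken from CMExam).
gold = ["A", "B", "C", "A", "D"]
pred = ["A", "B", "A", "A", "C"]
print(accuracy(gold, pred), weighted_f1(gold, pred))
```

Weighting by gold support means frequent answer options dominate the score, which is why weighted F1 tends to track accuracy closely when predictions are not heavily skewed toward one option.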

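Experimental Results

Research Questions

RQ1: How well do state-of-the-art LLMs perform on Chinese medical multiple-choice questions drawn from the national licensing exam?
RQ2: Does fine-tuning LLMs on CMExam improve both answer accuracy and reasoning quality?
RQ3: What are the strengths and limitations of general-domain versus medical-domain LLMs on Chinese medical QA?
RQ4: How does model performance vary across disease groups, clinical departments, medical disciplines, areas of competency, and difficulty levels?
RQ5: What gaps remain between LLMs and human experts on medical QA tasks?

Key Findings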

Model Type	Model	Size	Accuracy (%)	F1 (%)	BLEU-1	BLEU-4	ROUGE-1	ROUGE-2	ROUGE-L
General Domain	GPT-3.5-turbo	175B	46.4±0.6	46.1±0.7	3.56±0.67	1.49±0.51	33.80±0.19	16.39±0.18	14.83±0.13
General Domain	GPT-4	-	61.6±0.1	61.7±0.1	0.17±0.00	0.06±0.00	29.74±0.09	14.84±0.04	11.51±0.03
General Domain	ChatGLM	6B	26.3±0.0	25.7±0.1	16.51±0.08	5.00±0.06	35.18±0.11	15.73±0.05	17.09±0.13
General Domain	LLaMA	7B	0.4±0.0	0.3±0.0	11.99±0.03	5.70±0.0	27.33±0.06	11.88±0.03	10.78±0.04
General Domain	Vicuna	7B	5.0±0.0	4.8±0.1	20.15±0.01	9.26±0.01	38.43±0.02	16.90±0.01	16.33±0.01
General Domain	Alpaca	7B	8.5±0.0	8.4±0.0	4.75±0.00	2.50±0.00	22.52±0.00	9.54±0.00	8.40±0.00
Medical Domain	Huatuo	7B	12.9±0.0	7.0±0.0	0.21±0.00	0.12±0.00	25.11±0.08	11.56±0.04	9.73±0.02
Medical Domain	MedAlpaca	7B	20.0±0.0	10.7±0.0	0.00±0.00	0.00±0.00	1.90±0.00	0.04±0.00	0.52±0.03
Medical Domain	DoctorGLM	6B	-	-	9.43±0.09	2.65±0.03	21.11±0.03	6.86±0.01	9.99±0.06
Medical Domain	PromptCLUE-base-CMExam	0.1B	-	-	18.75±0.08	6.65±0.05	40.88±0.11	21.90±0.11	18.31±0.11
Medical Domain	Bart-base-chinese-CMExam	0.1B	-	-	23.00±0.40	10.35±0.16	44.33±0.09	24.29±0.09	20.80±0.09
Medical Domain	Bart-large-chinese-CMExam	0.1B	-	-	26.37±0.18	11.65±0.08	44.92±0.12	24.34±0.12	21.75±0.03
Medical Domain	BERT-CMExam	0.1B	31.8±0.2	31.2±0.2	-	-	-	-	-
Medical Domain	RoBERTa-CMExam	0.3B	37.1±0.1	36.7±0.4	-	-	-	-	-
Medical Domain	MedAlpaca-CMExam	7B	30.5±0.1	30.4±0.1	16.35±0.80	9.78±0.47	44.31±0.85	27.05±0.50	24.55±0.43
Medical Domain	Huatuo-CMExam	7B	28.6±0.5	29.3±0.2	29.04±0.01	16.72±0.03	43.85±0.24	25.36±0.22	21.72±0.24
Medical Domain	ChatGLM-CMExam	6B	45.3±1.4	45.2±1.4	31.10±0.23	18.94±0.12	43.94±0.28	31.48±0.14	29.39±0.14
Medical Domain	LLaMA-CMExam	7B	18.3±0.5	20.6±0.5	29.25±0.23	16.46±0.10	45.88±0.04	26.57±0.04	23.31±0.02
Medical Domain	Alpaca-CMExam	7B	21.1±0.6	24.9±0.4	29.57±0.10	16.40±0.12	45.48±0.12	25.53±0.18	22.97±0.06
Medical Domain	Vicuna-CMExam	7B	27.3±0.5	28.2±0.3	29.82±0.03	17.30±0.01	44.98±0.16	26.25±0.13	22.44±0.09
Baseline	Random	-	3.1±0.2	5.1±0.3	-	-	-	-	-
Human Performance	Human volunteers	-	71.6	-	-	-	-	-	-
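
The cells in the table above follow a `mean±spread` pattern, with `-` where a metric is not reported. The sketch below parses a few accuracy cells and computes two differences directly from those values. `parse_cell` is a hypothetical helper, and what the ± term measures (for example, a standard deviation over repeated runs) is not stated in this review, so it is simply called `spread` here.

```python
def parse_cell(cell):
    """Return (mean, spread) for 'mean±spread', (mean, None) for a bare number, None for '-'."""
    cell = cell.strip()
    if cell == "-":
        return None
    mean, sep, spread = cell.partition("±")
    return float(mean), (float(spread) if sep else None)

# Accuracy (%) cells copied from the table above.
accuracy_cells = {
    "GPT-4": "61.6±0.1",
    "GPT-3.5-turbo": "46.4±0.6",
    "ChatGLM": "26.3±0.0",
    "ChatGLM-CMExam": "45.3±1.4",
    "DoctorGLM": "-",
    "Human volunteers": "71.6",
}
acc = {name: parse_cell(cell) for name, cell in accuracy_cells.items()}

# Gap between human volunteers and the best zero-shot model, in percentage points.
human_gap = round(acc["Human volunteers"][0] - acc["GPT-4"][0], 1)
# Accuracy gain of ChatGLM after fine-tuning on CMExam, in percentage points.
finetune_gain = round(acc["ChatGLM-CMExam"][0] - acc["ChatGLM"][0], 1)
print(human_gap, finetune_gain)
```

With the table values, the human-to-GPT-4 gap comes out to 10.0 points and the ChatGLM fine-tuning gain to 19.0 points.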

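GPT-4 achieves the best zero-shot results among the evaluated models, with 61.6% accuracy and a weighted F1 of 61.7%, but human accuracy stands at 71.6%.
Fine-tuned models (e.g., ChatGLM-CMExam) reach accuracy comparable to GPT-3.5 despite having far fewer parameters (45.3% vs. 46.4% in the table above), indicating that fine-tuning is highly effective for answer prediction.
Medical-domain LLMs show limited zero-shot performance, attributed to their narrow medical training corpora; fine-tuning on CMExam improves reasoning quality, but BLEU scores for explanations remain low.
Lightweight models fine-tuned on CMExam approach GPT-3.5 on answer prediction and in some cases surpass it on reasoning, while encoder-only models (BERT/RoBERTa) remain competitive baselines.
GPT models tend to generate short explanations, which yields low BLEU but relatively high ROUGE scores. Fine-tuning leads to more reasonable explanations.
Performance varies widely across disease groups, clinical departments, and medical disciplines, with the highest accuracy in general areas and lower accuracy in niche ones.
Overall, CMExam enables objective evaluation of medical QA and highlights where LLMs still fall short of human performance, particularly in basic medical knowledge and certain specialties.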
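
One way to see how short explanations can score low on BLEU yet moderately on ROUGE is BLEU's brevity penalty. The sketch below is an illustration under assumptions, not the paper's evaluation code: it uses unigram BLEU with the standard brevity penalty and ROUGE-1 F1, whitespace tokenization, and toy English sentences that are not taken from CMExam. The tokenization and metric implementations actually used in the paper are not given in this review.

```python
import math
from collections import Counter

def overlap(candidate, reference):
    """Clipped count of unigrams shared by the candidate and reference token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(count, ref[tok]) for tok, count in cand.items())

def bleu1(candidate, reference):
    """Unigram BLEU: clipped precision times the brevity penalty."""
    if not candidate:
        return 0.0
    precision = overlap(candidate, reference) / len(candidate)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * precision

def rouge1_f(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    matches = overlap(candidate, reference)
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy example: a 10-token reference and a 3-token candidate fully contained in it.
reference = "aspirin inhibits platelet aggregation by blocking cyclooxygenase and thromboxane synthesis".split()
short = "aspirin inhibits cyclooxygenase".split()
print(bleu1(short, reference), rouge1_f(short, reference))
```

In this toy case every candidate token matches, so precision is 1.0, but the brevity penalty exp(1 - 10/3) pulls BLEU-1 below 0.1, while ROUGE-1 F1 stays near 0.46 because it only pays for the missing recall.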


This review was generated by AI and checked by a human editor.