QUICK REVIEW

[Paper Review] Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Junling Liu, Peilin Zhou | arXiv (Cornell University) | 2023-06-05

Topic Modeling | Citations: 32

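ONE-LINE SUMMARY

This paper introduces CMExam, a large-scale dataset of more than 60K Chinese medical licensing exam questions with solution explanations, and benchmarks a range of LLMs on answer prediction and answer reasoning. GPT-4 achieved the highest zero-shot accuracy among the evaluated models but still falls short of human performance.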

ABSTRACT

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.

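MOTIVATION AND OBJECTIVES

Establishes the need for a standardized, large-scale Chinese medical QA benchmark.
Builds CMExam from real Chinese National Medical Licensing Examination (CNMLE) questions to enable objective evaluation.
Provides question-level annotations for studying model reasoning and knowledge coverage.
Demonstrates GPT-driven annotation, combined with expert verification, as a way to scale up labeling.
Provides baselines across general-domain and medical-domain LLMs on both the prediction and reasoning tasks.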

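PROPOSED METHOD

Constructs CMExam from CNMLE questions, excluding non-text items.
Adds five annotations: ICD-11 disease group, DMIDTC clinical department, medical discipline, area of competency, and difficulty level (based on human performance).
Drafts the annotations with GPT-4, followed by human verification.
Evaluates LLMs on two tasks: answer prediction (multiple choice) and answer reasoning (open-ended explanation).
Fine-tunes open models on CMExam using P-tuning V2 (ChatGLM-6B) and LoRA (LLaMA/Alpaca/Vicuna/Huatuo/MedAlpaca).
Scores predictions with accuracy and weighted F1, and explanations with BLEU and ROUGE.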
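
The two prediction metrics named above can be sketched in plain Python. This is a minimal illustration, assuming single-answer questions and the usual support-weighted definition of F1; it is not the authors' evaluation script, and the `gold` and `pred` lists are hypothetical toy data.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of questions whose predicted option matches the gold option."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_f1(gold, pred):
    """Per-class F1, averaged with weights equal to each class's gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(gold)

# Hypothetical toy data: gold options and model outputs for six questions.
gold = ["A", "B", "C", "A", "D", "B"]
pred = ["A", "B", "A", "A", "D", "C"]
acc = accuracy(gold, pred)
wf1 = weighted_f1(gold, pred)
print(f"accuracy={acc:.3f}  weighted F1={wf1:.3f}")
```

Weighting by gold support keeps frequent answer options from being drowned out by rare ones, which is presumably why the two metrics track each other so closely in the results table.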

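EXPERIMENTAL RESULTS

Research Questions

RQ1: How do state-of-the-art LLMs perform on Chinese medical multiple-choice questions derived from the national licensing exam?
RQ2: Does fine-tuning LLMs on CMExam improve both answer accuracy and reasoning quality?
RQ3: What are the strengths and limitations of general-domain versus medical-domain LLMs on Chinese medical QA?
RQ4: How does model performance vary by disease group, clinical department, medical discipline, competency area, and difficulty level?
RQ5: Where do gaps remain between LLMs and human experts on medical QA tasks?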

Key Results

Model type	Model	Size	Acc (%)	F1 (%)	BLEU-1	BLEU-4	ROUGE-1	ROUGE-2	ROUGE-L
General Domain	GPT-3.5-turbo	175B	46.4±0.6	46.1±0.7	3.56±0.67	1.49±0.51	33.80±0.19	16.39±0.18	14.83±0.13
General Domain	GPT-4	-	61.6±0.1	61.7±0.1	0.17±0.00	0.06±0.00	29.74±0.09	14.84±0.04	11.51±0.03
General Domain	ChatGLM	6B	26.3±0.0	25.7±0.1	16.51±0.08	5.00±0.06	35.18±0.11	15.73±0.05	17.09±0.13
General Domain	LLaMA	7B	0.4±0.0	0.3±0.0	11.99±0.03	5.70±0.0	27.33±0.06	11.88±0.03	10.78±0.04
General Domain	Vicuna	7B	5.0±0.0	4.8±0.1	20.15±0.01	9.26±0.01	38.43±0.02	16.90±0.01	16.33±0.01
General Domain	Alpaca	7B	8.5±0.0	8.4±0.0	4.75±0.00	2.50±0.00	22.52±0.00	9.54±0.00	8.40±0.00
Medical Domain	Huatuo	7B	12.9±0.0	7.0±0.0	0.21±0.00	0.12±0.00	25.11±0.08	11.56±0.04	9.73±0.02
Medical Domain	MedAlpaca	7B	20.0±0.0	10.7±0.0	0.00±0.00	0.00±0.00	1.90±0.00	0.04±0.00	0.52±0.03
Medical Domain	DoctorGLM	6B	-	-	9.43±0.09	2.65±0.03	21.11±0.03	6.86±0.01	9.99±0.06
Medical Domain	PromptCLUE-base-CMExam	0.1B	-	-	18.75±0.08	6.65±0.05	40.88±0.11	21.90±0.11	18.31±0.11
Medical Domain	Bart-base-chinese-CMExam	0.1B	-	-	23.00±0.40	10.35±0.16	44.33±0.09	24.29±0.09	20.80±0.09
Medical Domain	Bart-large-chinese-CMExam	0.1B	-	-	26.37±0.18	11.65±0.08	44.92±0.12	24.34±0.12	21.75±0.03
Medical Domain	BERT-CMExam	0.1B	31.8±0.2	31.2±0.2	-	-	-	-	-
Medical Domain	RoBERTa-CMExam	0.3B	37.1±0.1	36.7±0.4	-	-	-	-	-
Medical Domain	MedAlpaca-CMExam	7B	30.5±0.1	30.4±0.1	16.35±0.80	9.78±0.47	44.31±0.85	27.05±0.50	24.55±0.43
Medical Domain	Huatuo-CMExam	7B	28.6±0.5	29.3±0.2	29.04±0.01	16.72±0.03	43.85±0.24	25.36±0.22	21.72±0.24
Medical Domain	ChatGLM-CMExam	6B	45.3±1.4	45.2±1.4	31.10±0.23	18.94±0.12	43.94±0.28	31.48±0.14	29.39±0.14
Medical Domain	LLaMA-CMExam	7B	18.3±0.5	20.6±0.5	29.25±0.23	16.46±0.10	45.88±0.04	26.57±0.04	23.31±0.02
Medical Domain	Alpaca-CMExam	7B	21.1±0.6	24.9±0.4	29.57±0.10	16.40±0.12	45.48±0.12	25.53±0.18	22.97±0.06
Medical Domain	Vicuna-CMExam	7B	27.3±0.5	28.2±0.3	29.82±0.03	17.30±0.01	44.98±0.16	26.25±0.13	22.44±0.09
Baseline	Random	-	3.1±0.2	5.1±0.3	-	-	-	-	-
Human Performance	Human volunteers	-	71.6	-	-	-	-	-	-
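
As a worked reading of the table, the short sketch below computes how many accuracy points each open model gains from CMExam fine-tuning, plus the GPT-4-to-human gap. The numbers are copied from the rows above; the dictionary names are illustrative only.

```python
# Zero-shot and CMExam-fine-tuned accuracy (%), copied from the table above.
zero_shot = {"ChatGLM": 26.3, "LLaMA": 0.4, "Vicuna": 5.0,
             "Alpaca": 8.5, "Huatuo": 12.9, "MedAlpaca": 20.0}
fine_tuned = {"ChatGLM": 45.3, "LLaMA": 18.3, "Vicuna": 27.3,
              "Alpaca": 21.1, "Huatuo": 28.6, "MedAlpaca": 30.5}

# Gain in accuracy points; rounding hides floating-point noise.
gains = {m: round(fine_tuned[m] - zero_shot[m], 1) for m in zero_shot}

# Gap between human volunteers (71.6%) and GPT-4 (61.6%).
human_gap = round(71.6 - 61.6, 1)

for model, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model:10s} +{gain:.1f} points")
print(f"GPT-4 trails human volunteers by {human_gap:.1f} points")
```

Every fine-tuned open model gains at least 10 points, yet the best of them (ChatGLM-CMExam at 45.3%) remains well below GPT-4's zero-shot 61.6%.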

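GPT-4 achieved the highest zero-shot accuracy among the evaluated models, with 61.6% accuracy and 61.7% F1 on answer prediction, but still trails human accuracy of 71.6%.
Fine-tuned models with far fewer parameters (e.g., ChatGLM-CMExam) can reach accuracy comparable to GPT-3.5 (45.3% vs. 46.4%), showing that fine-tuning helps answer prediction substantially.
Medical-domain LLMs (e.g., Huatuo, DoctorGLM) show limited zero-shot performance, attributed to the narrow scope of their medical corpora; fine-tuning on CMExam improves reasoning quality (BLEU/ROUGE), but BLEU scores for explanations remain fairly low.
Lightweight models fine-tuned on CMExam can approach GPT-3.5 on answer prediction and in some cases surpass it on reasoning; encoder-only models (BERT/RoBERTa) remain competitive baselines.
GPT-family models generate short explanations, which yields low BLEU scores but relatively high ROUGE scores; fine-tuning produces more reasonable explanations.
Performance varies widely across disease groups, clinical departments, and medical disciplines, with the highest accuracy in general areas and lower accuracy in specific fields (e.g., TCMDP, TCM).
Overall, CMExam enables objective evaluation of medical QA and reveals the areas where LLMs still lag behind human performance (foundational medical knowledge and certain specialties).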
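
One plausible mechanism behind the low-BLEU, higher-ROUGE pattern for short explanations is BLEU's brevity penalty, which shrinks exponentially as the output gets shorter than the reference, while ROUGE-1 F1 degrades more gently. The sketch below illustrates this with simplified unigram metrics over placeholder tokens. It is an assumption-laden toy, not the scoring code used in the paper, and real Chinese text would first need character or word segmentation.

```python
import math
from collections import Counter

def unigram_overlap(candidate, reference):
    """Clipped count of candidate tokens that also occur in the reference."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(n, ref[tok]) for tok, n in cand.items())

def bleu1(candidate, reference):
    """Unigram precision multiplied by the BLEU brevity penalty."""
    if not candidate:
        return 0.0
    precision = unigram_overlap(candidate, reference) / len(candidate)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * precision

def rouge1_f(candidate, reference):
    """Harmonic mean of unigram precision and recall."""
    if not candidate or not reference:
        return 0.0
    overlap = unigram_overlap(candidate, reference)
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical tokens: a 20-token reference and a 4-token explanation
# whose tokens all appear in the reference.
reference = [f"t{i}" for i in range(20)]
short = reference[:4]
print(f"BLEU-1    = {bleu1(short, reference):.4f}")
print(f"ROUGE-1 F = {rouge1_f(short, reference):.4f}")
```

With a candidate one fifth the length of the reference, the brevity penalty alone is exp(-4), so BLEU-1 collapses toward zero even at perfect precision, while ROUGE-1 F1 stays at one third.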

This review was generated by AI and checked by a human editor.