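[Paper Explained] Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset
This paper introduces CMExam, a large-scale Chinese medical exam dataset containing 60K+ multiple-choice questions with solution explanations, and benchmarks a range of large language models on answer prediction and answer reasoning. GPT-4 achieves the best zero-shot accuracy among the evaluated models but still falls short of human performance.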
Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.
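Motivation and Objectives
- Address the need for a standardized, large-scale benchmark for Chinese medical question answering.
- Build CMExam from real Chinese National Medical Licensing Examination (CNMLE) questions to enable objective evaluation.
- Provide rich question-wise annotations for studying model reasoning and knowledge coverage.
- Demonstrate GPT-assisted annotation, scaled up with expert verification.
- Provide baseline comparisons of general-domain and medical-domain LLMs on prediction and reasoning tasks.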
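Proposed Method
- Construct CMExam from CNMLE questions, excluding non-text items.
- Provide five additional annotations: ICD-11 disease groups, DMIDTC clinical departments, medical disciplines, areas of competency, and question difficulty based on human performance.
- Bootstrap the annotations with GPT-4, followed by human verification.
- Evaluate LLMs on two tasks: answer prediction (multiple choice) and answer reasoning (open-ended explanation).
- Finetune open models on CMExam using P-tuning V2 (ChatGLM-6B) and LoRA (LLaMA/Alpaca/Vicuna/Huatuo/MedAlpaca).
- Evaluate predictions with accuracy and weighted F1, and explanations with BLEU and ROUGE.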
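The two answer-prediction metrics named above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes each question has a single gold option letter, and the function names and toy labels are hypothetical. Weighted F1 here means per-option F1 averaged with weights proportional to how often each option is the gold answer.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of questions whose predicted option matches the gold option."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-option F1, weighted by each option's share of the gold labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_gold in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_gold / total) * f1
    return score


# Toy labels, invented for illustration only (not CMExam data).
gold = ["A", "B", "C", "A", "E", "D"]
pred = ["A", "B", "A", "A", "E", "C"]
print(f"accuracy={accuracy(gold, pred):.3f}  weighted F1={weighted_f1(gold, pred):.3f}")
```

Because the weights follow the gold-label distribution, a model that always outputs one option gets credit only on that option's share of the questions, which is why the two metrics can diverge for models such as Huatuo and MedAlpaca in the results table.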
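Experimental Results
Research Questions
- RQ1: How well do state-of-the-art LLMs perform on Chinese medical multiple-choice questions drawn from the national licensing examination?
- RQ2: Does finetuning LLMs on CMExam improve both answer accuracy and reasoning quality?
- RQ3: What are the strengths and limitations of general-domain versus medical-domain LLMs in Chinese medical QA?
- RQ4: How does model performance vary across disease groups, departments, disciplines, competencies, and difficulty levels?
- RQ5: What gaps remain between LLMs and human experts on medical QA tasks?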
Key Findings
| Model type | Model | Size | Acc (%) | F1 (%) | BLEU-1 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|---|---|---|
| General Domain | GPT-3.5-turbo | 175B | 46.4±0.6 | 46.1±0.7 | 3.56±0.67 | 1.49±0.51 | 33.80±0.19 | 16.39±0.18 | 14.83±0.13 |
| General Domain | GPT-4 | - | 61.6±0.1 | 61.7±0.1 | 0.17±0.00 | 0.06±0.00 | 29.74±0.09 | 14.84±0.04 | 11.51±0.03 |
| General Domain | ChatGLM | 6B | 26.3±0.0 | 25.7±0.1 | 16.51±0.08 | 5.00±0.06 | 35.18±0.11 | 15.73±0.05 | 17.09±0.13 |
| General Domain | LLaMA | 7B | 0.4±0.0 | 0.3±0.0 | 11.99±0.03 | 5.70±0.0 | 27.33±0.06 | 11.88±0.03 | 10.78±0.04 |
| General Domain | Vicuna | 7B | 5.0±0.0 | 4.8±0.1 | 20.15±0.01 | 9.26±0.01 | 38.43±0.02 | 16.90±0.01 | 16.33±0.01 |
| General Domain | Alpaca | 7B | 8.5±0.0 | 8.4±0.0 | 4.75±0.00 | 2.50±0.00 | 22.52±0.00 | 9.54±0.00 | 8.40±0.00 |
| Medical Domain | Huatuo | 7B | 12.9±0.0 | 7.0±0.0 | 0.21±0.00 | 0.12±0.00 | 25.11±0.08 | 11.56±0.04 | 9.73±0.02 |
| Medical Domain | MedAlpaca | 7B | 20.0±0.0 | 10.7±0.0 | 0.00±0.00 | 0.00±0.00 | 1.90±0.00 | 0.04±0.00 | 0.52±0.03 |
| Medical Domain | DoctorGLM | 6B | - | - | 9.43±0.09 | 2.65±0.03 | 21.11±0.03 | 6.86±0.01 | 9.99±0.06 |
| Medical Domain | PromptCLUE-base-CMExam | 0.1B | - | - | 18.75±0.08 | 6.65±0.05 | 40.88±0.11 | 21.90±0.11 | 18.31±0.11 |
| Medical Domain | Bart-base-chinese-CMExam | 0.1B | - | - | 23.00±0.40 | 10.35±0.16 | 44.33±0.09 | 24.29±0.09 | 20.80±0.09 |
| Medical Domain | Bart-large-chinese-CMExam | 0.1B | - | - | 26.37±0.18 | 11.65±0.08 | 44.92±0.12 | 24.34±0.12 | 21.75±0.03 |
| Medical Domain | BERT-CMExam | 0.1B | 31.8±0.2 | 31.2±0.2 | - | - | - | - | - |
| Medical Domain | RoBERTa-CMExam | 0.3B | 37.1±0.1 | 36.7±0.4 | - | - | - | - | - |
| Medical Domain | MedAlpaca-CMExam | 7B | 30.5±0.1 | 30.4±0.1 | 16.35±0.80 | 9.78±0.47 | 44.31±0.85 | 27.05±0.50 | 24.55±0.43 |
| Medical Domain | Huatuo-CMExam | 7B | 28.6±0.5 | 29.3±0.2 | 29.04±0.01 | 16.72±0.03 | 43.85±0.24 | 25.36±0.22 | 21.72±0.24 |
| Medical Domain | ChatGLM-CMExam | 6B | 45.3±1.4 | 45.2±1.4 | 31.10±0.23 | 18.94±0.12 | 43.94±0.28 | 31.48±0.14 | 29.39±0.14 |
| Medical Domain | LLaMA-CMExam | 7B | 18.3±0.5 | 20.6±0.5 | 29.25±0.23 | 16.46±0.10 | 45.88±0.04 | 26.57±0.04 | 23.31±0.02 |
| Medical Domain | Alpaca-CMExam | 7B | 21.1±0.6 | 24.9±0.4 | 29.57±0.10 | 16.40±0.12 | 45.48±0.12 | 25.53±0.18 | 22.97±0.06 |
| Medical Domain | Vicuna-CMExam | 7B | 27.3±0.5 | 28.2±0.3 | 29.82±0.03 | 17.30±0.01 | 44.98±0.16 | 26.25±0.13 | 22.44±0.09 |
| Baseline | Random | - | 3.1±0.2 | 5.1±0.3 | - | - | - | - | - |
| Human Performance | Human volunteers | - | 71.6 | - | - | - | - | - | - |
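To read the accuracy column against the human reference, the `mean±std` cells can be parsed and subtracted from the human score. The sketch below copies four accuracy cells from the table; the helper `parse_cell` and the variable names are hypothetical and exist only for this illustration.

```python
def parse_cell(cell):
    """Turn a 'mean±std' cell into (mean, std); a bare number gets std 0.0, '-' gives None."""
    cell = cell.strip()
    if cell == "-":
        return None
    mean, _, std = cell.partition("±")
    return float(mean), (float(std) if std else 0.0)


# Accuracy cells copied from the table above.
acc_cells = {
    "GPT-4": "61.6±0.1",
    "GPT-3.5-turbo": "46.4±0.6",
    "ChatGLM-CMExam": "45.3±1.4",
    "Human volunteers": "71.6",
}

human_mean, _ = parse_cell(acc_cells["Human volunteers"])

# Gap to human accuracy, in percentage points.
gaps = {
    name: round(human_mean - parse_cell(cell)[0], 1)
    for name, cell in acc_cells.items()
    if name != "Human volunteers"
}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below human accuracy")
```

On these cells the best model, GPT-4, is about 10 points below the human volunteers, and the finetuned 6B ChatGLM-CMExam is within about 1 point of GPT-3.5-turbo.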
- GPT-4 achieves the highest zero-shot results among the evaluated models on answer prediction, with 61.6% accuracy and 61.7% weighted F1, but human accuracy is 71.6%.
- Finetuned models (e.g., ChatGLM-CMExam) reach accuracy comparable to GPT-3.5 with far fewer parameters (45.3% at 6B vs. 46.4% for GPT-3.5-turbo, up from 26.3% for ChatGLM without finetuning), showing that finetuning helps substantially for answer prediction.
- Medical-domain LLMs (e.g., Huatuo, DoctorGLM) show limited zero-shot performance due to narrow medical corpora; finetuning on CMExam improves reasoning quality (BLEU/ROUGE) but BLEU scores remain low for explanations.
- Lightweight models finetuned on CMExam can approach GPT-3.5 performance on answer prediction and outperform it on the explanation metrics (e.g., ChatGLM-CMExam reaches BLEU-4 18.94 vs. 1.49 for GPT-3.5-turbo), while encoder-only models (BERT/RoBERTa) remain competitive baselines.
- GPT models generate short explanations, resulting in lower BLEU scores but relatively higher ROUGE scores; finetuning yields more reasonable explanations.
- There is substantial variation in performance across disease groups, clinical departments, and medical disciplines, with highest accuracy in common areas and lower accuracy in niche domains (e.g., TCMDP, TCM, certain disciplines).
- Overall, CMExam enables objective evaluation of medical QA and highlights areas where LLMs still lag human performance, particularly in medical fundamentals and certain specialties.
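The per-annotation breakdown described in the findings (accuracy by disease group, department, discipline, and so on) amounts to grouping predictions by an annotation field. The sketch below assumes each record is a dict with `gold`, `pred`, and annotation keys; the field names and the toy records are hypothetical and do not reflect the actual CMExam file schema or results.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Accuracy per distinct value of one annotation field (e.g. department)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["pred"] == record["gold"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy records, invented for illustration only.
records = [
    {"gold": "A", "pred": "A", "department": "Internal Medicine", "difficulty": "easy"},
    {"gold": "C", "pred": "C", "department": "Internal Medicine", "difficulty": "hard"},
    {"gold": "B", "pred": "D", "department": "Traditional Chinese Medicine", "difficulty": "hard"},
    {"gold": "E", "pred": "E", "department": "Traditional Chinese Medicine", "difficulty": "easy"},
]

for field in ("department", "difficulty"):
    print(field, accuracy_by_group(records, field))
```

Running the same function once per annotation field gives the kind of slice-level view needed to spot weak areas that an overall accuracy figure hides.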
This summary was generated by AI and reviewed by a human editor.