QUICK REVIEW

[Paper Review] Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

Shefayat E Shams Adib, Ahmed Alfey Sani|arXiv (Cornell University)|Feb 16, 2026

Artificial Intelligence in Healthcare and Education|Cited by 0

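One-Sentence Summary

Bottom line: the paper benchmarks five LLMs on zero-shot medical question answering using the iCliniq dataset, comparing automatic metrics (BLEU/ROUGE) with LLM-as-a-Judge evaluation to gauge medical accuracy and safety. Larger models perform better, with Llama 3.3 70B Instruct in the lead.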

ABSTRACT

Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially in developing medical QA systems for enhancing access to healthcare in low-resourced settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, which contains 38,000 medical questions and answers across diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We use a zero-shot evaluation methodology with BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. Notably, Llama-4-Maverick-17B exhibited competitive results, highlighting efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in real clinical environments. This benchmark aims to serve as a standardized setting for future studies that seek to minimize model size and computational resources while maximizing clinical utility in medical NLP applications.

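Research Motivation and Objectives

Provide a comprehensive zero-shot benchmark of modern LLMs for medical QA on a large real-world dataset (iCliniq).
Assess how model size and architecture correlate with medical QA performance.
Introduce and validate a standardized dual evaluation framework that combines automatic metrics with an LLM-as-a-Judge assessment of clinical quality.
Offer deployment guidance that balances accuracy against resource constraints in clinical settings.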

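Proposed Method

A zero-shot evaluation protocol is applied to five LLMs using a standardized medical prompt.
3,000 questions are selected for evaluation from the 38,000-entry iCliniq medical QA dataset.
BLEU and ROUGE are computed to assess lexical similarity and coverage.
An LLM-as-a-Judge framework (Claude Sonnet 4) scores medical accuracy, completeness, safety, clarity, and helpfulness on a 5-point scale, with weights of 30/25/20/15/10.
Results are compared with a MedLM baseline from prior work to put the improvements in context.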
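
The weighted LLM-as-a-Judge score described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the 30/25/20/15/10 weights map to the five criteria in the order listed, and the criterion keys, function name, and example ratings are hypothetical.

```python
# Assumption: weights apply to the criteria in the order the summary lists them
# (accuracy 30%, completeness 25%, safety 20%, clarity 15%, helpfulness 10%).
WEIGHTS = {
    "medical_accuracy": 0.30,
    "completeness": 0.25,
    "safety": 0.20,
    "clarity": 0.15,
    "helpfulness": 0.10,
}


def weighted_judge_score(scores: dict) -> float:
    """Combine per-criterion ratings on a 1-5 scale into one weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    # The weights sum to 1, so the result stays on the same 1-5 scale.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical judge ratings for a single answer (not taken from the paper).
example = {
    "medical_accuracy": 5,
    "completeness": 4,
    "safety": 4,
    "clarity": 5,
    "helpfulness": 3,
}
print(round(weighted_judge_score(example), 2))  # 4.35
```

Because the weights sum to 1, the combined value is directly comparable to the per-criterion ratings, and medical accuracy moves it three times as much as helpfulness does.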

Experimental Results

Research Questions

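RQ1: How do five contemporary LLMs perform on zero-shot medical QA using the iCliniq dataset?
RQ2: How do model size and architecture relate to medical QA performance in the zero-shot setting?
RQ3: How well does LLM-as-a-Judge evaluation agree with traditional BLEU/ROUGE metrics in medical QA?
RQ4: What deployment recommendations can be derived for high-accuracy clinical environments and for resource-constrained environments?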

Key Findings

Model	BLEU-1	BLEU-4	ROUGE-1	ROUGE-2	ROUGE-L
Llama-3-8B-Instruct	0.1739	0.0127	0.2419	0.0379	0.1219
Llama 3.2 3B	0.2012	0.0122	0.2588	0.0355	0.1258
Llama 3.3 70B Instruct	0.2207	0.0141	0.2761	0.0404	0.1306
Llama-4-Maverick 17B 128E Instruct	0.2089	0.0132	0.2597	0.0381	0.1260
GPT-5-mini	0.0124	0.0065	0.2024	0.0290	0.0914
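
To make the table's columns concrete, here is a simplified single-reference sketch of the unigram metrics. It assumes lowercase whitespace tokenization and reports ROUGE-1 as an F1 score; the paper's exact tokenizer, smoothing, and ROUGE variant are not stated in this review, so real library implementations may give different values. The function names and sentences are illustrative only.

```python
import math
from collections import Counter


def tokens(text: str) -> list:
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def clipped_overlap(cand: list, ref: list) -> int:
    # Each candidate token counts at most as often as it appears in the reference.
    return sum((Counter(cand) & Counter(ref)).values())


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision multiplied by the BLEU brevity penalty."""
    cand, ref = tokens(candidate), tokens(reference)
    if not cand:
        return 0.0
    precision = clipped_overlap(cand, ref) / len(cand)
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def rouge1_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall."""
    cand, ref = tokens(candidate), tokens(reference)
    overlap = clipped_overlap(cand, ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences, not drawn from iCliniq.
reference = "take the tablet twice daily after meals"
candidate = "take the tablet after meals"
print(round(bleu1(candidate, reference), 4))      # 0.6703
print(round(rouge1_f1(candidate, reference), 4))  # 0.8333
```

Both metrics reward word overlap with the reference answer rather than clinical correctness, which is one reason a judge-based score is reported alongside them.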

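Llama 3.3 70B Instruct achieves the highest BLEU-1, ROUGE-1, and ROUGE-L among the evaluated models.
Llama-4-Maverick 17B shows competitive efficiency, approaching the 70B model's performance with significantly fewer parameters.
GPT-5-mini performs poorly on the automatic metrics, possibly reflecting implementation or configuration issues.
There is a clear positive correlation between model size and medical QA performance, and architectural innovations let smaller models approach larger ones.
The LLM-as-a-Judge results agree with the automatic metrics, supporting the ranking and validating the evaluation framework.
Medical accuracy peaks with the top model (4.83/5), while safety is highest for GPT-5-mini (3.80/5) despite its weak lexical metrics.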
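
The first two findings can be checked against the reported metric table with a short script. The numbers below are copied from that table; the variable names are illustrative.

```python
METRICS = ["BLEU-1", "BLEU-4", "ROUGE-1", "ROUGE-2", "ROUGE-L"]

# Values copied from the reported results table.
SCORES = {
    "Llama-3-8B-Instruct": [0.1739, 0.0127, 0.2419, 0.0379, 0.1219],
    "Llama 3.2 3B": [0.2012, 0.0122, 0.2588, 0.0355, 0.1258],
    "Llama 3.3 70B Instruct": [0.2207, 0.0141, 0.2761, 0.0404, 0.1306],
    "Llama-4-Maverick 17B 128E Instruct": [0.2089, 0.0132, 0.2597, 0.0381, 0.1260],
    "GPT-5-mini": [0.0124, 0.0065, 0.2024, 0.0290, 0.0914],
}

# Best-scoring model for each metric.
best = {
    metric: max(SCORES, key=lambda name: SCORES[name][i])
    for i, metric in enumerate(METRICS)
}

# Maverick's score as a fraction of the 70B model's score, per metric.
top = SCORES["Llama 3.3 70B Instruct"]
maverick = SCORES["Llama-4-Maverick 17B 128E Instruct"]
ratios = {metric: maverick[i] / top[i] for i, metric in enumerate(METRICS)}

for metric in METRICS:
    print(f"{metric}: best = {best[metric]}, Maverick/70B = {ratios[metric]:.3f}")
```

On these figures, Llama 3.3 70B Instruct leads on all five metrics and Maverick reaches roughly 94 to 96 percent of its score on each. The same table also shows Llama 3.2 3B ahead of Llama-3-8B-Instruct on BLEU-1, ROUGE-1, and ROUGE-L, so the size trend is not strictly monotonic across every pair of models.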

This review was generated by AI and reviewed by a human editor.