QUICK REVIEW

[Paper Review] MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Suhana Bedi, Hejie Cui|ArXiv.org|May 26, 2025

Topic Modeling | Cited by 9

One-Sentence Summary

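MedHELM develops a clinician-validated taxonomy of medical tasks and 35 benchmarks for comprehensively evaluating LLM performance on real-world medical tasks; it includes a cost-performance analysis and an LLM-jury evaluation aligned with clinician judgment.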

ABSTRACT

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provides an open source framework to enable this.

Research Motivation and Objectives

Develop a clinician-validated taxonomy of medical tasks spanning five categories, 22 subcategories, and 121 tasks.
Create a benchmark suite covering all taxonomy elements, including open- and closed-ended tasks from public and private data.
Systematically compare frontier LLMs using real-world task benchmarks and a novel LLM-jury evaluation.
Assess cost-performance trade-offs to inform deployment decisions in healthcare settings.
Provide an open, extensible framework and leaderboard to enable ongoing, reproducible medical LLM evaluation.

Proposed Method

Co-developed a taxonomy with 29 clinicians, who showed high agreement (96.7%) when mapping subcategories to top-level categories.
Constructed 35 benchmarks (17 existing, 5 reformulated, 13 new; 12 open EHR-based) covering 22 subcategories.
Applied uniform prompting and decoding across nine frontier LLMs, scoring closed-ended benchmarks by exact match and open-ended benchmarks with an LLM-jury ensemble.
The LLM-jury scores open-ended outputs with three models (GPT-4o, Claude 3.7 Sonnet, Llama 3.3 70B) on accuracy, completeness, and clarity (1–5 Likert scale), averaging across judges.
Validated the LLM-jury against clinician ratings, treated as the gold standard, using intraclass correlation (ICC) comparisons.
Cost-performance analysis using publicly listed pricing to estimate total evaluation costs across benchmarks.
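The two scoring paths described above (exact match for closed-ended benchmarks, jury averaging for open-ended ones) can be illustrated with a minimal Python sketch. This is not the MedHELM implementation: the function names, the whitespace and case normalization in `exact_match`, the equal weighting of the three criteria, and the linear rescaling of 1–5 ratings to a 0–1 range are all assumptions made for illustration, and the ratings in the example are hypothetical.

```python
from statistics import mean

# The three rubric dimensions named in the method description.
CRITERIA = ("accuracy", "completeness", "clarity")


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the answers match after trimming and lowercasing (assumed normalization)."""
    return float(prediction.strip().lower() == reference.strip().lower())


def jury_score(ratings: dict[str, dict[str, int]]) -> float:
    """Average 1-5 Likert ratings over criteria for each judge, then over judges."""
    per_judge = []
    for judge, scores in ratings.items():
        for criterion in CRITERIA:
            if not 1 <= scores[criterion] <= 5:
                raise ValueError(f"{judge}: {criterion} must be in 1..5")
        per_judge.append(mean(scores[c] for c in CRITERIA))
    return mean(per_judge)


def normalize_likert(score: float) -> float:
    """Map a 1-5 score onto 0-1 (assumed linear rescaling, not confirmed by the source)."""
    return (score - 1) / 4


if __name__ == "__main__":
    # Hypothetical ratings for a single open-ended model output.
    example = {
        "GPT-4o": {"accuracy": 4, "completeness": 4, "clarity": 5},
        "Claude 3.7 Sonnet": {"accuracy": 5, "completeness": 4, "clarity": 4},
        "Llama 3.3 70B": {"accuracy": 3, "completeness": 4, "clarity": 4},
    }
    raw = jury_score(example)
    print(f"jury score: {raw:.2f}, normalized: {normalize_likert(raw):.2f}")
    print("exact match:", exact_match(" B ", "b"))
```

Averaging within each judge first keeps every judge's weight equal even if a rubric were later extended with judge-specific criteria.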

Experimental Results

Research Questions

RQ1: How well does a clinician-validated taxonomy map real-world medical tasks to meaningful evaluation categories?
RQ2: Can a 35-benchmark suite provide comprehensive coverage of medical tasks beyond licensing exams?
RQ3: How do frontier LLMs compare on real-world medical tasks when evaluated with task-specific metrics and LLM-jury scores?
RQ4: What are the cost implications of deploying various LLMs for medical tasks?
RQ5: Does the LLM-jury approach align with clinician ratings better than traditional automated metrics?

Key Findings

Model (snapshot)	Win-rate ↑	Win SD ↓	Macro-avg ↑	SD ↓
DeepSeek R1	0.66	0.10	0.75	0.22
o3-mini (2025-01-31)	0.64	0.16	0.77	0.18
Claude 3.7 Sonnet (20250219)	0.64	0.13	0.73	0.21
Claude 3.5 Sonnet (20241022)	0.63	0.14	0.73	0.21
GPT-4o (2024-05-13)	0.57	0.17	0.73	0.18
Gemini 2.0 Flash	0.42	0.17	0.70	0.21
GPT-4o mini (2024-07-18)	0.39	0.18	0.71	0.20
Llama 3.3 Instruct (70B)	0.30	0.13	0.69	0.22
Gemini 1.5 Pro (001)	0.24	0.08	0.67	0.21
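To make the win-rate column concrete, here is a minimal Python sketch of a pairwise win-rate computed from per-benchmark scores. It assumes a head-to-head definition (the fraction of model-versus-model comparisons, across shared benchmarks, that a model wins) and counts ties as half a win; the review does not spell out the exact formula, so both choices are assumptions, and the model names and scores below are hypothetical.

```python
from itertools import permutations


def pairwise_win_rates(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[model][benchmark] -> score; returns each model's share of head-to-head wins."""
    wins = {model: 0.0 for model in scores}
    comparisons = {model: 0 for model in scores}
    for a, b in permutations(scores, 2):
        # Only compare on benchmarks that both models were evaluated on.
        for bench in scores[a].keys() & scores[b].keys():
            comparisons[a] += 1
            if scores[a][bench] > scores[b][bench]:
                wins[a] += 1.0
            elif scores[a][bench] == scores[b][bench]:
                wins[a] += 0.5  # assumed tie handling
    return {
        model: wins[model] / comparisons[model] if comparisons[model] else 0.0
        for model in scores
    }


if __name__ == "__main__":
    # Hypothetical scores on two benchmarks, not taken from the paper.
    toy = {
        "model_a": {"bench_x": 0.9, "bench_y": 0.5},
        "model_b": {"bench_x": 0.8, "bench_y": 0.7},
        "model_c": {"bench_x": 0.6, "bench_y": 0.6},
    }
    for model, rate in pairwise_win_rates(toy).items():
        print(model, rate)
```

A rank-based measure like this can separate models whose macro-averages are close, which may explain why the two columns in the table order some models differently.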

Reasoning models (DeepSeek R1, o3-mini) achieve the highest pairwise win-rates (0.66 and 0.64); Claude 3.7 Sonnet matches o3-mini at 0.64.
Claude 3.5 Sonnet offers competitive results at about 40% lower estimated cost.
Most models perform best in Clinical Note Generation (0.74–0.85) and Patient Communication & Education (0.76–0.89).
Moderate performance in Medical Research Assistance (0.65–0.75) and Clinical Decision Support (0.61–0.76).
Administration & Workflow tasks are comparatively weaker (0.53–0.63).
The LLM-jury's ICC with clinician ratings is 0.47, exceeding the average clinician–clinician agreement (0.43) as well as ROUGE-L (0.36) and BERTScore-F1 (0.44).
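For readers unfamiliar with the agreement statistic, the following Python sketch computes an intraclass correlation from a subjects-by-raters matrix. The review does not say which ICC form the authors used; this sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater) and is meant only to show how such a number is derived, not to reproduce the paper's values. The ratings in the example are hypothetical.

```python
def icc2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1) for a complete matrix: rows are rated items, columns are raters."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with at least 2 items and 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_error) / denom


if __name__ == "__main__":
    # Hypothetical 1-5 scores: column 0 is a jury score, column 1 a clinician score.
    paired = [[4, 5], [3, 3], [2, 3], [5, 4], [1, 2]]
    print(f"ICC(2,1) = {icc2_1(paired):.2f}")
```

Under this form, systematic offsets between raters lower the score, so a jury that ranks outputs like clinicians but rates them uniformly higher would still be penalized.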

This review was generated by AI and reviewed by human editors.