[Paper Review] MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Suhana Bedi, Hejie Cui | arXiv.org | May 26, 2025
Topic Modeling | Cited by 9
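One-Sentence Summary

MedHELM develops a clinician-validated taxonomy of medical tasks and 35 benchmarks for comprehensively evaluating LLM performance on real-world medical tasks; it includes a cost-performance analysis and an LLM-jury evaluation aligned with clinician judgment.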

ABSTRACT

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open source framework to enable this.

Motivation and Objectives

  • Develop a clinician-validated taxonomy of medical tasks spanning five categories, 22 subcategories, and 121 tasks.
  • Create a benchmark suite covering all taxonomy elements, including open- and closed-ended tasks from public and private data.
  • Systematically compare frontier LLMs using real-world task benchmarks and a novel LLM-jury evaluation.
  • Assess cost-performance trade-offs to inform deployment decisions in healthcare settings.
  • Provide an open, extensible framework and leaderboard to enable ongoing, reproducible medical LLM evaluation.

Proposed Method

  • Co-developed a taxonomy with 29 clinicians and confirmed high agreement (96.7%) on the mapping of subcategories to top-level categories.
  • Constructed 35 benchmarks (17 existing, 5 reformulated, 13 new; 12 open EHR-based) covering 22 subcategories.
  • Applied uniform prompting and decoding settings across nine frontier LLMs; used exact match for closed-ended benchmarks and an LLM-jury ensemble for open-ended benchmarks.
  • LLM-jury evaluates open-ended outputs with three models (GPT-4o, Claude 3.7 Sonnet, LLaMA 3.3 70B) scoring on accuracy, completeness, and clarity (1–5 Likert), averaged over judges.
  • Validated the LLM-jury against clinician ratings, treated as the gold standard, using intraclass correlation (ICC) comparisons.
  • Ran a cost-performance analysis using publicly listed pricing to estimate total evaluation costs across benchmarks.
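
As a rough illustration of the LLM-jury aggregation described in the list above, the sketch below averages each judge's 1–5 Likert ratings over the three criteria and then averages across judges. The function names (`jury_score`, `normalize_likert`), the example ratings, and the linear rescaling to a 0–1 range are assumptions made for illustration; this summary does not specify the exact aggregation order or normalization used by MedHELM.

```python
from statistics import mean

# Criteria named in the method description above.
CRITERIA = ("accuracy", "completeness", "clarity")


def jury_score(ratings):
    """Average 1-5 Likert ratings over criteria, then over judges.

    `ratings` maps a judge name to a dict of criterion -> rating.
    The criteria-then-judges order is an assumption; with complete
    ratings it gives the same result as a flat average.
    """
    per_judge = []
    for judge, scores in ratings.items():
        for criterion in CRITERIA:
            if not 1 <= scores[criterion] <= 5:
                raise ValueError(f"{judge}/{criterion}: rating must be in 1..5")
        per_judge.append(mean(scores[c] for c in CRITERIA))
    return mean(per_judge)


def normalize_likert(score, low=1, high=5):
    """Linearly rescale a Likert score to 0-1 (hypothetical normalization)."""
    return (score - low) / (high - low)


# Made-up ratings for a single model response (not data from the paper).
example_ratings = {
    "GPT-4o": {"accuracy": 4, "completeness": 5, "clarity": 3},
    "Claude 3.7 Sonnet": {"accuracy": 5, "completeness": 4, "clarity": 3},
    "LLaMA 3.3 70B": {"accuracy": 4, "completeness": 4, "clarity": 4},
}

if __name__ == "__main__":
    raw = jury_score(example_ratings)
    print(raw, normalize_likert(raw))
```

Averaging over several judges is intended to dampen the idiosyncrasies of any single judge model, which is the usual motivation for a jury over a single LLM judge.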

Experimental Results

Research Questions

  • RQ1: How well does a clinician-validated taxonomy map real-world medical tasks to meaningful evaluation categories?
  • RQ2: Can a 35-benchmark suite provide comprehensive coverage of medical tasks beyond licensing exams?
  • RQ3: How do frontier LLMs compare on real-world medical tasks when evaluated with task-specific metrics and LLM-jury scores?
  • RQ4: What are the cost implications of deploying various LLMs for medical tasks?
  • RQ5: Does the LLM-jury approach align with clinician ratings better than traditional automated metrics?

Key Findings

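| Model (snapshot) | Win-rate ↑ | Win-rate SD ↓ | Macro-avg ↑ | Macro-avg SD ↓ |
|---|---|---|---|---|
| DeepSeek R1 | 0.66 | 0.10 | 0.75 | 0.22 |
| o3-mini (2025-01-31) | 0.64 | 0.16 | 0.77 | 0.18 |
| Claude 3.7 Sonnet (20250219) | 0.64 | 0.13 | 0.73 | 0.21 |
| Claude 3.5 Sonnet (20241022) | 0.63 | 0.14 | 0.73 | 0.21 |
| GPT-4o (2024-05-13) | 0.57 | 0.17 | 0.73 | 0.18 |
| Gemini 2.0 Flash | 0.42 | 0.17 | 0.70 | 0.21 |
| GPT-4o mini (2024-07-18) | 0.39 | 0.18 | 0.71 | 0.20 |
| Llama 3.3 Instruct (70B) | 0.30 | 0.13 | 0.69 | 0.22 |
| Gemini 1.5 Pro (001) | 0.24 | 0.08 | 0.67 | 0.21 |
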
  • Reasoning models (DeepSeek R1, o3-mini) achieve the highest pairwise win-rates (0.66 and 0.64).
  • Claude 3.5 Sonnet offers competitive results at about 40% lower estimated cost.
  • Most models perform best in Clinical Note Generation (0.74–0.85) and Patient Communication & Education (0.76–0.89).
  • Moderate performance in Medical Research Assistance (0.65–0.75) and Clinical Decision Support (0.61–0.76).
  • Administration & Workflow tasks are comparatively weaker (0.53–0.63).
  • LLM-jury ICC with clinician ratings is 0.47, outperforming ROUGE-L (0.36), BERTScore-F1 (0.44), and the average clinician–clinician agreement (0.43).
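
To make the pairwise win-rate metric in these findings more concrete, here is a minimal sketch under a stated assumption: a model's win-rate is taken to be the fraction of head-to-head comparisons, over every other model and every shared benchmark, in which it scores strictly higher. The function name `win_rates`, the tie handling (ties count as non-wins), and the toy scores are hypothetical and are not taken from the paper.

```python
def win_rates(scores):
    """Compute a pairwise win-rate per model.

    `scores` maps model -> {benchmark: score}. For each model, every
    (other model, shared benchmark) pair is one comparison; a win is a
    strictly higher score. Ties count as non-wins (an assumption).
    """
    result = {}
    for model, own in scores.items():
        wins = 0
        total = 0
        for other, theirs in scores.items():
            if other == model:
                continue
            for bench, value in own.items():
                if bench not in theirs:
                    continue  # compare only on benchmarks both models ran
                total += 1
                if value > theirs[bench]:
                    wins += 1
        result[model] = wins / total if total else 0.0
    return result


# Toy scores for three hypothetical models on two hypothetical benchmarks.
toy_scores = {
    "model_a": {"bench_1": 0.9, "bench_2": 0.5},
    "model_b": {"bench_1": 0.8, "bench_2": 0.7},
    "model_c": {"bench_1": 0.7, "bench_2": 0.6},
}

if __name__ == "__main__":
    for name, rate in win_rates(toy_scores).items():
        print(f"{name}: {rate:.2f}")
```

Because win-rate depends only on rank order within each benchmark, two models can have similar macro-averages but quite different win-rates, which is consistent with the spread between the two columns in the table above.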

This review was generated by AI and reviewed by human editors.