QUICK REVIEW

[Paper Review] Towards Expert-Level Medical Question Answering with Large Language Models

Karan Singhal, Tao Tu|arXiv (Cornell University)|May 16, 2023

Artificial Intelligence in Healthcare and Education|Cited by 332

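One-Sentence Summary

Med-PaLM 2 combines PaLM 2, domain-specific finetuning, and ensemble refinement to surpass earlier medical question-answering models, reaching state-of-the-art results on several benchmarks and earning favorable ratings in human evaluations of long-form answers.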

ABSTRACT

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.

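Motivation and Objectives

Use large language models to push medical question answering toward physician-level performance.
Evaluate the model on multi-domain medical QA benchmarks and on real-world-style long-form questions.
Develop and validate prompting strategies that improve medical reasoning and safety.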

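Proposed Method

Use PaLM 2 as the base large language model.
Finetune on medical-domain data by instruction finetuning on the MultiMedQA datasets (MedQA, MedMCQA, HealthSearchQA, LiveQA, MedicationQA).
Introduce ensemble refinement prompting, which aggregates multiple reasoning paths to improve the final answer.
Evaluate with several prompting strategies: few-shot, chain-of-thought, self-consistency, and ensemble refinement.
Conduct extensive human evaluation (by physicians and lay raters) of long-form answers and of adversarial datasets.
Analyze overlap between test sets and training data to assess potential data contamination.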
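
The summary does not say how test/train overlap was detected. As a minimal sketch, assume a test question counts as overlapping when the whole question, or any contiguous segment of `window` characters, appears verbatim in a training document. The function names and the 512-character default are illustrative assumptions, not details taken from this summary.

```python
def overlaps(question: str, documents: list[str], window: int = 512) -> bool:
    """Return True if the whole question, or any contiguous `window`-character
    segment of it, appears verbatim in at least one document.

    The matching rule and the default window size are assumptions.
    """
    if not question:
        return False
    if len(question) <= window:
        # Short questions are matched as a whole.
        segments = [question]
    else:
        # Slide a fixed-size window over the question text.
        segments = [question[i:i + window]
                    for i in range(len(question) - window + 1)]
    return any(seg in doc for doc in documents for seg in segments)


def split_by_overlap(questions: list[str], documents: list[str],
                     window: int = 512) -> tuple[list[str], list[str]]:
    """Partition test questions into (flagged, clean) by the rule above."""
    flagged = [q for q in questions if overlaps(q, documents, window)]
    clean = [q for q in questions if not overlaps(q, documents, window)]
    return flagged, clean
```

Comparing benchmark accuracy on the `flagged` and `clean` subsets is one way to estimate how much contamination could be inflating a reported score. The naive substring scan is quadratic and only suitable for small corpora.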

Figure 1: Med-PaLM 2 performance on MultiMedQA. Left: Med-PaLM 2 achieved an accuracy of 86.5% on USMLE-style questions in the MedQA dataset. Right: In a pairwise ranking study on 1066 consumer medical questions, Med-PaLM 2 answers were preferred over physician answers by a panel of physicians across

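Experimental Results

Research Questions

RQ1: Can Med-PaLM 2 match or exceed physician-level performance on standard medical question-answering benchmarks?
RQ2: Do domain-specific finetuning and advanced prompting strategies improve medical reasoning and safety in long-form answers?
RQ3: How robust are the model's outputs to adversarial or fairness-focused questions?
RQ4: How does train/test overlap affect the reported benchmark performance?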

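Key Findings

Med-PaLM 2 reaches up to 86.5% accuracy on USMLE-style MedQA questions, more than 19 percentage points above Med-PaLM's 67.2%.
Med-PaLM 2 approaches or exceeds the state of the art on MedMCQA, PubMedQA, and the MMLU clinical topics datasets.
In the long-form evaluation of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to physician-written answers on eight of nine clinical-utility axes (p < 0.001); lay raters judged Med-PaLM 2 answers more helpful and more relevant.
On the adversarial datasets, Med-PaLM 2 significantly outperforms Med-PaLM on every evaluation axis probing safety and limitations (for example, lower risk of harm and better alignment with medical consensus).
Ensemble refinement, a simple prompting strategy, notably improves over the few-shot and self-consistency baselines on multiple-choice benchmarks (for example, MedQA and MMLU variants).
The overlap analysis indicates limited but non-negligible test/train contamination, with a modest effect on reported performance.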
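
The "over 19%" MedQA gain quoted in the abstract is a difference between two accuracy percentages, so it is best read as percentage points. A short check using only the two scores given in the abstract (the variable names are illustrative):

```python
med_palm_accuracy = 67.2    # MedQA score reported for Med-PaLM (%)
med_palm_2_accuracy = 86.5  # best MedQA score reported for Med-PaLM 2 (%)

# Absolute gain, in percentage points.
absolute_gain = med_palm_2_accuracy - med_palm_accuracy

# Relative gain, as a percentage of the Med-PaLM score.
relative_gain = absolute_gain / med_palm_accuracy * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")  # 19.3
print(f"relative gain: {relative_gain:.1f}%")                   # 28.7
```

The absolute figure matches the abstract's wording; the relative figure is larger and is not the number the abstract reports.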

Figure 2: Illustration of Ensemble Refinement (ER) with Med-PaLM 2. In this approach, an LLM is conditioned on multiple possible reasoning paths that it generates to enable it to refine and improve its answer.
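
The two-stage idea in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for a model call that returns an (explanation, answer) pair, the prompt wording is invented, the sample counts are arbitrary, and the final plurality vote is one plausible way to aggregate the refined answers.

```python
from collections import Counter
from typing import Callable

# Hypothetical model interface (not a real library API):
# takes a prompt and a sampling temperature, returns (explanation, answer).
Generate = Callable[[str, float], tuple[str, str]]


def ensemble_refinement(question: str, generate: Generate,
                        num_paths: int = 3, num_refinements: int = 5,
                        temperature: float = 0.7) -> str:
    # Stage 1: sample several reasoning paths for the same question.
    paths = [generate(question, temperature) for _ in range(num_paths)]

    # Stage 2: condition the model on the question plus all sampled paths
    # and ask it for a refined answer, several times.
    candidates = "\n\n".join(
        f"Candidate {i + 1}:\nExplanation: {explanation}\nAnswer: {answer}"
        for i, (explanation, answer) in enumerate(paths)
    )
    refine_prompt = (
        f"{question}\n\n{candidates}\n\n"
        "Considering the candidates above, give a refined explanation and answer."
    )
    refined_answers = [generate(refine_prompt, temperature)[1]
                       for _ in range(num_refinements)]

    # Aggregate the refined answers with a plurality vote (an assumption).
    return Counter(refined_answers).most_common(1)[0][0]
```

Unlike plain self-consistency, which only votes over the first-stage answers, this sketch lets the model read its own candidate explanations before it answers again.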

This review was generated by AI and checked by human editors.