QUICK REVIEW

[Paper Review] MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh|ArXiv.org|Jul 7, 2025

COVID-19 diagnosis using AI | Cited by 20

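One-Sentence Summary

MedGemma introduces medically tuned vision-language foundation models (a 4B multimodal model and a 27B text-only model) built on Gemma 3, together with the MedSigLIP encoder. The models show strong medical reasoning across multiple tasks and outperform similar-sized models, and fine-tuning further improves domain-specific performance.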

ABSTRACT

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

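Research Motivation and Objectives

Develop open, medically tuned vision-language foundation models to accelerate healthcare AI research and deployment.
Demonstrate medical understanding and reasoning across images and text, approaching the performance of task-specific models while remaining general-purpose.
Evaluate out-of-distribution performance and the gains from fine-tuning on subdomains such as radiology and histopathology.
Introduce MedSigLIP, the medically tuned vision encoder that underpins MedGemma.
Provide guides and resources for downloading and using the MedGemma model weights.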

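Proposed Method

Build the MedGemma variants on the Gemma 3 architecture: a 4B multimodal model and a 27B text-only model.
Incorporate the SigLIP-400M vision encoder, shared across Gemma model sizes, with an input resolution of 896x896.
Pretrain on a mixture of general and medical data, with a medically focused pretraining stage that adapts the vision-language alignment.
Post-train with distillation on medical text data, and apply reinforcement learning on medical image-text data to elicit capabilities.
Fine-tune on subdomains (e.g., chest X-ray reports, histopathology, electronic health record retrieval) to improve domain-specific tasks.
Release MedSigLIP 400M (the image encoder) with a 448x448 variant, along with tutorials and downloadable weights.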
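
The two input resolutions listed above (896x896 inside MedGemma, 448x448 for the released MedSigLIP variant) differ a lot in how many image patches the encoder has to process. The sketch below works out that arithmetic. It assumes a square ViT-style patch size of 14 pixels, which this summary does not state, and `patch_grid` is a hypothetical helper name used only for illustration.

```python
def patch_grid(image_size: int, patch_size: int = 14) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    patch_size=14 is an assumption for illustration; the summary does not
    state the patch size used by the encoder.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Compare the two resolutions mentioned in the method list.
for size in (896, 448):
    per_side, total = patch_grid(size)
    print(f"{size}x{size}: {per_side}x{per_side} grid, {total} patches")
```

Under that assumption, halving the side length from 896 to 448 cuts the patch count by a factor of four, which is the main reason a lower-resolution variant is cheaper to run.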

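Experimental Results

Research Questions

RQ1: How does MedGemma perform on medical text question-answering benchmarks compared with Gemma 3 baselines of the same size?
RQ2: How much does MedGemma improve medical image understanding and multimodal reasoning, especially on out-of-distribution tasks?
RQ3: Does subdomain fine-tuning of MedGemma improve performance on radiology, dermatology, and histopathology tasks?
RQ4: How does the MedSigLIP image encoder contribute to medical visual understanding compared with specialized encoders?
RQ5: What is the trade-off in performance on general benchmarks when MedGemma is specialized for medical tasks?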

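Key Findings

MedGemma 4B shows strong visual question answering performance and, despite its small size, outperforms earlier SOTA models.
MedGemma 4B and 27B are competitive with similar-sized open models on challenging text diagnostic benchmarks (e.g., MedQA, MedMCQA, PubMedQA, MMLU Med, AfriMed-QA, AgentClinic).
On out-of-distribution tasks, relative to the base models, MedGemma improves by 2.6-10% on medical multimodal question answering, by 15.5-18.1% on chest X-ray finding classification, and by 10.8% on agentic evaluations.
After subdomain fine-tuning, errors in electronic health record information retrieval drop by 50%, and performance on pneumothorax classification and histopathology patch classification is comparable to state-of-the-art methods.
MedSigLIP (the medical image encoder) performs comparably to or better than specialized medical image encoders, which makes medical image understanding more efficient when it is paired with MedGemma.
The MedGemma collection provides a strong foundation of medical image and text capabilities, with the potential to accelerate medical research and downstream applications.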
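
The findings above mix two kinds of numbers: improvements in a score (e.g., 2.6-10%) and a reduction in errors (50%). This summary does not say whether the score improvements are absolute points or relative gains, so the sketch below only shows how the two quantities are usually computed. The counts are hypothetical and are not results from the paper, and both function names are illustrative.

```python
def accuracy_gain_points(base_correct: int, tuned_correct: int, total: int) -> float:
    """Absolute accuracy gain in percentage points."""
    return 100 * (tuned_correct - base_correct) / total


def error_reduction(base_errors: int, tuned_errors: int) -> float:
    """Fraction of the baseline's errors that the tuned model removes."""
    if base_errors == 0:
        raise ValueError("baseline has no errors to reduce")
    return (base_errors - tuned_errors) / base_errors


# Hypothetical counts, NOT taken from the paper.
total = 1000
base_correct, tuned_correct = 800, 900

gain = accuracy_gain_points(base_correct, tuned_correct, total)
reduction = error_reduction(total - base_correct, total - tuned_correct)
print(f"accuracy gain: {gain} points, error reduction: {reduction:.0%}")
```

In this made-up case a 10-point accuracy gain corresponds to halving the error count, which shows why an error-reduction figure can look much larger than the matching accuracy gain when the baseline is already strong.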


This review was generated by AI and checked by human editors.