QUICK REVIEW

[Paper Review] MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh | arXiv.org | Jul 7, 2025

COVID-19 diagnosis using AI | 20 citations

TL;DR

MedGemma introduces medically tuned foundation models built on Gemma 3 (a 4B multimodal model and a 27B text-only model), plus the MedSigLIP image encoder. The models show strong medical reasoning, outperform the base Gemma 3 models on several tasks, and improve further on domain-specific tasks after fine-tuning.

ABSTRACT

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

Motivation & Objective

Develop open, medically-tuned vision-language foundation models to accelerate healthcare AI research and deployment.
Demonstrate medical understanding and reasoning across images and text that approaches the performance of task-specific models while retaining the general capabilities of the Gemma 3 base models.
Assess out-of-distribution performance and the benefits of fine-tuning on subdomains like radiology and histopathology.
Introduce MedSigLIP as a medically tuned vision encoder powering MedGemma.
Provide guidance and resources for downloading and using MedGemma model weights.

Proposed method

Build MedGemma variants on the Gemma 3 architecture: a 4B multimodal model and a 27B text-only model.
Use a SigLIP-400M vision encoder, shared across Gemma sizes, at 896x896 input resolution.
Pretrain using a mixture of general and medical data, with a medical-focused pretraining phase to adapt the vision-language alignment.
Apply post-training via distillation with medical text data and reinforcement learning on medical image–text data to surface capabilities.
Fine-tune on subdomains (e.g., chest X-ray reporting, histopathology, electronic health record retrieval) to improve domain-specific tasks.
Release MedSigLIP 400M (image encoder) with a 448x448 variant and provide tutorials and weights for download.
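
To make the two input resolutions above concrete, the sketch below counts the image patches a ViT-style encoder would produce at each size. The patch size of 14 pixels is an assumption (this review does not state it), and `patch_grid` is a hypothetical helper, not part of any MedGemma API.

```python
def patch_grid(image_size: int, patch_size: int = 14) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square input image.

    patch_size=14 is an ASSUMPTION for illustration; it is not given in the review.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Resolutions named in the method summary: 896x896 (encoder input) and 448x448 (MedSigLIP variant).
for name, size in [("MedGemma encoder input", 896), ("MedSigLIP 448 variant", 448)]:
    per_side, total = patch_grid(size)
    print(f"{name}: {size}x{size} -> {per_side}x{per_side} grid, {total} patches")
```

Under this assumption, halving the side length cuts the patch count to a quarter, which is one plausible reason to offer a lower-resolution encoder variant.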

Experimental results

Research questions

RQ1: How does MedGemma perform on medical text QA benchmarks relative to base Gemma 3 models of the same size?
RQ2: What are the gains of MedGemma in medical image understanding and multimodal reasoning, especially on out-of-distribution tasks?
RQ3: Can fine-tuning MedGemma on subdomains improve performance in radiology, dermatology, and histopathology tasks?
RQ4: How does the MedSigLIP image encoder contribute to medical visual understanding compared to specialized encoders?
RQ5: What is the performance trade-off on general-purpose benchmarks when specializing MedGemma for medical tasks?

Key findings

MedGemma 4B shows strong visual question answering performance compared to prior SOTA models despite being smaller.
MedGemma 4B and 27B are competitive on challenging text-only medical benchmarks (e.g., MedQA, MedMCQA, PubMedQA, MMLU Med, AfriMed-QA, AgentClinic) against open models of similar scale.
On out-of-distribution tasks, MedGemma improves over the base models by 2.6-10% on medical multimodal QA, 15.5-18.1% on chest X-ray finding classification, and 10.8% on agentic evaluations.
Fine-tuning MedGemma on subdomains reduces electronic health record information retrieval errors by 50% and reaches comparable performance to state-of-the-art methods for pneumothorax classification and histopathology patch type classification.
MedSigLIP (the medical image encoder) achieves performance comparable to or better than specialized medical image encoders, enabling efficient medical image understanding when used with MedGemma.
The MedGemma collection provides a strong medical image and text foundation with potential to accelerate medical research and downstream applications.
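
The 50% figure above is a relative reduction in errors, which is easy to confuse with a gain in accuracy points. The sketch below works through the arithmetic on hypothetical accuracies; the 80% and 90% values are illustrative assumptions, not results from the paper, and `relative_error_reduction` is a hypothetical helper.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of errors removed when accuracy moves from acc_before to acc_after."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    if err_before <= 0.0:
        raise ValueError("acc_before must be below 1.0")
    return (err_before - err_after) / err_before


# HYPOTHETICAL accuracies chosen only to illustrate the arithmetic.
reduction = relative_error_reduction(0.80, 0.90)
print(f"Accuracy +10 points, errors reduced by {reduction:.0%}")
```

The same relative reduction can correspond to very different absolute gains: halving a 20% error rate adds ten accuracy points, while halving a 2% error rate adds only one.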

This review was created by AI and reviewed by human editors.