QUICK REVIEW

[Paper Review] Med-Flamingo: a Multimodal Medical Few-shot Learner

Michael Moor, Qian Huang|arXiv (Cornell University)|Jul 27, 2023

Multimodal Machine Learning Applications | Cited by 45

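One-Sentence Summary

Med-Flamingo adapts Flamingo to the medical domain to enable generative, few-shot multimodal learning for medical VQA; it is evaluated with clinician-rated open-ended answers and a new Visual USMLE dataset.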

ABSTRACT

Medicine, by its nature, is a multifaceted domain that requires the synthesis of information across various modalities. Medical generative vision-language models (VLMs) make a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable down-stream datasets, which poses a significant limitation as in many medical applications data is scarce, necessitating models that are capable of learning from few examples in real-time. Here we propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which we evaluate on several datasets including a novel challenging open-ended VQA dataset of visual USMLE-style problems. Furthermore, we conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20% in clinician's rating and firstly enables multimodal medical few-shot adaptations, such as rationale generation. We release our model, code, and evaluation app under https://github.com/snap-stanford/med-flamingo.

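Motivation and Objectives

Elicit and enable few-shot multimodal in-context learning in the medical domain.
Pre-train a medically adapted vision-language model on interleaved medical image-text data drawn from textbooks and PubMed sources.
Demonstrate generative medical VQA and rationale generation through human evaluation by clinicians.
Create and release a new Visual USMLE-style VQA dataset spanning multiple specialties.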

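Proposed Method

Build Med-Flamingo by continuing to pre-train OpenFlamingo-9B on interleaved medical image-text data (the MTB dataset) and paired PMC-OA data.
Train on the paired and interleaved data with a joint objective, with λ set to 1.
Evaluate few-shot generative medical VQA on VQA-RAD, PathVQA, and Visual USMLE using clinically grounded evaluation metrics.
Run a blinded human evaluation in an app, in which clinicians rate the generations on a 0–10 clinical usefulness scale.
Deduplicate and remove leakage by identifying visually similar images between the pre-training and evaluation sets, using Vision Transformer representations and FAISS.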
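
The deduplication step can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes image embeddings are already available as plain lists of floats (the paper obtains them from a Vision Transformer), and it replaces FAISS with a brute-force cosine-similarity search using only the standard library. The function names, the image ids, the toy vectors, and the 0.95 threshold are all hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_leaked(eval_embs, pretrain_embs, threshold=0.95):
    """Return ids of evaluation images whose most similar pre-training
    image reaches `threshold`. The threshold is an assumed value."""
    leaked = []
    for eval_id, emb in eval_embs.items():
        best = max((cosine(emb, p) for p in pretrain_embs.values()), default=0.0)
        if best >= threshold:
            leaked.append(eval_id)
    return leaked


# Toy 3-dimensional "embeddings" (hypothetical; real ViT features are much larger).
pretrain = {"pretrain_img_1": [1.0, 0.0, 0.0], "pretrain_img_2": [0.0, 1.0, 0.0]}
evaluation = {"test_a": [0.99, 0.05, 0.0], "test_b": [0.0, 0.0, 1.0]}

leaked = find_leaked(evaluation, pretrain)
clean_eval = {k: v for k, v in evaluation.items() if k not in leaked}
print(leaked)            # ids flagged as near-duplicates
print(list(clean_eval))  # ids kept for evaluation
```

At the scale of a real pre-training corpus, the inner loop would be replaced by an approximate or exact nearest-neighbor index such as the one FAISS provides; the brute-force search here only shows the logic of flagging evaluation images that sit too close to pre-training images.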

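Experimental Results

Research Questions

RQ1: Can a vision-language model be transferred to the medical domain so that few-shot prompting enables multimodal in-context learning?
RQ2: Do generative medical VQA outputs (including rationales) agree with clinician judgment across multiple medical modalities and specialties?
RQ3: Does pre-training on medical-domain data from textbooks and PMC improve few-shot VQA performance over general-domain baselines?
RQ4: How does few-shot prompting affect the plausibility and clinical usefulness of the generated rationales and diagnoses?
RQ5: Can a new Visual USMLE-style dataset substantially challenge and evaluate multimodal medical VQA beyond radiology and pathology tasks?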

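Key Findings

In the few-shot setting, Med-Flamingo achieves the best average clinician rating across three generative medical VQA datasets, improving on baselines by up to 20%.
The model can reason about complex questions and generate rationales, a capability that earlier multimodal medical foundation models had not demonstrated.
Visual USMLE provides cross-specialty multimodal problems accompanied by images, case vignettes, and lab results, extending beyond radiology and pathology datasets.
Deduplication revealed that PathVQA (PVQA) test images had leaked from the pre-training data; 194 highly similar images were removed to preserve evaluation integrity.
On VQA-RAD and PathVQA, automated metrics (BERT-sim, exact-match) did not reliably reflect clinical usefulness, underscoring the importance of human evaluation.
Med-Flamingo shows strong few-shot performance but also carries safety caveats and can hallucinate, highlighting the need for further data and alignment work.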
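
The gap between exact-match and clinical usefulness can be made concrete with a small sketch. It assumes a common definition of exact-match (string equality after lowercasing and stripping punctuation and extra whitespace), which may differ from the paper's implementation. The question-answer strings are invented for illustration and are not taken from any dataset.

```python
import re
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, reference):
    """True only if the normalized strings are identical (assumed definition)."""
    return normalize(prediction) == normalize(reference)


# Hypothetical reference answer and two hypothetical model generations.
reference = "Pneumothorax"
verbose = "The image shows a right-sided pneumothorax."
terse = "pneumothorax."

em_verbose = exact_match(verbose, reference)
em_terse = exact_match(terse, reference)
print(em_verbose, em_terse)
```

Under this definition the longer generation scores zero even though it names the same finding, while the one-word answer scores one. A clinician would likely rate both as correct, which is one way an automated string metric can diverge from a human usefulness rating.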

This review was generated by AI and checked by human editors.