QUICK REVIEW

[Paper Review] MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Yunfei Xie, Ce Zhou|arXiv (Cornell University)|Aug 6, 2024

Biomedical Text Mining and Ontologies|Cited by 7

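ONE-SENTENCE SUMMARY

The paper introduces MedTrinity-25M, a large-scale multimodal medical dataset of more than 25 million image-ROI-description triplets with multigranular annotations across 10 modalities and 65+ diseases, built by an automated pipeline that needs no paired text and relies on expert grounding models, RAG, and MLLMs.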

ABSTRACT

This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. These multigranular annotations encompass both global information, such as modality and organ detection, and local information like ROI analysis, lesion texture, and region-wise correlations. Unlike the existing multimodal datasets, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations in the form of image-ROI-description triplets without the need for any paired text descriptions. Specifically, data from over 30 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. We propose LLaVA-Tri by pretraining LLaVA on MedTrinity-25M, achieving state-of-the-art performance on VQA-RAD, SLAKE, and PathVQA, surpassing representative SOTA multimodal large language models. Furthermore, MedTrinity-25M can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain. We will make our dataset available.

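MOTIVATION AND GOALS

Motivate the need for multigranular medical visual descriptions that link local ROIs to the global disease context.
Provide a scalable, automated pipeline that generates rich image-ROI-description annotations from unpaired medical images.
Enable a broad range of multimodal tasks (captioning, report generation, classification, segmentation) as well as large-scale pretraining of medical AI models.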

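PROPOSED METHOD

Assemble 25M+ image-ROI-description triplets spanning 10 modalities and 65+ diseases, with data drawn from 90+ online resources.
Localize ROIs with expert grounding models, converting masks to bounding boxes where needed.
Build a medical knowledge base from PubMed, StatPearls, and textbooks, and index it with Faiss for retrieval-augmented generation.
Prompt a medical-domain LLM stack (GPT-4V subset → LLaVA-Med Captioner, combined with LLAMA3 enhancement and multi-scale features) to generate multigranular text descriptions guided by the coarse caption, the ROIs, and the retrieved knowledge.
Fine-tune LLaVA-Med++ on MedTrinity-25M to produce the complete set of 25M image-ROI-description triplets.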
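
The ROI and retrieval steps above can be illustrated with a minimal sketch. It is not the authors' code: it assumes a binary mask stored as nested lists and toy embedding vectors, and it uses a brute-force cosine search as a stand-in for the Faiss index mentioned in the summary. The names mask_to_bbox, retrieve, and build_prompt, as well as the prompt wording, are hypothetical.

```python
from math import sqrt


def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, knowledge_base, k=2):
    """Brute-force top-k search over (text, embedding) pairs; a stand-in for a vector index."""
    ranked = sorted(knowledge_base, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(coarse_caption, bbox, passages):
    """Combine the coarse caption, ROI box, and retrieved passages into one text prompt."""
    knowledge = "\n".join(f"- {p}" for p in passages)
    return (
        f"Coarse caption: {coarse_caption}\n"
        f"ROI bounding box (x_min, y_min, x_max, y_max): {bbox}\n"
        f"Retrieved knowledge:\n{knowledge}\n"
        "Describe the image at multiple granularities."
    )


if __name__ == "__main__":
    # Toy 4x4 mask with a small abnormal region (illustrative values only).
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    # Toy knowledge base: (passage, embedding) pairs with made-up vectors.
    kb = [
        ("Passage about lung nodules.", [1.0, 0.0]),
        ("Passage about liver lesions.", [0.0, 1.0]),
    ]
    passages = retrieve([0.9, 0.1], kb, k=1)
    print(build_prompt("Chest CT, axial view.", mask_to_bbox(mask), passages))
```

In a real pipeline the embeddings would come from a text encoder and the search would be delegated to a vector index; the brute-force version only shows where the ROI box and retrieved passages enter the prompt.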

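EXPERIMENTAL RESULTS

Research Questions

RQ1: Can unpaired medical images be converted into high-quality multigranular image-ROI-description triplets using automated grounding, retrieval-augmented generation, and MLLMs?
RQ2: Do multigranular annotations improve performance on downstream multimodal medical tasks (such as VQA and report generation) compared with existing datasets?
RQ3: Does pretraining on MedTrinity-25M yield better results on medical VQA benchmarks than models trained without the dataset?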

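Key Findings

MedTrinity-25M consists of more than 25 million image-ROI-description triplets from 90+ sources, covering 10 modalities and 65+ diseases.
The dataset provides multigranular text descriptions covering modality, organ, ROI location, and inter-region relationships, together with ROI-level bounding boxes or masks.
A GPT-4V-based alignment evaluation on SLAKE and MIMIC-CXR shows high agreement with human annotations (overall scores of 8.2/10 and 8.9/10 across five criteria).
LLaVA-Med++ pretrained on MedTrinity-25M achieves state-of-the-art performance on VQA-RAD and PathVQA and ranks third on SLAKE among the evaluated baselines.
Compared with training without the dataset, pretraining on MedTrinity-25M improves downstream VQA results by about 10.75% on VQA-RAD, 6.1% on SLAKE, and 13.25% on PathVQA.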
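
To make the multigranular fields listed in the findings concrete, here is a hedged sketch of how a single image-ROI-description record could be represented. The class name, field names, and example values are assumptions for illustration and do not reflect the dataset's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class ImageRoiDescription:
    """One hypothetical triplet: an image, its region of interest, and a text description."""

    image_path: str                                # where the image file lives
    modality: str                                  # global attribute, e.g. "CT" or "MRI"
    organ: str                                     # global attribute, e.g. "lung"
    roi_bbox: Optional[Tuple[int, int, int, int]]  # (x_min, y_min, x_max, y_max), None if no ROI
    description: str                               # multigranular text description

    def to_json(self) -> str:
        """Serialize the record to JSON; the bounding box tuple becomes a JSON array."""
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder values, not taken from the dataset.
    record = ImageRoiDescription(
        image_path="example.png",
        modality="CT",
        organ="lung",
        roi_bbox=(1, 1, 2, 2),
        description="Placeholder description of the region of interest.",
    )
    print(record.to_json())
```

Keeping the global attributes (modality, organ) and the local ROI box in one record mirrors the global-plus-local annotation idea described in the abstract; a mask could be stored alongside or instead of the box.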

This review was generated by AI and reviewed by a human editor.