QUICK REVIEW

[Paper Review] Exploring scalable medical image encoders beyond text supervision

Fernando Pérez‐García, Harshita Sharma|arXiv (Cornell University)|Jan 19, 2024

Artificial Intelligence in Healthcare and Education | Cited by 8

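ONE-SENTENCE SUMMARY

The paper presents RAD-DINO, an image-only self-supervised biomedical image encoder trained with DINOv2 and masked image modelling, which matches or exceeds text-supervised models on classification, segmentation, and vision-language tasks; its performance scales with training data, and its features correlate more strongly with clinical information.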

ABSTRACT

Language-supervised pre-training has proven to be a valuable method for extracting semantically meaningful features from images, serving as a foundational element in multimodal systems within the computer vision and medical imaging domains. However, the computed features are limited by the information contained in the text, which is particularly problematic in medical imaging, where the findings described by radiologists focus on specific observations. This challenge is compounded by the scarcity of paired imaging-text data due to concerns over leakage of personal health information. In this work, we fundamentally challenge the prevailing reliance on language supervision for learning general-purpose biomedical imaging encoders. We introduce RAD-DINO, a biomedical image encoder pre-trained solely on unimodal biomedical imaging data that obtains similar or greater performance than state-of-the-art biomedical language-supervised models on a diverse range of benchmarks. Specifically, the quality of learned representations is evaluated on standard imaging tasks (classification and semantic segmentation), and a vision-language alignment task (text report generation from images). To further demonstrate the drawback of language supervision, we show that features from RAD-DINO correlate with other medical records (e.g., sex or age) better than language-supervised models, which are generally not mentioned in radiology reports. Finally, we conduct a series of ablations determining the factors in RAD-DINO's performance; notably, we observe that RAD-DINO's downstream performance scales well with the quantity and diversity of training data, demonstrating that image-only supervision is a scalable approach for training a foundational biomedical image encoder. Model weights of RAD-DINO trained on publicly available datasets are available at https://huggingface.co/microsoft/rad-dino.

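MOTIVATION AND OBJECTIVES

Reduce the reliance on language supervision for biomedical image encoders, motivated by the scarcity of paired image-text data and concerns over leakage of personal health information (PHI).
Propose RAD-DINO, an image-only encoder that uses DINOv2 with masked image modelling (MIM) to learn both global and local features.
Evaluate RAD-DINO on image classification, semantic segmentation, and text report generation to test its unimodal and multimodal capabilities.
Show that image-only representations may correlate more strongly than language-supervised models with patient demographics and other information of the kind found in electronic health records.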

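PROPOSED METHOD

Pre-train RAD-DINO with DINOv2's hybrid objective: masked image modelling (MIM) for patch-level prediction, and image-level contrastive learning with multi-crop views.
Start from DINOv2 ViT-B and continue pre-training on a large, diverse collection of radiology images (Multi-CXR), with domain-transfer experiments from general-domain weights.
Run linear probing on external CXR datasets and compare against image-text and multimodal baselines (e.g., CLIP variants, BiomedCLIP, BioViL-T, MRM).
Evaluate on image classification (VinDr-CXR, CANDID-PTX, RSNA Pneumonia), semantic segmentation (CANDID-PTX and MIMIC-CXR-based datasets), and a vision-language task (text report generation on MIMIC-CXR).
Run ablations on the effect of input resolution, weight initialisation, and training data scale and diversity on downstream performance.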
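Linear probing means freezing the encoder and fitting only a linear classifier on its output features. The following is a minimal sketch of that idea in plain Python, not the paper's evaluation code: the two-dimensional `features` are synthetic stand-ins for frozen encoder outputs, and the function names, learning rate, and epoch count are hypothetical choices made for illustration.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed features with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            error = prob - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the linear head scores the feature vector above zero, else 0."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if logit > 0 else 0


# Assumption: synthetic 2-D "features" replace real frozen-encoder embeddings.
rng = random.Random(0)
labels = [0, 1] * 50
features = [[rng.gauss(2.0 * y - 1.0, 0.2), rng.gauss(0.0, 0.2)] for y in labels]

weights, bias = train_linear_probe(features, labels)
correct = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
accuracy = correct / len(labels)
print(f"probe accuracy on synthetic features: {accuracy:.2f}")
```

Because only `weights` and `bias` are updated, the probe's score reflects how linearly separable the encoder's features already are, which is why it serves as a comparison between frozen encoders.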

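EXPERIMENTAL RESULTS

Research Questions

RQ1: Can image-only self-supervised learning match or even exceed text-supervised biomedical encoders on standard imaging tasks?
RQ2: Does RAD-DINO scale well on global (image-level) and local (patch-level) tasks as training data size, data diversity, and input resolution increase?
RQ3: Do image-only encoders produce representations that align better with patient demographics and with clinical information not mentioned in reports?
RQ4: How do MIM and domain-transfer pre-training affect segmentation and vision-language generation performance?
RQ5: Can a purely image-based pre-training approach yield a unified foundational biomedical image encoder without relying on paired image-text data?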

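Key Findings

RAD-DINO matches or exceeds state-of-the-art language-supervised models on image classification and segmentation across a diverse range of biomedical benchmarks.
On VinDr-CXR, RAD-DINO achieves the highest Agg AUPRC (66.63) and outperforms CLIP and other baselines across findings.
On CANDID-PTX and RSNA Pneumonia, RAD-DINO achieves strong results, particularly for pneumothorax and chest tubes on CANDID-PTX.
For vision-language generation, the RAD-DINO-based encoder achieves higher ROUGE-L, BLEU-4, RG ER, and Macro-F1-14 scores, indicating more factual and clinically accurate generated findings.
Ablations show that performance improves with larger and more diverse training data and with higher input resolution; domain transfer from a general-domain model helps, but continued in-domain pre-training yields larger gains.
RAD-DINO features correlate more strongly than those of language-supervised models with broader clinical information (e.g., demographics), suggesting wider applicability to multimodal clinical tasks.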
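For readers unfamiliar with AUPRC, the sketch below shows one common way to compute it for a single finding: as average precision, the mean of the precision values at each rank where a positive example is retrieved. This is an illustrative implementation that assumes binary labels and no tied scores; the function name and toy data are hypothetical, and this review does not specify how the paper aggregates per-finding values into "Agg AUPRC".

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve for binary labels.

    Assumes labels are 0 or 1 and that no two scores are tied.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


# Toy example: positives are ranked 1st and 3rd out of four images.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.8333
```

Unlike accuracy, this metric depends only on the ranking of the scores, which makes it suitable for findings that are rare in the dataset.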

This review was generated by AI and reviewed by a human editor.