QUICK REVIEW

[Paper Review] Contrastive Learning of Medical Visual Representations from Paired Images and Text

Yuhao Zhang, Hang Jiang|arXiv (Cornell University)|Oct 2, 2020

Topic: Multimodal Machine Learning Applications | References: 38 | Cited by: 278

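ONE-SENTENCE SUMMARY

ConVIRT pretrains medical image encoders with bidirectional image-text contrastive learning on paired reports, yielding better in-domain representations and higher data efficiency than ImageNet initialization and other baselines.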

ABSTRACT

Learning visual representations of medical images (e.g., X-rays) is core to medical image understanding but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies show exciting results from unsupervised contrastive learning from natural images, but we find these methods help little on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text. Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test ConVIRT by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.

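RESEARCH MOTIVATION AND GOALS

Learn high-quality medical image representations despite the scarcity of annotated data in healthcare.
Exploit naturally paired medical images and descriptive reports to improve visual encoders without additional expert annotation.
Evaluate how well ConVIRT-pretrained encoders transfer to a range of medical imaging tasks and retrieval settings.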

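PROPOSED METHOD

Encode images and text into d-dimensional vectors using modality-specific encoders and projection heads.
Train with a bidirectional contrastive objective made up of two asymmetric losses, image-to-text and text-to-image, combined as a weighted sum.
Sample random image views and text spans to create diverse positive pairs for contrastive learning.
Pretrain the image encoder (ResNet50) and the text encoder (the BERT-based ClinicalBERT) on paired data from MIMIC-CXR and a musculoskeletal dataset.
Apply data augmentations suited to medical images (cropping, flipping, affine transformation, color jittering, Gaussian blur) together with sentence-level text sampling.
Evaluate the pretrained encoders on four medical classification tasks under linear classification and fine-tuning, and perform zero-shot image-image and text-image retrieval.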
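
The bidirectional objective described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes cosine similarity between projected embeddings, a softmax over the other items in the minibatch as negatives, and a scalar weight between the two directions. The function name, the parameter names `temperature` and `weight`, and their default values are placeholders chosen for this sketch.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.1, weight=0.75):
    """Weighted sum of image-to-text and text-to-image contrastive losses.

    image_emb, text_emb: arrays of shape (N, d); row i of each forms a true pair.
    temperature, weight: illustrative hyperparameters (assumed defaults).
    """
    v = l2_normalize(np.asarray(image_emb, dtype=float))
    u = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = v @ u.T / temperature          # entry [i, k] compares image i with text k
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image must pick its own text among all texts (row-wise softmax).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx]
    # Text-to-image: each text must pick its own image among all images (column-wise softmax).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx]
    return float(np.mean(weight * loss_i2t + (1.0 - weight) * loss_t2i))


rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))
aligned_texts = images + 0.01 * rng.normal(size=(8, 16))   # texts close to their images
shuffled_texts = np.roll(aligned_texts, shift=1, axis=0)   # pairs deliberately mismatched
print(bidirectional_contrastive_loss(images, aligned_texts))
print(bidirectional_contrastive_loss(images, shuffled_texts))
```

The two losses differ only in the axis of the softmax, which is why they are not symmetric: one normalizes over candidate texts for a fixed image, the other over candidate images for a fixed text. Correctly paired embeddings should give a lower loss than mismatched ones.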

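EXPERIMENTAL RESULTS

Research Questions

RQ1: Does cross-modal contrastive learning between medical images and their paired descriptive text produce better visual representations than image-only pretraining or random initialization?
RQ2: Does ConVIRT improve data efficiency, reaching competitive performance with substantially less labeled data than ImageNet-pretrained models?
RQ3: How well do ConVIRT representations transfer across diverse medical imaging tasks and zero-shot retrieval settings?

Key Findings

Across the four classification tasks, ConVIRT generally outperforms random initialization, ImageNet initialization, and in-domain baselines in both the linear classification and fine-tuning settings.
On three of the four tasks, ConVIRT with only 1% of the labeled data matches or exceeds ImageNet initialization trained on 100% of the data.
In zero-shot retrieval, ConVIRT achieves the best Precision@k on both the image-image and text-image tasks.
Compared with image-only contrastive methods (SimCLR, MoCo v2), ConVIRT gains substantially from using the paired text.
Saliency analysis indicates that ConVIRT attends to more relevant anatomical regions than ImageNet initialization or the other baselines.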
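
For readers unfamiliar with the retrieval metric mentioned above, here is a small sketch of how Precision@k could be computed from embeddings. It assumes candidates are ranked by cosine similarity to the query and that a retrieved item counts as relevant when it shares the query's category label; both are assumptions made for illustration, and the function and argument names are hypothetical.

```python
import numpy as np


def precision_at_k(query_emb, candidate_emb, query_labels, candidate_labels, k):
    """Mean fraction of the top-k retrieved candidates that share the query's label."""
    q = np.asarray(query_emb, dtype=float)
    c = np.asarray(candidate_emb, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = q @ c.T                                  # cosine similarity, shape (Q, C)
    top_k = np.argsort(-sims, axis=1)[:, :k]        # indices of the k most similar candidates
    hits = np.asarray(candidate_labels)[top_k] == np.asarray(query_labels)[:, None]
    return float(hits.mean())


# Toy example: two queries, four candidates, two categories.
queries = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0]]
print(precision_at_k(queries, candidates, [0, 1], [0, 0, 1, 1], k=2))
```

The same function covers both retrieval settings in principle: queries would be image embeddings for image-image retrieval and text embeddings for text-image retrieval, with candidates always drawn from the image encoder.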


This review was generated by AI and reviewed by human editors.