QUICK REVIEW

[Paper Review] How to Efficiently Adapt Large Segmentation Model(SAM) to Medical Images

Xinrong Hu, Xiaowei Xu|arXiv (Cornell University)|Jun 23, 2023

Advanced Neural Network Applications | Cited by 29

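One-Sentence Summary

This paper freezes the SAM encoder and fine-tunes only a lightweight, prompt-free prediction head (a ViT head named AutoSAM, a CNN, or linear layers) to adapt SAM to medical image segmentation, which enables label-efficient few-shot learning and multi-class masks without prompts.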

ABSTRACT

The emerging scale segmentation model, Segment Anything (SAM), exhibits impressive capabilities in zero-shot segmentation for natural images. However, when applied to medical images, SAM suffers from a noticeable performance drop. To make SAM a real "foundation model" for the computer vision community, it is critical to find an efficient way to customize SAM for medical image datasets. In this work, we propose to freeze the SAM encoder and finetune a lightweight task-specific prediction head, as most of the weights in SAM are contributed by the encoder. In addition, SAM is a promptable model, while a prompt is not necessarily available in all application cases, and precise prompts for multiple-class segmentation are also time-consuming. Therefore, we explore three types of prompt-free prediction heads in this work, including ViT, CNN, and linear layers. For the ViT head, we remove the prompt tokens in the mask decoder of SAM, which is named AutoSAM. AutoSAM can also generate masks for different classes with one single inference after modification. To evaluate the label-efficiency of our finetuning method, we compare the results of these three prediction heads on a public medical image segmentation dataset with limited labeled data. Experiments demonstrate that finetuning SAM significantly improves its performance on the medical image dataset, even with just one labeled volume. Moreover, the AutoSAM and CNN prediction heads also have better segmentation accuracy than training from scratch and self-supervised learning approaches when there is a shortage of annotations.

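Motivation and Goals

Motivate the need to adapt SAM, a foundation model for natural images, to the medical imaging domain.
Propose a lightweight fine-tuning strategy: freeze the SAM encoder and add a prediction head for prompt-free multi-class segmentation.
Evaluate three head architectures (the ViT-based AutoSAM, CNN, and Linear) with limited labeled data.
Demonstrate label-efficient improvements over training from scratch and self-supervised baselines on a public medical imaging dataset.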

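Proposed Method

Freeze the SAM encoder weights and attach a lightweight, task-specific prediction head, which is the only part that is fine-tuned.
Replace the SAM mask decoder with a prompt-free head; in AutoSAM, multi-class masks are produced by duplicating the embeddings for each class.
Evaluate three head architectures: the ViT-based AutoSAM, a CNN-based head (a UNet-style decoder), and a Linear head.
Train on a small number of labeled volumes (1 or 5) with a combined cross-entropy and Dice loss.
Compare against a UNet trained from scratch, SimCLR-based self-supervised pretraining, and the original zero-shot SAM with box prompts.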
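
The combined cross-entropy and Dice objective mentioned above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the equal weighting (`dice_weight=0.5`), the smoothing constant `eps`, and all function names are assumptions chosen for illustration, and the summary does not state how the two terms are actually weighted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class over all pixels."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def soft_dice_loss(probs, labels, num_classes, eps=1e-6):
    """One minus the soft Dice score, averaged over classes."""
    loss = 0.0
    for c in range(num_classes):
        pred = [p[c] for p in probs]
        target = [1.0 if y == c else 0.0 for y in labels]
        inter = sum(p * t for p, t in zip(pred, target))
        denom = sum(pred) + sum(target)
        loss += 1.0 - (2.0 * inter + eps) / (denom + eps)
    return loss / num_classes

def combined_loss(logits, labels, num_classes, dice_weight=0.5):
    """Weighted sum of cross-entropy and soft Dice (the weight is an assumption)."""
    probs = [softmax(z) for z in logits]
    ce = cross_entropy(probs, labels)
    dice = soft_dice_loss(probs, labels, num_classes)
    return (1.0 - dice_weight) * ce + dice_weight * dice

# Toy example: four pixels, three classes.
labels = [0, 1, 2, 0]
logits_good = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [5.0, 0.0, 0.0]]
logits_bad = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
loss_good = combined_loss(logits_good, labels, 3)
loss_bad = combined_loss(logits_bad, labels, 3)
```

Predictions that match the labels give a much smaller loss than predictions that miss every pixel. In a real training loop only the prediction head would receive gradients from this loss, since the encoder is frozen.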

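Experimental Results

Research Questions

RQ1: With limited labeled data, can freezing the SAM encoder and adding a lightweight, prompt-free head achieve competitive medical image segmentation?
RQ2: Which head architecture (AutoSAM ViT, CNN, or Linear) performs best in the few-shot setting?
RQ3: Does AutoSAM achieve efficient, prompt-free multi-class segmentation across different medical datasets?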

Key Findings

Method	RV Dice%	Myo Dice%	LV Dice%	Mean Dice%	ASSD
UNET	13.45 ± 1.89	16.24 ± 4.14	22.95 ± 0.47	17.55 ± 2.05	51.55 ± 6.42
UNET + SimCLR	14.25 ± 6.52	19.40 ± 6.36	27.54 ± 9.80	20.40 ± 3.95	33.14 ± 4.39
Encoder + LN	0.00 ± 0.00	20.42 ± 13.20	48.40 ± 22.50	22.94 ± 12.32	49.38 ± 12.32
Encoder + CNN	30.66 ± 14.28	39.96 ± 8.14	50.55 ± 13.56	40.39 ± 11.90	38.13 ± 16.42
AutoSAM (ft all)	17.10 ± 9.76	30.05 ± 7.77	43.82 ± 13.91	30.32 ± 10.05	25.93 ± 1.94
AutoSAM	31.66 ± 13.26	33.49 ± 9.23	52.83 ± 16.49	39.32 ± 12.82	23.59 ± 2.07
sup w/ UNET	40.36 ± 2.36	52.23 ± 3.80	62.91 ± 5.58	51.83 ± 3.41	32.28 ± 1.40
5 volumes / UNET + SimCLR	45.48 ± 4.65	58.20 ± 6.12	68.95 ± 3.88	57.18 ± 3.20	28.98 ± 7.13
5 volumes / Encoder + LN	22.07 ± 11.2	37.38 ± 11.56	33.69 ± 27.63	31.05 ± 16.14	-
5 volumes / Encoder + CNN	59.87 ± 1.86	62.81 ± 2.82	78.96 ± 2.79	67.21 ± 1.32	25.46 ± 11.14
5 volumes / AutoSAM (ft all)	22.43 ± 18.03	37.08 ± 13.49	53.75 ± 15.08	37.76 ± 15.22	24.44 ± 9.92
5 volumes / AutoSAM	58.48 ± 3.90	62.18 ± 2.97	80.58 ± 1.42	67.08 ± 2.56	17.54 ± 3.65
5 volumes / unsup SAM (box)	53.57 ± 0.86	39.60 ± 0.65	0.00 ± 0.00	31.06 ± 0.41	7.83 ± 0.67
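
A quick arithmetic check helps when reading the table: in each of the first six rows, the fourth numeric value equals the mean of the first three (up to rounding), which suggests that it is an average over three per-class Dice scores. The sketch below copies those values from the rows above; the variable and function names are illustrative only.

```python
# First three numeric values of each row, paired with the fourth value of that row.
rows = {
    "UNET": ([13.45, 16.24, 22.95], 17.55),
    "UNET + SimCLR": ([14.25, 19.40, 27.54], 20.40),
    "Encoder + LN": ([0.00, 20.42, 48.40], 22.94),
    "Encoder + CNN": ([30.66, 39.96, 50.55], 40.39),
    "AutoSAM (ft all)": ([17.10, 30.05, 43.82], 30.32),
    "AutoSAM": ([31.66, 33.49, 52.83], 39.32),
}

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Absolute difference between the recomputed mean and the reported fourth value.
deviations = {name: abs(mean(vals) - fourth) for name, (vals, fourth) in rows.items()}
max_deviation = max(deviations.values())

for name, dev in deviations.items():
    print(f"{name}: deviation {dev:.4f}")
```

Every deviation stays below 0.01, which is consistent with rounding to two decimals. Most of the remaining rows follow the same pattern; the "5 volumes / UNET + SimCLR" row is an exception, where the recomputed mean differs from the reported value by about 0.36.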

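Fine-tuning a lightweight head on top of the frozen SAM encoder significantly improves medical segmentation performance, even with just one labeled volume.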
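In the low-data regime, AutoSAM and the CNN head outperform both training from scratch and SimCLR self-supervised pretraining, while the Linear head underfits and performs poorly.
AutoSAM (the ViT-based head) and the CNN head reach higher Dice scores than the other baselines, and AutoSAM tends to give better ASSD.
A larger SAM encoder (vit-h) generally improves results, although AutoSAM is less sensitive to encoder size than Encoder + CNN.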
With 5 labeled volumes, AutoSAM and the CNN head remain ahead of the other methods, especially on Dice.
AutoSAM can generate masks for multiple classes in a single inference by duplicating the embeddings for each class.
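
To make the last point concrete, the sketch below shows one way per-class outputs from a single inference could be merged into one label map. The summary does not say how AutoSAM combines its per-class masks, so the argmax rule, the function name `merge_class_masks`, the class-index order, and the toy scores are all assumptions for illustration.

```python
def merge_class_masks(class_scores):
    """Merge per-class score maps (one 2D list per class) into a single label map.

    Each pixel is assigned the index of the class with the highest score.
    """
    num_classes = len(class_scores)
    height = len(class_scores[0])
    width = len(class_scores[0][0])
    label_map = []
    for i in range(height):
        row = []
        for j in range(width):
            scores = [class_scores[c][i][j] for c in range(num_classes)]
            row.append(scores.index(max(scores)))
        label_map.append(row)
    return label_map

# Hypothetical 2x3 score maps: background plus the three classes named in the table.
scores = [
    [[0.90, 0.10, 0.10], [0.20, 0.10, 0.80]],  # class 0: background (assumed)
    [[0.05, 0.70, 0.10], [0.10, 0.10, 0.10]],  # class 1: RV (assumed index)
    [[0.03, 0.10, 0.60], [0.60, 0.20, 0.05]],  # class 2: Myo (assumed index)
    [[0.02, 0.10, 0.20], [0.10, 0.60, 0.05]],  # class 3: LV (assumed index)
]
label_map = merge_class_masks(scores)
print(label_map)
```

Because all class maps come from the same forward pass, no per-class prompt or repeated inference is needed to obtain the full segmentation.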


This review was generated by AI and reviewed by a human editor.