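[Paper Interpretation] 3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation
This paper proposes a holistic, parameter-efficient 2D-to-3D adaptation of SAM for volumetric medical tumor segmentation. It achieves state-of-the-art results with few tunable parameters and only one prompt per volume.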
Despite that the segment anything model (SAM) achieved impressive results on general-purpose semantic segmentation with strong generalization ability on daily images, its demonstrated performance on medical image segmentation is less precise and not stable, especially when dealing with tumor segmentation tasks that involve objects of small sizes, irregular shapes, and low contrast. Notably, the original SAM architecture is designed for 2D natural images, therefore would not be able to extract the 3D spatial information from volumetric medical data effectively. In this paper, we propose a novel adaptation method for transferring SAM from 2D to 3D for promptable medical image segmentation. Through a holistically designed scheme for architecture modification, we transfer the SAM to support volumetric inputs while retaining the majority of its pre-trained parameters for reuse. The fine-tuning process is conducted in a parameter-efficient manner, wherein most of the pre-trained parameters remain frozen, and only a few lightweight spatial adapters are introduced and tuned. Regardless of the domain gap between natural and medical data and the disparity in the spatial arrangement between 2D and 3D, the transformer trained on natural images can effectively capture the spatial patterns present in volumetric medical images with only lightweight adaptations. We conduct experiments on four open-source tumor segmentation datasets, and with a single click prompt, our model can outperform domain state-of-the-art medical image segmentation models on 3 out of 4 tasks, specifically by 8.25%, 29.87%, and 10.11% for kidney tumor, pancreas tumor, colon cancer segmentation, and achieve similar performance for liver tumor segmentation. We also compare our adaptation method with existing popular adapters, and observed significant performance improvement on most datasets.
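Motivation and Objectives
- Because of the gap in both domain and dimensionality, SAM performs poorly and unstably on 3D medical tumor segmentation.
- Adapt SAM holistically to handle volumetric inputs while reusing as many pre-trained weights as possible.
- Develop parameter-efficient fine-tuning with lightweight adapters to bridge the gap between 2D pre-training and 3D medical data.
- Improve robustness to prompts and maintain high segmentation accuracy across multiple tumor datasets.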
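Proposed Method
- Modify the image encoder to accept volumetric inputs while keeping the parameter count as close to the original as possible. The changes cover 3D patch embedding, 3D positional encoding, and 3D attention blocks with a memory-efficient strategy.
- Introduce lightweight spatial adapters for fine-tuning: most SAM weights are frozen, and only the adapters and normalization layers are trained.
- Replace positional prompt encoding with a visual sampler that takes the prompt embedding from the image feature map, and use a small set of global queries with cross-attention to mitigate token explosion and noise in the prompts.
- Replace the mask decoder with a lightweight 3D CNN with multi-layer aggregation to generate high-resolution 3D masks.
- Train with a single point prompt or a few point prompts per volume, sampling from both foreground and background to improve robustness to noisy prompts.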
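The last step, drawing one or a few click prompts per volume with occasional noisy clicks, can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' code: the function name `sample_point_prompts`, the `noise_prob` parameter, and the 0/1 label convention are assumptions made for this example.

```python
import random


def sample_point_prompts(mask, num_points=1, noise_prob=0.1, rng=None):
    """Sample click prompts (z, y, x, label) from a binary 3D mask.

    mask: nested list indexed as mask[z][y][x], 1 = tumor, 0 = background.
    noise_prob: hypothetical probability of drawing a background voxel
                instead of a foreground one, to mimic an imprecise click.
    label: 1 if the point came from the foreground, 0 otherwise.
    """
    rng = rng or random.Random()
    foreground, background = [], []
    for z, plane in enumerate(mask):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                (foreground if value else background).append((z, y, x))
    if not foreground:
        raise ValueError("mask contains no foreground voxels")

    prompts = []
    for _ in range(num_points):
        # Fall back to the foreground when the volume has no background voxels.
        use_background = bool(background) and rng.random() < noise_prob
        z, y, x = rng.choice(background if use_background else foreground)
        prompts.append((z, y, x, 0 if use_background else 1))
    return prompts


# Toy 2x2x2 volume with three tumor voxels; one prompt for the whole volume.
toy_mask = [[[0, 0], [0, 1]], [[1, 1], [0, 0]]]
print(sample_point_prompts(toy_mask, num_points=1, rng=random.Random(0)))
```

Setting `noise_prob=0.0` yields only foreground clicks, while larger values expose the model to misplaced prompts during training.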

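Experimental Results
Research Questions
- RQ1: Can a holistic 2D-to-3D adaptation of SAM retain pre-trained knowledge while effectively encoding the 3D spatial patterns of medical volumes?
- RQ2: Under limited prompts, does visual-sampler-based prompt encoding outperform conventional positional encoding in 3D medical segmentation?
- RQ3: How does multi-layer aggregation in the 3D mask decoder affect segmentation accuracy on small, low-contrast tumors?
- RQ4: Compared with full fine-tuning and other adapters, what is the trade-off between the number of tunable parameters and segmentation performance in volumetric tumor segmentation?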
Key Findings
| Method | Kidney tumor Dice | Kidney tumor NSD | Pancreas tumor Dice | Pancreas tumor NSD | Liver tumor Dice | Liver tumor NSD | Colon cancer Dice | Colon cancer NSD | #Tuned params |
|---|---|---|---|---|---|---|---|---|---|
| nnU-Net (Nat. Methods 2021) | 73.07 | 77.47 | 41.65 | 62.54 | 60.10 | 75.41 | 43.91 | 52.52 | 30.76M |
| TransBTS (MICCAI 2021) | 40.79 | 37.74 | 31.90 | 41.62 | 34.69 | 49.47 | 17.05 | 21.63 | 32.33M |
| nnFormer (arXiv 2021) | 45.14 | 42.28 | 36.53 | 53.97 | 45.54 | 60.67 | 24.28 | 32.19 | 149.49M |
| Swin-UNETR (CVPR 2022) | 65.54 | 72.04 | 40.57 | 60.05 | 50.26 | 64.32 | 35.21 | 42.94 | 62.19M |
| UNETR++ (arXiv 2022) | 56.49 | 60.04 | 37.25 | 53.59 | 37.13 | 51.99 | 25.36 | 30.68 | 55.70M |
| 3D UX-Net (ICLR 2023) | 57.59 | 58.55 | 34.83 | 52.56 | 45.54 | 60.67 | 28.50 | 32.73 | 53.01M |
| SAM-B (1 pt/slice) [4] | 36.30 | 29.86 | 24.01 | 26.74 | 6.71 | 7.63 | 28.83 | 33.63 | – |
| Ours (1 pt/volume) | 73.78 | 83.86 | 54.09 | 76.27 | 54.78 | 69.55 | 48.35 | 63.65 | 25.46M |
| Ours (3 pts/volume) | 74.91 | 84.35 | 54.92 | 77.57 | 56.30 | 70.02 | 49.43 | 65.02 | 25.46M |
| Ours (10 pts/volume) | 75.95 | 84.92 | 57.47 | 79.62 | 56.61 | 69.52 | 49.99 | 65.67 | 25.46M |
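For readers unfamiliar with the metrics: Dice measures the volumetric overlap between the predicted and reference masks, and NSD (normalized surface Dice) measures agreement between their boundaries. The values in the table appear to be percentages. A minimal sketch of Dice on sets of voxel coordinates follows; the set-based representation and the empty-mask convention are choices made for this illustration only.

```python
def dice_score(pred_voxels, true_voxels):
    """Dice = 2 * |P intersect T| / (|P| + |T|) on sets of (z, y, x) voxels."""
    pred, true = set(pred_voxels), set(true_voxels)
    if not pred and not true:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(pred & true) / (len(pred) + len(true))


# Two toy masks of three voxels each, sharing two voxels.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
true = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
print(f"{100 * dice_score(pred, true):.2f}")  # 2*2 / (3+3) -> 66.67
```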
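- With 1 prompt per volume, our method achieves Dice / NSD of 73.78 / 83.86 on kidney tumor, 54.09 / 76.27 on pancreas tumor, 54.78 / 69.55 on liver tumor, and 48.35 / 63.65 on colon cancer.
- With 3 prompts per volume, Dice / NSD rise to 74.91 / 84.35 (kidney), 54.92 / 77.57 (pancreas), 56.30 / 70.02 (liver), and 49.43 / 65.02 (colon), showing that more prompts improve performance (see table).
- Compared with state-of-the-art baselines (nnU-Net and others), the proposed method is better on three of the four datasets, with the largest gains on pancreas tumor and colon cancer. On liver tumor it trails nnU-Net (54.78 vs. 60.10 Dice).
- Tuning only 16.96% of the original model's parameters, our method surpasses existing adapters and the full fine-tuning baseline, which shows high parameter efficiency.
- Visual-sampler prompt encoding clearly outperforms positional encoding (a Dice gain of about 40% in the KiTS21 ablation).
- Multi-layer aggregation in the mask decoder yields a Dice gain of about 15.75% over the variant without aggregation.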
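The improvements quoted in the abstract (8.25%, 29.87%, 10.11%) can be checked against the table. The abstract does not state which metric each figure refers to. The pairing below (NSD for kidney tumor, Dice for pancreas tumor and colon cancer, each relative to nnU-Net) is an assumption that happens to reproduce the three numbers, and `relative_gain` is a helper written for this check.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (Ours with 1 pt/volume, nnU-Net) pairs copied from the table above.
pairs = {
    "kidney tumor (NSD)": (83.86, 77.47),
    "pancreas tumor (Dice)": (54.09, 41.65),
    "colon cancer (Dice)": (48.35, 43.91),
    "liver tumor (Dice)": (54.78, 60.10),
}
for task, (ours, baseline) in pairs.items():
    print(f"{task}: {relative_gain(ours, baseline):+.2f}%")
# kidney +8.25%, pancreas +29.87%, colon +10.11%, liver -8.85%
```

The figures are relative gains, not absolute differences in Dice or NSD points. The liver row shows the one task where nnU-Net remains ahead on Dice.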

This interpretation was generated by AI and reviewed by human editors.