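[Paper Explained] CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation
CoTr combines a CNN encoder with a deformable Transformer to efficiently model long-range context for 3D medical image segmentation. It achieves state-of-the-art performance on the 11-organ BCV segmentation task while handling high-resolution, multi-scale feature maps.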
Convolutional neural networks (CNNs) have been the de facto standard for nowadays 3D medical image segmentation. The convolutional operations used in these networks, however, inevitably have limitations in modeling the long-range dependency due to their inductive bias of locality and weight sharing. Although Transformer was born to address this issue, it suffers from extreme computational and spatial complexities in processing high-resolution 3D feature maps. In this paper, we propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation. Under this framework, the CNN is constructed to extract feature representations and an efficient deformable Transformer (DeTrans) is built to model the long-range dependency on the extracted feature maps. Different from the vanilla Transformer which treats all image positions equally, our DeTrans pays attention only to a small set of key positions by introducing the deformable self-attention mechanism. Thus, the computational and spatial complexities of DeTrans have been greatly reduced, making it possible to process the multi-scale and high-resolution feature maps, which are usually of paramount importance for image segmentation. We conduct an extensive evaluation on the Multi-Atlas Labeling Beyond the Cranial Vault (BCV) dataset that covers 11 major human organs. The results indicate that our CoTr leads to a substantial performance improvement over other CNN-based, transformer-based, and hybrid methods on the 3D multi-organ segmentation task. Code is available at https://github.com/YtongXie/CoTr
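To make the complexity argument concrete, here is a rough back-of-the-envelope count. It is a sketch, not the paper's own analysis. It compares how many query-position pairs vanilla self-attention touches with how many a deformable variant samples, ignoring attention heads and multiple scales. The downsampling stride of 8 and the 4 sampled points per query are illustrative assumptions; only the 48×192×192 volume size is quoted elsewhere in this summary.

```python
from math import prod

def num_tokens(volume_shape, stride):
    """Number of positions in a feature map downsampled by `stride` along every axis."""
    return prod(dim // stride for dim in volume_shape)

def full_attention_pairs(n_tokens):
    """Vanilla self-attention: every query attends to every position."""
    return n_tokens * n_tokens

def deformable_attention_pairs(n_tokens, points_per_query):
    """Deformable self-attention: every query attends to a small, fixed number of sampled positions."""
    return n_tokens * points_per_query

volume = (48, 192, 192)  # volume size quoted in this summary
stride = 8               # assumed downsampling factor (hypothetical)
k = 4                    # assumed sampled points per query (hypothetical)

n = num_tokens(volume, stride)
print(n, full_attention_pairs(n), deformable_attention_pairs(n, k))
# prints: 3456 11943936 13824
```

Under these assumptions the vanilla count is 864 times larger (n / k), and the gap widens as the feature map gets finer, because one count grows quadratically with the number of positions and the other linearly.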
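Research Motivation and Objectives
- Pair the long-range modeling of Transformers with the local inductive bias of CNNs for 3D medical image segmentation.
- Develop an efficient deformable Transformer (DeTrans) that models long-range dependencies on multi-scale feature maps.
- Design a CNN-encoder, DeTrans-encoder, and decoder architecture that captures global context while preserving high-resolution detail.
- Show on the BCV dataset that its segmentation performance exceeds that of CNN-based, Transformer-based, and other hybrid methods.
Proposed Method
- Use a CNN-encoder to extract multi-scale 3D feature maps.
- Introduce a DeTrans-encoder with multi-scale deformable self-attention to capture long-range dependencies efficiently.
- Flatten the CNN features, add 3D positional encoding, and process them with DeTrans layers.
- Apply multi-head deformable self-attention with a limited number of sampling points to reduce complexity.
- Fuse the DeTrans outputs through a CNN-based decoder with skip connections and deep supervision.
- Optimize with a combined Dice and cross-entropy loss; use data augmentation and instance normalization.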
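The idea of attending to only a few sampled points can be sketched in a few lines. The example below is a deliberately simplified, one-dimensional, single-head, single-scale illustration in plain Python. It is not the authors' implementation. It assumes the usual deformable-attention recipe, in which sampling offsets and attention weights are both predicted from the query feature by linear projections; this summary does not spell that detail out. All function names and the toy parameters are hypothetical.

```python
import math

def linear(x, weight, bias):
    """Affine map y = weight @ x + bias for one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax over a short list."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_linear(values, pos):
    """Linearly interpolate feature vectors at a fractional index, clamped to the sequence."""
    pos = min(max(pos, 0.0), len(values) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return [(1.0 - frac) * a + frac * b for a, b in zip(values[lo], values[hi])]

def deformable_attention_1d(tokens, w_off, b_off, w_att, b_att, w_val, b_val):
    """Each query mixes only K sampled positions instead of all len(tokens) positions."""
    values = [linear(x, w_val, b_val) for x in tokens]
    outputs = []
    for q, x in enumerate(tokens):
        offsets = linear(x, w_off, b_off)           # K offsets relative to the query position
        weights = softmax(linear(x, w_att, b_att))  # K attention weights that sum to 1
        mixed = [0.0] * len(values[0])
        for offset, weight in zip(offsets, weights):
            sampled = sample_linear(values, q + offset)
            mixed = [m + weight * s for m, s in zip(mixed, sampled)]
        outputs.append(mixed)
    return outputs

# Toy usage with hand-picked (hypothetical) parameters:
# K = 2 sampling points, both placed half a step to the right of the query.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = deformable_attention_1d(tokens, zeros, [0.5, 0.5], zeros, [0.0, 0.0], identity, [0.0, 0.0])
print(out)
```

The per-query cost depends on K, not on the sequence length, which is what makes long flattened feature maps tractable. A 3D multi-scale version would replace the linear interpolation with trilinear sampling on each feature level and add multiple heads and an output projection.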
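Experimental Results
Research Questions
- RQ1: In 3D medical image segmentation, can a lightweight hybrid CNN-Transformer encoder enhanced with deformable self-attention outperform pure-CNN or pure-Transformer methods?
- RQ2: Can multi-scale deformable self-attention achieve effective long-range modeling on high-resolution 3D feature maps?
- RQ3: How do the DeTrans hyperparameters and multi-scale feature integration affect segmentation performance?
- RQ4: How does CoTr compare with existing CNN, Transformer, and hybrid methods on the BCV multi-organ segmentation task?
Key Findings
- CoTr outperforms CNN-only, Transformer-only, and other hybrid baselines on BCV 3D multi-organ segmentation.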
- CoTr with deformable self-attention enables processing of multi-scale high-resolution feature maps with reduced computational and spatial complexity.
- CoTr variants with smaller CNN encoders (CoTr*, CoTr†) achieve strong results, showing the benefit of the hybrid encoder over purely Transformer-based encoders.
- Replacing DeTrans with traditional context modules (ASPP, PP, Non-local) leads to lower Dice scores, highlighting the advantage of deformable Transformers.
- CoTr consistently improves average Dice across 11 organs, notably for gallbladder and pancreas, and achieves competitive or superior performance to TransUNet in 3D settings.
- Efficiency: training takes about 2 days on a GTX 2080Ti, and inference takes under 30 ms per 48×192×192 volume.
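For readers unfamiliar with the metric behind these comparisons, the Dice score for one organ is 2|P ∩ T| / (|P| + |T|), where P and T are the predicted and ground-truth voxel sets, and the reported average is the mean over organs. The snippet below is a minimal sketch on flattened toy label maps. The helper names and the toy data are hypothetical, and treating an organ that is absent from both maps as a perfect score is an assumed convention, not something stated in this summary.

```python
def dice_score(pred, target, label):
    """Dice = 2|P ∩ T| / (|P| + |T|) for one label over flattened label maps."""
    p = [v == label for v in pred]
    t = [v == label for v in target]
    intersection = sum(a and b for a, b in zip(p, t))
    denominator = sum(p) + sum(t)
    # Assumed convention: a label absent from both maps counts as a perfect match.
    return 2.0 * intersection / denominator if denominator else 1.0

def mean_dice(pred, target, labels):
    """Per-label Dice scores and their unweighted mean."""
    scores = {label: dice_score(pred, target, label) for label in labels}
    return scores, sum(scores.values()) / len(scores)

# Toy label maps (0 = background, 1 and 2 = two hypothetical organs)
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]
scores, average = mean_dice(pred, target, labels=[1, 2])
print(scores, average)
```

Because the mean is unweighted, small organs such as the gallbladder or pancreas influence the average as much as large ones, which is why gains on them move the headline number.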
This explainer was generated by AI and reviewed by a human editor.