QUICK REVIEW

[Paper Review] UNETR: Transformers for 3D Medical Image Segmentation

Ali Hatamizadeh, Yucheng Tang | arXiv (Cornell University) | Mar 18, 2021

Radiomics and Machine Learning in Medical Imaging | References: 52 | Cited by: 210

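One-Sentence Summary

UNETR uses a transformer encoder to process a 3D medical volume as a sequence of patches and connects it through skip connections to a CNN-based decoder for accurate 3D segmentation, achieving state-of-the-art results on the BTCV and MSD datasets.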

ABSTRACT

Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr

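Motivation and Objectives

Motivate the use of transformers in medical image segmentation to capture long-range 3D context.
Propose the UNETR architecture, which connects a transformer encoder directly to a CNN decoder through skip connections.
Demonstrate its effectiveness on BTCV multi-organ segmentation and on MSD brain tumor and spleen segmentation.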

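Proposed Method

Represent the 3D volume as non-overlapping patches and project each patch to a K-dimensional embedding.
Process the patch sequence with a ViT-B16-style transformer encoder (L=12 layers, K=768, patch size 16^3).
Add 1D positional embeddings; omit the class token because the task is semantic segmentation.
Extract intermediate transformer representations (z3, z6, z9, z12), reshape them into spatial tensors, and fuse them with the CNN-based decoder through skip connections.
Project the transformer features into the decoder at multiple resolutions with 3x3x3 convolutions; upsample with deconvolution (transposed convolution); apply a final 1x1x1 convolution with softmax for voxel-wise prediction.
Train with a combined soft Dice and cross-entropy loss; run patch-based sliding-window inference with an overlap of 0.5.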
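
The shape bookkeeping behind the tokenization and sliding-window steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the 96^3 crop size, and the 240-voxel axis length are hypothetical, and `window_starts` uses a simplified stride scheme that may differ from the one in the released code. Only the patch size (16), the embedding size K=768, and the 0.5 overlap come from the summary.

```python
from math import prod


def token_grid(volume_shape, patch_size=16):
    """Grid of non-overlapping cubic patches, e.g. (96, 96, 96) -> (6, 6, 6)."""
    if any(s % patch_size for s in volume_shape):
        raise ValueError("each dimension must be divisible by patch_size")
    return tuple(s // patch_size for s in volume_shape)


def sequence_shape(volume_shape, patch_size=16, embed_dim=768):
    """(N, K): N patch tokens, each projected to a K-dimensional embedding."""
    return (prod(token_grid(volume_shape, patch_size)), embed_dim)


def spatial_shape(volume_shape, patch_size=16, embed_dim=768):
    """Shape of z3/z6/z9/z12 once the (N, K) sequence is reshaped to a grid."""
    return token_grid(volume_shape, patch_size) + (embed_dim,)


def window_starts(length, window, overlap=0.5):
    """Start indices of sliding windows along one axis (simplified scheme)."""
    if window >= length:
        return [0]
    stride = max(1, int(window * (1 - overlap)))
    starts = list(range(0, length - window, stride))
    starts.append(length - window)  # align the final window with the end
    return starts


crop = (96, 96, 96)  # hypothetical input crop, not taken from the summary
print(sequence_shape(crop))    # (216, 768)
print(spatial_shape(crop))     # (6, 6, 6, 768)
print(window_starts(240, 96))  # [0, 48, 96, 144]
```

Because every intermediate representation has the same (N, K) shape, all four skip features start at the same coarse grid; the decoder's deconvolutions are what bring them to different resolutions.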

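Experimental Results

Research Questions

RQ1: Can a transformer encoder trained on 3D patches capture the long-range dependencies in volumetric medical images that segmentation requires?
RQ2: Does connecting transformer-derived features to a CNN-based decoder through multi-resolution skip connections yield higher segmentation accuracy than CNN-only or transformer-only baselines?
RQ3: How do decoder design, patch resolution, and model size affect 3D medical image segmentation performance?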

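Key Findings

UNETR achieves state-of-the-art performance in both the Standard and Free competitions on the BTCV leaderboard.
On BTCV, the average Dice score is clearly higher than the baselines, with notable gains on small organs such as the gallbladder and adrenal glands.
On MSD, UNETR achieves higher Dice scores than the strongest baselines on both the brain tumor subregions and the spleen.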
The model has about 92.58M parameters and 41.19G FLOPs, and its inference time (about 12.08 seconds on average) is competitive with other transformer-based methods.
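
The findings above are stated in terms of Dice scores. As a reminder of what that metric measures, here is a minimal sketch of the per-label Dice coefficient on flattened label maps. The function names and the toy data are hypothetical, and treating a label that is absent from both maps as a perfect score is an assumption of this sketch, not something the summary specifies.

```python
def dice_score(pred, target, label):
    """Dice overlap for one label over two flat sequences of voxel labels."""
    pred_mask = [p == label for p in pred]
    target_mask = [t == label for t in target]
    intersection = sum(p and t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    if total == 0:
        return 1.0  # assumption: label absent in both maps counts as perfect
    return 2.0 * intersection / total


def mean_dice(pred, target, labels):
    """Unweighted average of the per-label Dice scores."""
    return sum(dice_score(pred, target, lab) for lab in labels) / len(labels)


# Toy flattened label maps: 0 = background, 1 and 2 = hypothetical organs.
target = [0, 1, 1, 1, 1, 2, 2, 0]
pred = [0, 1, 1, 1, 0, 2, 0, 0]

print(round(dice_score(pred, target, 1), 3))      # 0.857
print(round(dice_score(pred, target, 2), 3))      # 0.667
print(round(mean_dice(pred, target, [1, 2]), 3))  # 0.762
```

In the toy data each organ is missing exactly one voxel, yet the two-voxel organ loses far more Dice than the four-voxel one, which is why gains on small structures are harder to achieve.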


This review was generated by AI and checked by a human editor.