QUICK REVIEW

[Paper Review] Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Hu Cao, Yueyue Wang|arXiv (Cornell University)|May 12, 2021

Advanced Neural Network Applications | References: 31 | Cited by: 899

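One-Sentence Summary

Swin-Unet is a pure Transformer-based, U-shaped encoder-decoder with skip connections for 2D medical image segmentation that achieves state-of-the-art results on Synapse and performs strongly on ACDC without any convolutions.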

ABSTRACT

In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at https://github.com/HuCaoFighting/Swin-Unet.

Research Motivation and Objectives

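Motivation: CNNs struggle to capture global, long-range interactions in medical image segmentation.
Propose a pure Transformer-based, U-Net-like architecture (Swin-Unet) that models context from local to global.
Enable multi-scale feature learning through skip connections in a symmetric Transformer U-Net.
Introduce patch expanding for upsampling, with no convolution required.
Demonstrate robustness and generalization on multi-organ CT and cardiac MRI segmentation datasets.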
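
To make the idea of convolution-free upsampling concrete, here is a minimal sketch of the kind of channel-to-space rearrangement a patch expanding layer could use. It is an illustration under assumptions, not the authors' implementation: the review only says that patch expanding upsamples without convolution, so the 2x scale, the chunk ordering, and the helper name `expand_rearrange` are all hypothetical. The linear layer that would normally widen the channels before the rearrangement is left out.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as grid[row][col][channel]


def expand_rearrange(grid: Grid, scale: int = 2) -> Grid:
    """Rearrange (H, W, scale*scale*c) features into (scale*H, scale*W, c).

    Hypothetical helper: each token's channel vector is cut into
    scale*scale chunks of c values, and chunk (p1, p2) becomes the token
    at offset (p1, p2) inside a scale x scale block of the output grid.
    """
    h, w, channels = len(grid), len(grid[0]), len(grid[0][0])
    if channels % (scale * scale):
        raise ValueError("channels must be divisible by scale * scale")
    c = channels // (scale * scale)
    out: Grid = [[[] for _ in range(w * scale)] for _ in range(h * scale)]
    for i in range(h):
        for j in range(w):
            vec = grid[i][j]
            for p1 in range(scale):
                for p2 in range(scale):
                    start = (p1 * scale + p2) * c
                    out[i * scale + p1][j * scale + p2] = vec[start:start + c]
    return out


# One token with 4 channels becomes a 2 x 2 block of 1-channel tokens.
tokens = [[[1.0, 2.0, 3.0, 4.0]]]
expanded = expand_rearrange(tokens)
print(expanded)  # [[[1.0], [2.0]], [[3.0], [4.0]]]
```

No values are interpolated or convolved here. The spatial grid grows only because channel entries are moved into new spatial positions, which is why such a layer can be trained end to end with nothing more than a linear projection in front of it.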

Proposed Method

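Split the 2D medical image into non-overlapping 4x4 patches and embed them as token features.
Use a hierarchical Swin Transformer encoder with patch merging to learn multi-scale representations.
Use a symmetric Swin Transformer-based decoder with patch expanding layers for upsampling.
Add skip connections that fuse the encoder's multi-scale features with the decoder features.
Train with ImageNet pre-trained weights and standard SGD optimization; evaluate on the Synapse and ACDC datasets.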
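
The steps above can be sketched as a trace of feature-map shapes through the U-shaped network. This is a rough illustration rather than the released code. The 4x4 patch size and the final 4x up-sampling come from the text. The embedding width of 96, the three merging stages, the 224x224 input, and every function name below are assumptions made for the example.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels) of a token grid


def patch_partition(h: int, w: int, patch: int = 4, embed_dim: int = 96) -> Shape:
    """Split an H x W image into non-overlapping patch x patch tokens.

    embed_dim=96 is an assumed embedding width, not a value from the text.
    """
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (h // patch, w // patch, embed_dim)


def patch_merging(shape: Shape) -> Shape:
    """Encoder down-sampling: halve the resolution, double the channels."""
    h, w, c = shape
    return (h // 2, w // 2, 2 * c)


def patch_expanding(shape: Shape) -> Shape:
    """Decoder up-sampling: double the resolution, halve the channels."""
    h, w, c = shape
    return (2 * h, 2 * w, c // 2)


def trace_shapes(h: int, w: int, num_merges: int = 3) -> Tuple[List[Shape], List[Shape]]:
    """Return encoder and decoder feature shapes, stage by stage."""
    encoder = [patch_partition(h, w)]
    for _ in range(num_merges):  # assumed number of merging stages
        encoder.append(patch_merging(encoder[-1]))
    decoder = [encoder[-1]]  # the deepest encoder output is the bottleneck
    for skip in reversed(encoder[:-1]):
        up = patch_expanding(decoder[-1])
        if up != skip:
            raise ValueError("skip connection needs matching shapes")
        decoder.append(up)
    return encoder, decoder


def final_resolution(shape: Shape, patch: int = 4) -> Tuple[int, int]:
    """The last 4x expansion restores the input resolution."""
    h, w, _ = shape
    return (h * patch, w * patch)


encoder, decoder = trace_shapes(224, 224)
print(encoder)  # [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
print(decoder)  # [(7, 7, 768), (14, 14, 384), (28, 28, 192), (56, 56, 96)]
print(final_resolution(decoder[-1]))  # (224, 224)
```

The shape check inside `trace_shapes` shows why the decoder is built symmetrically: each up-sampled feature map must match the shape of the encoder feature it is fused with through the skip connection.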

Experimental Results

Research Questions

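RQ1: Can a pure Transformer-based U-Net (Swin-Unet) achieve competitive segmentation performance without any CNN components?
RQ2: How do patch merging (down-sampling) and patch expanding (up-sampling) affect segmentation accuracy and boundary precision?
RQ3: How do skip connections, input size, and model scale affect segmentation performance across organs and datasets?
RQ4: Does Swin-Unet generalize well across medical imaging modalities (CT and MRI) and tasks (multi-organ and cardiac segmentation)?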

Key Findings

On the Synapse dataset, Swin-Unet achieves the best DSC (79.13) and HD (21.55) among the evaluated methods.
Swin-Unet predicts boundaries well: its HD of 21.55 (lower is better) improves on several baselines.
On the ACDC dataset, Swin-Unet achieves an average DSC of 90.00, with RV 88.55, Myo 85.62, and LV 95.83, surpassing several baselines.
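Ablation studies show that patch expanding up-sampling outperforms bilinear interpolation and transposed convolution.
Increasing the input size from 224 to 384 improves DSC across organs on Synapse but raises computational cost; scaling the model beyond the Tiny size yields limited gains.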
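
For readers unfamiliar with the two metrics quoted above, the following sketch computes a Dice similarity coefficient (DSC) and a plain symmetric Hausdorff distance (HD) on tiny binary masks. It is illustrative only. The review does not describe the paper's evaluation protocol, so details such as per-organ averaging, percentile variants of HD, or physical voxel spacing are not modeled, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask, mask[row][col] in {0, 1}
Point = Tuple[int, int]


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice similarity coefficient: 2|A n B| / (|A| + |B|)."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * inter / total


def foreground(mask: Mask) -> List[Point]:
    """Coordinates of all foreground pixels in the mask."""
    return [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v]


def hausdorff(pred: Mask, target: Mask) -> float:
    """Symmetric Hausdorff distance between foreground pixel sets (in pixels)."""
    a, b = foreground(pred), foreground(target)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src: List[Point], dst: List[Point]) -> float:
        # Largest distance from any source pixel to its nearest target pixel.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]

print(round(dice_score(pred, target), 4))  # 0.6667
print(hausdorff(pred, target))             # 1.0
```

DSC rewards overlap, so higher is better, while HD measures the worst boundary disagreement, so lower is better. That is why a method can be reported as best on both when its DSC is the highest and its HD is the lowest.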

This review was generated by AI and reviewed by a human editor.