QUICK REVIEW

[Paper Review] LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation

Guoping Xu, Xingrong Wu|arXiv (Cornell University)|Jul 19, 2021

Advanced Neural Network Applications | References: 39 | Cited by: 54

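One-Sentence Summary

LeViT-UNet pairs a LeViT-based transformer encoder with a U-Net-style decoder and fuses multi-scale features from transformer and CNN blocks for fast, accurate 2D medical image segmentation, showing competitive accuracy and improved boundary prediction on Synapse and strong generalization on ACDC.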

ABSTRACT

Medical image segmentation plays an essential role in developing computer-assisted diagnosis and therapy systems, yet still faces many challenges. In the past few years, the popular encoder-decoder architectures based on CNNs (e.g., U-Net) have been successfully applied in the task of medical image segmentation. However, due to the locality of convolution operations, they demonstrate limitations in learning global context and long-range spatial relations. Recently, several researchers try to introduce transformers to both the encoder and decoder components with promising results, but the efficiency requires further improvement due to the high computational complexity of transformers. In this paper, we propose LeViT-UNet, which integrates a LeViT Transformer module into the U-Net architecture, for fast and accurate medical image segmentation. Specifically, we use LeViT as the encoder of the LeViT-UNet, which better trades off the accuracy and efficiency of the Transformer block. Moreover, multi-scale feature maps from transformer blocks and convolutional blocks of LeViT are passed into the decoder via skip-connection, which can effectively reuse the spatial information of the feature maps. Our experiments indicate that the proposed LeViT-UNet achieves better performance comparing to various competing methods on several challenging medical image segmentation benchmarks including Synapse and ACDC. Code and models will be publicly available at https://github.com/apple1986/LeViT_UNet.

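Motivation and Objectives

Advance medical image segmentation by combining transformer-based global context with local CNN features.
Propose a lightweight LeViT-based encoder integrated with a U-Net-style decoder.
Develop a multi-scale feature fusion strategy that exploits both transformer and convolutional features.
Evaluate accuracy and efficiency on multiple medical segmentation datasets.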

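Proposed Method

Use LeViT as the encoder to extract global context while reducing FLOPs.
In the last stage of the encoder, concatenate multi-scale features from the convolutional blocks and the transformer blocks.
Keep a CNN-based decoder with cascaded upsampling and skip connections to recover resolution.
Pre-train the LeViT backbone on ImageNet-1k to initialize its parameters.
Compare three variants, LeViT-UNet-128s, -192, and -384, to study how channel count affects performance.
Ablate the transformer blocks, the skip connections, and pre-training to understand the impact of each.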
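To make "multi-scale features" and "cascaded upsampling" concrete, the following is a minimal shape trace in plain Python. It is only a sketch under stated assumptions: the 224x224 input, the four stride-2 convolutions, and the three transformer stages are taken from the general LeViT design rather than from this review, and the function names are hypothetical.

```python
# Illustrative shape trace only. Input size and stage layout are ASSUMPTIONS
# based on the general LeViT design, not values stated in this review.

def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial side length after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_resolutions(input_size: int = 224, conv_layers: int = 4,
                        transformer_downsamples: int = 2) -> dict:
    """Trace feature-map side lengths through a LeViT-like encoder."""
    conv_sizes = []
    size = input_size
    for _ in range(conv_layers):
        size = conv_out(size)
        conv_sizes.append(size)
    # The first transformer stage is assumed to keep the last conv resolution;
    # each later stage is assumed to shrink it with one stride-2 step.
    transformer_sizes = [size]
    for _ in range(transformer_downsamples):
        size = conv_out(size)
        transformer_sizes.append(size)
    return {"conv": conv_sizes, "transformer": transformer_sizes}

def decoder_resolutions(start: int, target: int) -> list:
    """Cascaded 2x upsampling from `start` until `target` is reached."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * 2)
    return sizes

res = encoder_resolutions()
print(res)  # conv and transformer side lengths
print(decoder_resolutions(res["conv"][-1], 224))
```

Under these assumptions the transformer stages produce maps of different sizes, so they would have to be resized to a common resolution before concatenation, while the earlier convolutional maps line up with the decoder's upsampling steps and can serve as skip connections.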

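Experimental Results

Research Questions

RQ1: Can a LeViT-based encoder improve segmentation accuracy within the U-Net framework while keeping near-real-time efficiency?
RQ2: Does fusing multi-scale transformer and CNN features improve both global context and local detail for medical segmentation?
RQ3: How do the number of transformer channels and the number of skip connections affect segmentation performance and boundary accuracy?
RQ4: How does LeViT-UNet compare with state-of-the-art CNN and transformer methods on standard medical datasets (Synapse, ACDC)?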

Key Findings

Method	DSC (%) ↑	HD (mm) ↓	Aorta	Gallbladder	Kidney (L)	Kidney (R)	Liver	Pancreas	Spleen	Stomach	Params (M)	FLOPs (G)	FPS
V-Net	68.81	-	75.34	51.87	77.10	80.75	87.84	40.05	80.56	56.98	-	-	-
DARR	69.77	-	74.74	53.77	72.31	73.24	94.08	54.18	89.90	45.96	-	-	-
U-Net	76.85	39.70	89.07	69.72	77.77	68.60	93.43	53.98	86.67	75.58	-	-	-
R50 U-Net	74.68	36.87	87.74	63.66	80.60	78.19	93.74	56.90	85.87	74.16	-	-	-
R50 Att-UNet	75.57	36.97	55.92	63.91	79.20	72.71	93.56	49.37	87.19	74.95	-	-	-
R50-Deeplabv3+	75.73	26.93	86.18	60.42	81.18	75.27	92.86	51.06	88.69	70.19	-	-	-
R50 ViT	71.29	32.87	73.73	55.13	75.80	72.20	91.51	45.99	81.99	73.95	-	-	-
TransUnet	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62	105.28	24.64	50
SwinUnet	79.13	21.55	85.47	66.53	83.28	79.61	94.29	56.58	90.66	76.60	-	-	-
LeViT-UNet-128s	73.69	23.92	86.45	66.13	79.32	73.56	91.85	49.25	79.29	63.70	15.91	17.55	114
LeViT-UNet-192	74.67	18.86	85.69	57.37	79.08	75.90	92.05	53.53	83.11	70.61	19.90	18.92	95
LeViT-UNet-384	78.53	16.84	87.33	62.23	84.61	80.25	93.11	59.07	88.86	72.76	52.17	25.55	85
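The two headline columns measure different things: DSC scores region overlap (higher is better), while HD scores the worst boundary mismatch (lower is better). The sketch below shows both on a toy pair of masks represented as sets of pixel coordinates. It is an illustration, not the paper's evaluation code: it computes the plain maximum Hausdorff distance in pixel units, whereas the table reports millimetres and may use a percentile variant.

```python
from math import dist

def dice(pred: set, truth: set) -> float:
    """Dice similarity coefficient between two sets of pixel coordinates."""
    if not pred and not truth:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))

def directed_hausdorff(a: set, b: set) -> float:
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(dist(p, q) for q in b) for p in a)

def hausdorff(a: set, b: set) -> float:
    """Symmetric Hausdorff distance; both sets must be non-empty."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy example: a 4x4 square and the same square shifted right by one pixel.
truth = {(r, c) for r in range(4) for c in range(4)}
pred = {(r, c) for r in range(4) for c in range(1, 5)}

print(dice(pred, truth))       # 0.75 (12 shared pixels out of 16 + 16)
print(hausdorff(pred, truth))  # 1.0 (every stray pixel is one step away)
```

A one-pixel shift costs a quarter of the Dice score on this small mask but only one pixel of Hausdorff distance, which is why the two columns can rank methods differently.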

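LeViT-UNet-384 reaches 78.53% DSC and 16.84 mm HD on Synapse, outperforming several SOTA methods in boundary accuracy.
On Synapse, the LeViT-UNet variants achieve competitive per-organ DSC, and LeViT-UNet-384 has the best HD among the reported methods (16.84 mm).
On ACDC, LeViT-UNet-384 reaches 90.32 DSC on the RV and 93.76 DSC on the LV, showing strong cardiac segmentation performance.
Increasing the transformer channel count and adding transformer blocks consistently improves DSC and HD over the non-transformer baseline.
More skip connections generally improve performance, with especially notable gains on smaller organs such as the aorta and gallbladder.
Pre-training benefits the larger transformer backbone (e.g., LeViT-UNet-384), but results are mixed for the smaller versions.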
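The accuracy-versus-efficiency trade-off can be read directly off the Synapse table. The short script below copies the TransUnet row (the only baseline with efficiency figures) and the three LeViT-UNet rows, then computes relative changes. It also checks that the DSC column is consistent with the mean of the eight per-organ scores. The dictionary layout and function names are illustrative choices, not part of the paper.

```python
# Figures copied from the Synapse results table in this review.
synapse = {
    "TransUnet":       {"dsc": 77.48, "hd": 31.69, "params_m": 105.28, "flops_g": 24.64, "fps": 50},
    "LeViT-UNet-128s": {"dsc": 73.69, "hd": 23.92, "params_m": 15.91, "flops_g": 17.55, "fps": 114},
    "LeViT-UNet-192":  {"dsc": 74.67, "hd": 18.86, "params_m": 19.90, "flops_g": 18.92, "fps": 95},
    "LeViT-UNet-384":  {"dsc": 78.53, "hd": 16.84, "params_m": 52.17, "flops_g": 25.55, "fps": 85},
}

# Per-organ DSC for the LeViT-UNet rows, in table column order.
organ_dsc = {
    "LeViT-UNet-128s": [86.45, 66.13, 79.32, 73.56, 91.85, 49.25, 79.29, 63.70],
    "LeViT-UNet-192":  [85.69, 57.37, 79.08, 75.90, 92.05, 53.53, 83.11, 70.61],
    "LeViT-UNet-384":  [87.33, 62.23, 84.61, 80.25, 93.11, 59.07, 88.86, 72.76],
}

def relative_to(baseline: str, model: str) -> dict:
    """Differences and ratios of `model` against `baseline`."""
    b, m = synapse[baseline], synapse[model]
    return {
        "dsc_gain": m["dsc"] - b["dsc"],
        "hd_change_mm": m["hd"] - b["hd"],
        "params_ratio": m["params_m"] / b["params_m"],
        "flops_ratio": m["flops_g"] / b["flops_g"],
        "fps_ratio": m["fps"] / b["fps"],
    }

def mean_dsc(model: str) -> float:
    """Unweighted mean of the per-organ DSC values."""
    scores = organ_dsc[model]
    return sum(scores) / len(scores)

for name in organ_dsc:
    delta = relative_to("TransUnet", name)
    rounded = {k: round(v, 2) for k, v in delta.items()}
    print(name, rounded, "mean organ DSC:", round(mean_dsc(name), 2))
```

By these table values, LeViT-UNet-384 uses roughly half the parameters of TransUnet and runs at 1.7 times its FPS, with slightly higher FLOPs, while the two smaller variants trade some DSC for further speed.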

This review was generated by AI and reviewed by human editors.