QUICK REVIEW

[Paper Review] Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Hu Cao, Yueyue Wang | arXiv (Cornell University) | 2021-05-12

Topic: Advanced Neural Network Applications | References: 31 | Citations: 899

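One-Line Summary

Swin-Unet proposes a pure Transformer U-shaped encoder-decoder with skip connections for 2D medical image segmentation, using no CNN components; it achieves the best results among the compared methods on Synapse and strong performance on ACDC without any convolution.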

ABSTRACT

In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at https://github.com/HuCaoFighting/Swin-Unet.

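Motivation and Objectives

Start from the observation that CNNs struggle to capture global, long-range interactions in medical image segmentation because convolution is a local operation.
Propose a pure Transformer, Unet-like architecture (Swin-Unet) that models context from local to global.
Learn multi-resolution features through skip connections in a symmetric Transformer U-Net.
Introduce a patch expanding layer that performs up-sampling without convolution.
Demonstrate robustness and generalization on multi-organ CT and cardiac MRI segmentation datasets.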

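Proposed Method

Partition the 2D medical image into non-overlapping 4x4 patches and embed each patch as a token feature.
Learn multi-scale representations with a hierarchical Swin Transformer encoder that down-samples via patch merging.
Use a symmetric Swin Transformer-based decoder that up-samples via patch expanding layers.
Add skip connections that fuse the encoder's multi-scale features with the decoder features.
Initialize from ImageNet pre-trained weights, train with standard SGD, and evaluate on the Synapse and ACDC datasets.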
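
The shape bookkeeping behind the steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the `linear` helper is a hypothetical stand-in for a learned projection (random weights, no bias, no normalization), the embedding width of 96 is an assumption, and fusing the skip connection by concatenation followed by a linear projection is one plausible reading of "fuse". Attention blocks are omitted entirely; only partitioning, merging, expanding and the skip fusion are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, out_dim):
    """Hypothetical stand-in for a learned linear layer (random weights, no bias)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w

def patch_partition(img, p=4):
    """Split an (H, W, C) image into non-overlapping p x p patches, one flat token each."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)              # (H/p, W/p, p, p, C)
    return x.reshape(H // p, W // p, p * p * C)

def patch_merging(x):
    """Down-sample 2x: concatenate each 2x2 neighbourhood (C -> 4C), project to 2C."""
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    x = x.reshape(h // 2, w // 2, 4 * c)
    return linear(x, 2 * c)

def patch_expanding(x):
    """Up-sample 2x: project C -> 2C, then move channels into a 2x2 spatial block."""
    h, w, c = x.shape
    x = linear(x, 2 * c)                        # (h, w, 2C)
    x = x.reshape(h, w, 2, 2, c // 2)           # split 2C into 2 x 2 x C/2
    x = x.transpose(0, 2, 1, 3, 4)              # (h, 2, w, 2, C/2)
    return x.reshape(2 * h, 2 * w, c // 2)

image = rng.standard_normal((224, 224, 3))      # one 224x224 input slice
patches = patch_partition(image)                # (56, 56, 48): 4*4*3 values per token
tokens = linear(patches, 96)                    # (56, 56, 96), assumed width
down = patch_merging(tokens)                    # (28, 28, 192)
up = patch_expanding(down)                      # (56, 56, 96)
# Skip connection (assumed form): concatenate encoder and decoder features, project back.
fused = linear(np.concatenate([up, tokens], axis=-1), 96)
print(patches.shape, tokens.shape, down.shape, up.shape, fused.shape)
```

Patch merging halves the resolution while doubling the channel width, and patch expanding does the reverse, so encoder and decoder features at the same depth have matching shapes and can be fused directly.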

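Experimental Results

Research Questions

RQ1: Can a pure Transformer U-Net (Swin-Unet) achieve competitive segmentation performance without any CNN components?
RQ2: How do patch merging down-sampling and patch expanding up-sampling affect segmentation accuracy and boundary precision?
RQ3: How do skip connections, input size, and model scale affect segmentation performance across organs and datasets?
RQ4: Does Swin-Unet generalize across medical imaging modalities, including CT and MRI, and across multi-organ and cardiac segmentation tasks?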

Key Results

Method	DSC ↑	HD ↓	Aorta	Gallbladder	Kidney(L)	Kidney(R)	Liver	Pancreas	Spleen	Stomach
V-Net	68.81	-	75.34	51.87	77.10	80.75	87.84	40.05	80.56	56.98
DARR	69.77	-	74.74	53.77	72.31	73.24	94.08	54.18	89.90	45.96
R50 U-Net	74.68	36.87	87.74	63.66	80.60	78.19	93.74	56.90	85.87	74.16
U-Net	76.85	39.70	89.07	69.72	77.77	68.60	93.43	53.98	86.67	75.58
R50 Att-UNet	75.57	36.97	55.92	63.91	79.20	72.71	93.56	49.37	87.19	74.95
Att-UNet	77.77	36.02	89.55	68.88	77.98	71.11	93.57	58.04	87.30	75.75
R50 ViT	71.29	32.87	73.73	55.13	75.80	72.20	91.51	45.99	81.99	73.95
TransUnet	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
SwinUnet	79.13	21.55	85.47	66.53	83.28	79.61	94.29	56.58	90.66	76.60
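
As a quick sanity check on the table, the short script below recomputes the summary DSC from the per-organ columns and compares Swin-Unet with TransUnet organ by organ. It assumes the DSC column is the unweighted mean of the eight organ columns, which the table's numbers support up to rounding; the variable names are illustrative only.

```python
organs = ["Aorta", "Gallbladder", "Kidney(L)", "Kidney(R)",
          "Liver", "Pancreas", "Spleen", "Stomach"]

# Per-organ DSC values copied from the table rows.
swin_unet = [85.47, 66.53, 83.28, 79.61, 94.29, 56.58, 90.66, 76.60]
trans_unet = [87.23, 63.13, 81.87, 77.02, 94.08, 55.86, 85.08, 75.62]

# Assumption: the summary DSC is the unweighted mean over the eight organs.
mean_swin = sum(swin_unet) / len(swin_unet)
mean_trans = sum(trans_unet) / len(trans_unet)
print(f"Swin-Unet mean DSC: {mean_swin:.2f} (reported 79.13)")
print(f"TransUnet mean DSC: {mean_trans:.2f} (reported 77.48)")

# Per-organ difference, Swin-Unet minus TransUnet.
diffs = {o: round(s - t, 2) for o, s, t in zip(organs, swin_unet, trans_unet)}
worse = [o for o, d in diffs.items() if d < 0]
print("Organs where Swin-Unet trails TransUnet:", worse)
```

On these numbers, Swin-Unet trails TransUnet only on the aorta, and its largest gain is on the spleen.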

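Swin-Unet achieves the best DSC (79.13) and the lowest HD (21.55) among the methods evaluated on the Synapse dataset.
Its boundary predictions are strong: the HD of 21.55 improves on every baseline that reports HD, the closest being TransUnet at 31.69.
On the ACDC dataset, Swin-Unet achieves an average DSC of 90.00 (RV 88.55, Myo 85.62, LV 95.83), outperforming several baselines.
In the ablation study, patch expanding up-sampling outperforms both bilinear interpolation and transposed convolution.
Increasing the input size from 224 to 384 raises per-organ DSC on Synapse at a higher computational cost, while scaling the model beyond the Tiny variant yields limited benefit.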
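
For readers unfamiliar with the two metrics reported above, here is a minimal NumPy sketch of the Dice similarity coefficient (DSC) and a symmetric Hausdorff distance (HD) on binary masks. It is an illustration under simplifying assumptions, not the evaluation code behind the reported numbers: distances are measured in pixels between all foreground pixels rather than between extracted surfaces, no percentile variant is applied, and both masks are assumed to be non-empty for `hausdorff`.

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks, in [0, 1]."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0                                  # both masks empty: treat as perfect
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance (pixels) between foreground sets; masks must be non-empty."""
    a = np.argwhere(pred)                           # (Na, 2) foreground coordinates
    b = np.argwhere(gt)                             # (Nb, 2)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    # Largest nearest-neighbour distance, taken in both directions.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

# Toy example: a 4x4 ground-truth square and a prediction shifted right by one pixel.
gt = np.zeros((10, 10), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((10, 10), dtype=bool)
pred[2:6, 3:7] = True

print("DSC:", dice(pred, gt))
print("HD :", hausdorff(pred, gt))
```

DSC rewards region overlap, while HD is driven by the single worst boundary error, which is why the two columns can rank methods differently.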


This review was generated by AI and reviewed by a human editor.