QUICK REVIEW

[Paper Review] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Jieneng Chen, Yongyi Lu | arXiv (Cornell University) | 2021-02-08

Advanced Neural Network Applications | References: 16 | Citations: 3,794

One-Line Summary

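TransUNet combines high-resolution CNN features with the global context of a Transformer to achieve state-of-the-art performance in medical image segmentation, outperforming pure CNN and pure Transformer baselines on multiple datasets.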

ABSTRACT

Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.

Motivation and Objectives

Motivate why CNNs (U-Net) struggle with long-range dependencies in medical segmentation.
Propose a hybrid CNN-Transformer encoder to leverage both high-resolution details and global context.
Design a cascaded upsampling decoder with skip connections to recover fine spatial details.
Demonstrate empirical gains over CNN-based and Transformer-based baselines on multiple medical imaging tasks.

Proposed Method

Tokenize image patches and encode with a Transformer to capture global context.
Use a CNN feature map to supply high-resolution patches for Transformer embedding (hybrid encoder).
Upsample Transformer features with a cascaded upsampler (CUP) and fuse via U-Net-like skip connections.
Train with standard SGD on pretrained backbones; by default, use 224x224 inputs with patch size 16.
Compare “None” (naive upsampling) vs CUP decoder and different encoder choices.
Provide ablations on skip connections, resolution, patch size, and model scale.
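
The shapes implied by the steps above can be traced with a small Python sketch. It assumes (the summary does not state this) that the token grid has side image_size / patch_size and that each stage of the cascaded upsampler doubles the spatial side until full resolution is reached. The function names are hypothetical and are not taken from the TransUNet repository.

```python
from typing import List


def sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def cup_resolutions(image_size: int, patch_size: int) -> List[int]:
    """Spatial side length at each decoder stage.

    Assumption: every stage upsamples by 2x, so patch_size must be a power of two.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if patch_size < 1 or patch_size & (patch_size - 1) != 0:
        raise ValueError("patch_size must be a power of two for 2x stages")
    side = image_size // patch_size  # token grid reshaped to side x side
    sides = [side]
    while side < image_size:
        side *= 2  # one 2x upsampling stage
        sides.append(side)
    return sides


if __name__ == "__main__":
    print(sequence_length(224, 16))   # tokens for the default setting
    print(sequence_length(224, 32))   # tokens for a larger patch
    print(cup_resolutions(224, 16))   # decoder side lengths up to full resolution
```

Under these assumptions, a 224x224 input with patch size 16 gives 196 tokens on a 14x14 grid, and the decoder passes through sides 28, 56, and 112 before reaching 224. Those three intermediate sides are the 1/8, 1/4, and 1/2 scales where U-Net-like skip connections can be fused.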

Experimental Results

Research Questions

RQ1: Can Transformers serve as strong encoders for medical image segmentation when complemented with CNN-based fine details?
RQ2: Does a hybrid CNN-Transformer encoder plus a cascaded upsampling decoder outperform pure Transformer or pure CNN baselines in medical segmentation tasks?
RQ3: What is the impact of skip connections, input resolution, patch size, and model scale on segmentation quality?
RQ4: How well does TransUNet generalize across CT multi-organ segmentation and cardiac MRI segmentation datasets?

Key Findings

TransUNet achieves a state-of-the-art average Dice similarity coefficient (DSC) of 77.48% on the Synapse multi-organ CT dataset with its hybrid R50-ViT encoder and CUP decoder, and reaches 89.71% DSC on the ACDC cardiac MRI dataset (per Table 5).
Ablations show that adding skip connections at multiple CUP resolutions improves performance, with the best results when skip connections are added at the 1/2, 1/4, and 1/8 resolution scales.
Hybrid encoder (CNN + ViT) outperforms pure ViT and pure CNN baselines, demonstrating the benefit of combining high-resolution CNN features with global Transformer context.
CUP decoder significantly improves over naive upsampling, and larger model size yields better performance (Base vs Large in their tests).
Raising the input resolution from 224x224 to 512x512 improves average DSC by about 6.88 percentage points at a higher computational cost; patch size 16 (sequence length 196) performs better than larger patches.
Qualitative results show TransUNet produces fewer false positives and preserves fine organ boundaries compared to CNN-only and other Transformer-based models.
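
The results above are reported as Dice similarity coefficient (DSC). As a reminder of what that number measures, here is a minimal Python sketch of per-class Dice, 2|A∩B| / (|A| + |B|), averaged over classes. It works on flattened label lists. The function names, the toy data, and the convention of returning 1.0 when a class is absent from both maps are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Dice similarity coefficient for one class over flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    denom = pred_count + target_count
    if denom == 0:
        return 1.0  # assumed convention: class absent from both maps
    return 2.0 * overlap / denom


def mean_dice(pred: Sequence[int], target: Sequence[int], labels: Sequence[int]) -> float:
    """Average of per-class Dice over the given foreground labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(dice_score(pred, target, lb) for lb in labels) / len(labels)


if __name__ == "__main__":
    # Toy flattened label maps: 0 = background, 1 and 2 = two hypothetical organs.
    prediction = [0, 1, 1, 2, 2, 2]
    ground_truth = [0, 1, 2, 2, 2, 0]
    print(dice_score(prediction, ground_truth, 1))
    print(mean_dice(prediction, ground_truth, [1, 2]))
```

Averaging over foreground classes only, as done here, is one common choice; which classes are included changes the reported average, so comparisons should use the same class set.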

This review was generated by AI and checked by a human editor.