QUICK REVIEW

[Paper Review] UNETR: Transformers for 3D Medical Image Segmentation

Ali Hatamizadeh, Yucheng Tang | arXiv (Cornell University) | 2021-03-18

Radiomics and Machine Learning in Medical Imaging | References: 52 | Citations: 210

ONE-LINE SUMMARY

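UNETR uses a transformer encoder to process a 3D medical volume as a sequence of patches and a CNN-based decoder with skip connections to produce accurate 3D segmentations, achieving state-of-the-art results on the BTCV and MSD datasets.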

ABSTRACT

Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr

RESEARCH MOTIVATION AND GOALS

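Motivates the use of transformers to capture long-range 3D context in medical image segmentation.
Proposes the UNETR architecture, which connects a transformer encoder directly to a CNN decoder through skip connections.
Demonstrates its effectiveness on BTCV multi-organ segmentation and on the MSD brain tumor and spleen segmentation datasets.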

PROPOSED METHOD

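The 3D volume is divided into non-overlapping patches, and each patch is projected into a K-dimensional embedding.
The patch sequence is processed by a ViT-B16-style transformer encoder (L=12 layers, K=768, patch size 16^3).
A 1D positional embedding is added, and the class token is omitted because the task is semantic segmentation.
Intermediate transformer representations (z3, z6, z9, z12) are extracted, reshaped into spatial tensors, and fused with the CNN-based decoder through skip connections.
3x3x3 convolutions project the transformer features into the decoder at multiple resolutions, deconvolutions perform the upsampling, and a final 1x1x1 convolution followed by a softmax produces the voxel-wise prediction.
Training uses a combination of soft Dice and cross-entropy losses, and inference uses a patch-based sliding window with an overlap of 0.5.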
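
The shape bookkeeping, loss, and inference scheme described above can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the 96x96x96 single-channel input is an assumed example size, the loss uses one common soft Dice formulation (squared terms in the denominator) that may differ in detail from the paper's, and the window placement rule is a simple assumption, not necessarily what MONAI's sliding-window inference does.

```python
import math


def tokenize_shape(volume_shape, patch_size=16, channels=1):
    """Patch grid, token count, and flattened token length for cubic patches."""
    for extent in volume_shape:
        if extent % patch_size != 0:
            raise ValueError(f"extent {extent} is not divisible by {patch_size}")
    grid = tuple(extent // patch_size for extent in volume_shape)
    n_tokens = math.prod(grid)                # sequence length N
    token_len = patch_size ** 3 * channels    # voxels per token before projection to K
    return grid, n_tokens, token_len


def soft_dice_ce_loss(probs, onehot, eps=1e-8):
    """Soft Dice plus cross-entropy (assumed formulation).

    probs, onehot: one list per voxel, each holding one value per class.
    """
    n_voxels, n_classes = len(probs), len(probs[0])
    dice_sum = 0.0
    for j in range(n_classes):
        inter = sum(g[j] * y[j] for g, y in zip(onehot, probs))
        denom = sum(g[j] ** 2 for g in onehot) + sum(y[j] ** 2 for y in probs)
        dice_sum += 2.0 * inter / (denom + eps)
    ce = -sum(
        g[j] * math.log(y[j] + eps)
        for g, y in zip(onehot, probs)
        for j in range(n_classes)
    ) / n_voxels
    return 1.0 - dice_sum / n_classes + ce


def window_starts(extent, window, overlap=0.5):
    """Start indices of sliding windows along one axis (assumed placement rule)."""
    if window >= extent:
        return [0]
    stride = max(1, int(window * (1 - overlap)))
    starts = list(range(0, extent - window, stride))
    starts.append(extent - window)  # clamp the last window to the volume edge
    return starts


grid, n_tokens, token_len = tokenize_shape((96, 96, 96))
print(grid, n_tokens, token_len)

labels = [[1, 0], [0, 1]]  # two voxels, two classes, one-hot ground truth
print(soft_dice_ce_loss(labels, labels))                    # near zero: perfect prediction
print(soft_dice_ce_loss([[0.5, 0.5], [0.5, 0.5]], labels))  # larger: uninformative prediction

print(window_starts(160, 96, overlap=0.5))
```

Under these assumptions, a 96x96x96 volume yields a 6x6x6 grid of 216 tokens, each a flattened 4096-voxel patch before projection to K=768. The same 6x6x6 grid is the spatial shape that the intermediate representations are reshaped to before they enter the decoder.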

EXPERIMENTAL RESULTS

Research Questions

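RQ1: Can a transformer encoder trained on 3D patches capture long-range dependencies for segmentation in volumetric medical images?
RQ2: Does connecting transformer-derived features to a CNN-based decoder through multi-resolution skip connections improve segmentation accuracy over CNN-only or transformer-only baselines?
RQ3: How do decoder design, patch resolution, and model size affect segmentation performance on 3D medical images?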

Key Results

UNETR achieves state-of-the-art performance on BTCV in both the Standard and Free Competitions.
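On the MSD brain tumor and spleen segmentation tasks, UNETR outperforms the competing methods.
On BTCV, the average Dice score improves clearly over the baselines, with the largest gains on small organs such as the gallbladder and adrenal glands.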
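On MSD, UNETR yields higher Dice scores than the strongest baseline across all brain tumor subregions and on spleen segmentation.
The model has about 92.58M parameters and 41.19G FLOPs, and its inference time is competitive with other transformer-based methods (about 12.08 seconds on average).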
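
The results above are reported as Dice scores. As a reminder of what the metric measures, here is a minimal sketch for binary masks stored as flat 0/1 sequences. The function name and the empty-mask convention are assumptions for illustration, and the voxel counts in the example are made up, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy masks (made-up sizes): each prediction misses exactly one foreground voxel.
large_target = [1] * 100 + [0] * 100
large_pred = [1] * 99 + [0] * 101
small_target = [1] * 4 + [0] * 196
small_pred = [1] * 3 + [0] * 197

large_dice = dice_score(large_pred, large_target)  # 2*99 / (99 + 100)
small_dice = dice_score(small_pred, small_target)  # 2*3 / (3 + 4)
print(f"large structure: {large_dice:.3f}, small structure: {small_dice:.3f}")
```

In this toy case the same one-voxel miss costs far more Dice on the 4-voxel structure than on the 100-voxel one, which is one reason gains on small organs are worth reporting separately.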

This review was generated by AI and reviewed by a human editor.