QUICK REVIEW

[Paper Review] UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation

Ali Hatamizadeh, Ziyue Xu|arXiv (Cornell University)|2022. 04. 01.

Radiomics and Machine Learning in Medical Imaging|Citations: 29

One-line summary

UNetFormer pairs a 3D Swin Transformer encoder with CNN or transformer decoders and adds a self-supervised pre-training scheme, achieving state-of-the-art segmentation performance on the MSD liver/liver tumor and BraTS brain tumor tasks.

ABSTRACT

Vision Transformers (ViT)s have recently become popular due to their outstanding modeling capabilities, in particular for capturing long-range information, and scalability to dataset and model sizes which has led to state-of-the-art performance in various computer vision and medical image analysis tasks. In this work, we introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and Convolutional Neural Network (CNN) and transformer-based decoders. In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision. The design of proposed architecture allows for meeting a wide range of trade-off requirements between accuracy and computational cost. In addition, we present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked volumetric tokens using contextual information of visible tokens. We pre-train our framework on a cohort of $5050$ CT images, gathered from publicly available CT datasets, and present a systematic investigation of various components such as masking ratio and patch size that affect the representation learning capability and performance of downstream tasks. We validate the effectiveness of our pre-training approach by fine-tuning and testing our model on liver and liver tumor segmentation task using the Medical Segmentation Decathlon (MSD) dataset and achieve state-of-the-art performance in terms of various segmentation metrics. To demonstrate its generalizability, we train and test the model on BraTS 21 dataset for brain tumor segmentation using MRI images and outperform other methods in terms of Dice score. Code: https://github.com/Project-MONAI/research-contributions

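Research motivation and goals

Use 3D Vision Transformers to capture long-range dependencies and thereby improve 3D medical image segmentation.
Propose the UNetFormer and UNetFormer+ architectures, which connect a 3D Swin Transformer encoder to a CNN or transformer decoder.
Introduce a self-supervised pre-training scheme based on reconstructing masked volumetric tokens to improve downstream performance.
Evaluate the framework on the MSD liver/liver tumor and BraTS 21 MRI brain tumor datasets and report state-of-the-art results.
Analyze the pre-training components (masking ratio, patch size) and the accuracy/cost trade-off of the decoder design (CNN vs. transformer).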

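Proposed method

A 3D Swin Transformer encoder extracts multi-resolution features from the 3D volumetric input.
The encoder is connected to either the CNN-based UNetFormer decoder or the Swin Transformer-based UNetFormer+ decoder through skip connections at five resolutions, with deep supervision.
Deep supervision is applied through multi-resolution segmentation outputs, trained with a combined cross-entropy and soft Dice loss.
For self-supervised pre-training, a lightweight decoder reconstructs randomly masked 3D tokens from the visible context, with an L1 loss computed on the masked tokens.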
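The encoder is pre-trained on 5,050 CT images; the model is then fine-tuned on the MSD liver/liver tumor task and trained and tested on the BraTS 21 MRI brain tumor dataset to demonstrate transferability.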
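
The masked-token pre-training step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 96^3 crop size, the function names (`token_grid`, `sample_mask`, `masked_l1_loss`), and the toy 4-value tokens are all assumptions made for the example, and a real model would predict the masked voxels from the visible tokens rather than use a fake prediction.

```python
import random


def token_grid(volume_shape, patch_size):
    """Number of non-overlapping cubic patches along each axis."""
    if any(dim % patch_size for dim in volume_shape):
        raise ValueError("each volume dimension must be divisible by patch_size")
    return tuple(dim // patch_size for dim in volume_shape)


def sample_mask(num_tokens, mask_ratio, rng):
    """Pick the indices of the tokens to hide, uniformly at random."""
    num_masked = round(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def masked_l1_loss(pred_tokens, target_tokens, masked_ids):
    """Mean absolute error computed over the masked tokens only."""
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(pred_tokens[i], target_tokens[i]):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0


# Hypothetical settings: a 96^3 crop split into 16^3 patches, 40% masked.
rng = random.Random(0)
grid = token_grid((96, 96, 96), patch_size=16)
num_tokens = grid[0] * grid[1] * grid[2]
masked_ids = sample_mask(num_tokens, mask_ratio=0.4, rng=rng)

# Toy tokens (4 values each instead of 16^3 voxels) and a fake prediction
# that is off by 0.5 everywhere, standing in for a decoder output.
targets = [[rng.random() for _ in range(4)] for _ in range(num_tokens)]
preds = [[v + 0.5 for v in tok] for tok in targets]
loss = masked_l1_loss(preds, targets, masked_ids)

print(grid, num_tokens, len(masked_ids), loss)
```

Restricting the loss to masked positions means the encoder is rewarded only for inferring hidden content from context, not for copying tokens it can already see.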

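Experimental results

Research questions

RQ1: Can connecting a 3D Swin Transformer encoder to CNN/transformer decoders improve segmentation accuracy over CNN- or ViT-based baselines on 3D medical images?
RQ2: Does self-supervised pre-training of the encoder via masked volumetric token reconstruction improve downstream segmentation performance?
RQ3: How do the masking ratio and patch size affect self-supervised learning and subsequent segmentation performance?
RQ4: How do CNN-based and transformer-based decoders compare in accuracy and computational efficiency on liver and brain tumor segmentation tasks?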

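Key results

The pre-trained UNetFormer model outperforms baselines without pre-training on MSD liver and liver tumor segmentation.
Pre-training yields consistent gains over randomly initialized models on the liver and liver tumor tasks.
On most liver and brain tumor tasks, UNetFormer generally outperforms UNetFormer+, while UNetFormer+ excels on large organ/tumor cases.
The models offer a favorable accuracy-cost trade-off: UNetFormer+ lowers GFLOPs while keeping Dice scores competitive.
On BraTS 21, UNetFormer and UNetFormer+ surpass several CNN-, Swin-, and ViT-based baselines in Dice score across all tumor regions.
Ablation studies show that moderate masking (about 40%) and a patch size of 16^3 benefit downstream Dice performance.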
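
The Dice score that these results are reported in measures the overlap between a predicted mask P and a reference mask T as 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flat binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Toy example: 2 voxels overlap, each mask has 3 foreground voxels.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)
print(score)
```

Because the denominator counts only foreground voxels, small structures such as tumors influence the score as much as large ones, which is one reason the metric is common in segmentation benchmarks.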

This review was generated by AI and checked by a human editor.