QUICK REVIEW

[Paper Review] UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation

Yunhe Gao, Mu Zhou|arXiv (Cornell University)|2021. 07. 02.

Radiomics and Machine Learning in Medical Imaging|References: 26|Citations: 49

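One-Line Summary

UTNet integrates self-attention into a CNN-based U-Net to capture global context at multiple scales. With an efficient attention mechanism and relative position encoding, it achieves strong cardiac MRI segmentation and cross-vendor robustness without pre-training.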

ABSTRACT

Transformer architecture has emerged to be successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation. UTNet applies self-attention modules in both encoder and decoder for capturing long-range dependency at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism along with relative position encoding that reduces the complexity of self-attention operation significantly from $O(n^2)$ to approximate $O(n)$. A new self-attention decoder is also proposed to recover fine-grained details from the skipped connections in the encoder. Our approach addresses the dilemma that Transformer requires huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of Transformer into convolutional networks without a need of pre-training. We have evaluated UTNet on the multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness against the state-of-the-art approaches, holding the promise to generalize well on other medical image segmentations.

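Research Motivation and Goals

Motivate the need for long-range context, beyond what conventional CNNs capture, in medical image segmentation.
Propose UTNet, a U-shaped hybrid Transformer network that injects efficient self-attention at multiple encoder and decoder levels.
Enable Transformer integration without pre-training by relying on the inductive bias of convolution.
Achieve accurate, boundary-focused segmentation on high-resolution medical images while maintaining computational efficiency.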

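Proposed Method

Introduce an efficient self-attention mechanism that projects keys and values into a low-dimensional space, reducing complexity from O(n^2) to approximately O(n).
Apply self-attention at multiple levels (encoder and decoder) of a U-Net-like architecture to capture multi-scale global context.
Introduce 2D relative position encoding to model content-position relationships in medical images.
Use pre-activation residual blocks and Transformer blocks as the building blocks of UTNet, with identity mappings for the skip connections.
Train from scratch, without pre-training, using a combination of Dice and cross-entropy losses.
Compare UTNet with UNet, ResUNet, CBAM, and Dual-Attention networks on multi-label, multi-vendor cardiac MRI data.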
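
As an illustration of the efficient self-attention item above, here is a minimal single-head NumPy sketch. It is not the authors' implementation. It assumes that keys and values are reduced by average-pooling the H x W feature grid down to a fixed reduce_size x reduce_size grid (the review only says they are projected to a low-dimensional space), and it omits the relative position encoding and multi-head logic. The names pool_tokens, efficient_self_attention, and reduce_size are hypothetical. The point is that the attention map has shape (n, k) with fixed k, rather than (n, n), so its cost grows linearly with the number of pixels n.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_tokens(feat, reduce_size):
    # Average-pool an (H, W, d) map to (reduce_size**2, d) tokens.
    # Assumption: H and W are divisible by reduce_size.
    H, W, d = feat.shape
    bh, bw = H // reduce_size, W // reduce_size
    pooled = feat.reshape(reduce_size, bh, reduce_size, bw, d).mean(axis=(1, 3))
    return pooled.reshape(reduce_size * reduce_size, d)

def efficient_self_attention(x, w_q, w_k, w_v, reduce_size=8):
    # x: (H, W, c) feature map; w_q, w_k, w_v: (c, d) projection matrices.
    H, W, _ = x.shape
    d = w_q.shape[1]
    q = (x @ w_q).reshape(H * W, d)         # (n, d): one query per pixel
    k = pool_tokens(x @ w_k, reduce_size)   # (k, d): reduced keys
    v = pool_tokens(x @ w_v, reduce_size)   # (k, d): reduced values
    attn = softmax(q @ k.T / np.sqrt(d))    # (n, k) instead of (n, n)
    out = attn @ v                          # (n, d)
    return out.reshape(H, W, d), attn

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out, attn = efficient_self_attention(x, w_q, w_k, w_v, reduce_size=8)
print(out.shape, attn.shape)  # (32, 32, 16) (1024, 64)
```

With a 32 x 32 input, full self-attention would need a 1024 x 1024 attention map; the reduced version here needs 1024 x 64, and the second dimension stays at 64 no matter how large the input grows.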

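Experimental Results

Research Questions

RQ1: Can a hybrid CNN-Transformer architecture improve boundary-focused segmentation on high-resolution medical images without large-scale pre-training?
RQ2: Does multi-level self-attention with relative position encoding improve segmentation robustness across different vendors?
RQ3: How do efficient self-attention and its placement within the network affect segmentation performance and computational efficiency?
RQ4: How does UTNet perform against state-of-the-art CNN-based segmentation models on a multi-vendor cardiac MRI dataset?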

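Key Results

UTNet achieves the highest Dice scores on vendor A data for LV, MYO, and RV (LV 93.1, MYO 83.5, RV 88.2; average Dice 88.3).
UTNet's parameter count and inference time are comparable to, and in some cases worse than, those of some attention-based baselines, but remain acceptable (Params 9.53M; inference time 0.145 s).
In the ablation study, placing self-attention at the higher encoder/decoder levels and using a reduced projection of size 8 gives the best performance, and relative position encoding is essential.
UTNet is highly robust in cross-vendor evaluation: it maintains competitive segmentation performance on the unseen vendors C and D, while the other models degrade more.
Compared with the quadratic-complexity Dual-Attention network, UTNet uses less memory and runs faster while achieving better segmentation accuracy.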
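
The Dice scores quoted above can be read with the following small sketch, which computes a per-class Dice coefficient on integer label maps. The label ids (1 = LV, 2 = MYO, 3 = RV), the function name dice_per_class, and the toy arrays are assumptions for illustration only, not the paper's evaluation code or data. The last lines check that the reported average is the mean of the three reported class scores.

```python
import numpy as np

def dice_per_class(pred, target, class_ids):
    # Dice = 2 * |P ∩ T| / (|P| + |T|), computed separately for each label id.
    scores = {}
    for c in class_ids:
        p = pred == c
        t = target == c
        inter = np.logical_and(p, t).sum()
        denom = p.sum() + t.sum()
        # Convention (assumption): a class absent from both maps scores 1.0.
        scores[c] = 1.0 if denom == 0 else 2.0 * inter / denom
    return scores

# Toy 3 x 3 label maps with hypothetical ids: 0 = background, 1 = LV, 2 = MYO, 3 = RV.
target = np.array([[0, 1, 1],
                   [2, 2, 3],
                   [3, 3, 0]])
pred = np.array([[0, 1, 0],
                 [2, 2, 3],
                 [3, 0, 0]])
scores = dice_per_class(pred, target, class_ids=(1, 2, 3))
print({c: round(float(s), 3) for c, s in scores.items()})  # {1: 0.667, 2: 1.0, 3: 0.8}

# Mean of the reported vendor A scores (LV 93.1, MYO 83.5, RV 88.2).
reported_mean = round((93.1 + 83.5 + 88.2) / 3, 1)
print(reported_mean)  # 88.3
```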

This review was generated by AI and reviewed by a human editor.