QUICK REVIEW

[Paper Review] UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation

Ali Hatamizadeh, Ziyue Xu|arXiv (Cornell University)|Apr 1, 2022

Radiomics and Machine Learning in Medical Imaging | Cited by 29

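One-line summary

UNetFormer introduces a 3D Swin Transformer encoder with CNN- or transformer-based decoders, together with a self-supervised pre-training scheme, and achieves state-of-the-art segmentation on the MSD liver/liver tumor and BraTS brain tumor tasks.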

ABSTRACT

Vision Transformers (ViTs) have recently become popular due to their outstanding modeling capabilities, in particular for capturing long-range information, and their scalability to dataset and model sizes, which has led to state-of-the-art performance in various computer vision and medical image analysis tasks. In this work, we introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and Convolutional Neural Network (CNN) and transformer-based decoders. In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision. The design of the proposed architecture allows for meeting a wide range of trade-off requirements between accuracy and computational cost. In addition, we present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked volumetric tokens using contextual information of visible tokens. We pre-train our framework on a cohort of 5,050 CT images, gathered from publicly available CT datasets, and present a systematic investigation of various components such as masking ratio and patch size that affect the representation learning capability and performance of downstream tasks. We validate the effectiveness of our pre-training approach by fine-tuning and testing our model on the liver and liver tumor segmentation task using the Medical Segmentation Decathlon (MSD) dataset and achieve state-of-the-art performance in terms of various segmentation metrics. To demonstrate its generalizability, we train and test the model on the BraTS 21 dataset for brain tumor segmentation using MRI images and outperform other methods in terms of Dice score. Code: https://github.com/Project-MONAI/research-contributions

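Research motivation and objectives

Use Vision Transformers to capture long-range dependencies and improve 3D medical image segmentation.
Propose the UNetFormer and UNetFormer+ architectures, which link a 3D Swin Transformer encoder to a CNN or transformer decoder through skip connections at five resolutions, with deep supervision.
Introduce a self-supervised pre-training scheme based on reconstructing masked volumetric tokens to improve downstream performance.
Evaluate the framework on liver/liver tumor segmentation (MSD) and brain tumor segmentation (BraTS 21) and show state-of-the-art results.
Analyze how the pre-training components (masking ratio, patch size) and the decoder design (CNN vs. transformer) affect the accuracy/cost trade-off.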

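Proposed method

Use a 3D Swin Transformer encoder to extract multi-scale features from 3D volumetric input.
Connect the encoder to a CNN-based decoder (UNetFormer) or a Swin Transformer-based decoder (UNetFormer+) through skip connections at five resolutions.
Apply deep supervision by combining a cross-entropy/soft Dice loss with segmentation outputs at multiple resolutions.
Implement self-supervised pre-training that reconstructs randomly masked 3D tokens from the visible context, using a lightweight decoder and an L1 loss on the masked tokens.
Pre-train the encoder on 5,050 CT images, fine-tune on MSD liver/liver tumor, and train and test on the BraTS 21 MRI brain tumor dataset to show that the approach generalizes.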
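
The deep supervision step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the per-scale weights, and the strided label downsampling are all assumptions made for the example, and a real training loop would use a framework with automatic differentiation.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dice_ce_loss(logits, target, eps=1e-5):
    """Cross-entropy plus soft Dice for one output.

    logits: float array of shape (C, D, H, W)
    target: integer label array of shape (D, H, W)
    """
    num_classes = logits.shape[0]
    probs = softmax(logits, axis=0)
    onehot = np.moveaxis(np.eye(num_classes)[target], -1, 0)  # (C, D, H, W)

    # Voxel-wise cross-entropy, averaged over the volume.
    ce = -np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=0))

    # Soft Dice per class, then averaged over classes.
    axes = (1, 2, 3)
    intersection = np.sum(probs * onehot, axis=axes)
    denominator = np.sum(probs, axis=axes) + np.sum(onehot, axis=axes)
    dice = (2.0 * intersection + eps) / (denominator + eps)

    return ce + (1.0 - dice.mean())

def deep_supervision_loss(logits_per_scale, target, weights):
    """Weighted average of the Dice + CE loss over several output resolutions.

    Labels are downsampled by simple striding (an assumption for this sketch)
    so that they match each coarser output.
    """
    total = 0.0
    for logits, weight in zip(logits_per_scale, weights):
        factor = target.shape[0] // logits.shape[1]
        coarse_target = target[::factor, ::factor, ::factor]
        total += weight * dice_ce_loss(logits, coarse_target)
    return total / sum(weights)
```

Each decoder resolution contributes its own loss term, so the coarser outputs receive a direct training signal instead of relying only on gradients that flow back from the final prediction.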

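Experimental results

Research questions

RQ1: Does connecting a unified 3D Swin Transformer encoder to CNN or transformer decoders give higher 3D medical image segmentation accuracy than CNN- or ViT-based baselines?
RQ2: Does self-supervised pre-training of the encoder (masked volumetric token reconstruction) improve downstream segmentation performance?
RQ3: How do the masking ratio and patch size affect self-supervised learning and the resulting segmentation performance?
RQ4: How do the CNN-based and transformer-based decoders compare in accuracy and computational efficiency on the liver and brain tumor segmentation tasks?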

Key findings

Model	ET Dice	WT Dice	TC Dice	Mean Dice	#Params (M)	GFLOPs
TransBTS	86.60	90.30	89.81	88.91	30.61	131.88
nnFormer	86.87	92.68	90.15	89.90	149.04	106.45
SegResNet	88.40	92.70	91.70	90.90	18.78	122.16
nnUNet	88.60	92.91	91.40	91.01	33.07	454.34
UNetFormer+	88.48	93.67	91.89	91.20	24.44	39.63
UNetFormer	88.80	93.22	92.10	91.54	58.96	149.50
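
The accuracy/cost trade-off in the table can be made concrete with a little arithmetic. The sketch below copies the mean Dice and GFLOPs columns from the table as-is and compares every model against UNetFormer+; the dictionary layout and variable names are assumptions chosen for the example.

```python
# (mean Dice, GFLOPs) per model, copied from the table above.
results = {
    "TransBTS": (88.91, 131.88),
    "nnFormer": (89.90, 106.45),
    "SegResNet": (90.90, 122.16),
    "nnUNet": (91.01, 454.34),
    "UNetFormer+": (91.20, 39.63),
    "UNetFormer": (91.54, 149.50),
}

reference = "UNetFormer+"
ref_dice, ref_gflops = results[reference]

# For each model: how many times more compute it needs than the reference,
# and how its mean Dice differs from the reference.
comparison = {
    name: (round(gflops / ref_gflops, 2), round(dice - ref_dice, 2))
    for name, (dice, gflops) in results.items()
}

cheapest = min(results, key=lambda name: results[name][1])
most_accurate = max(results, key=lambda name: results[name][0])

for name, (cost_ratio, dice_gap) in comparison.items():
    print(f"{name:12s} {cost_ratio:6.2f}x GFLOPs  {dice_gap:+.2f} mean Dice")
```

On these numbers UNetFormer+ is the cheapest model and UNetFormer the most accurate: UNetFormer gains 0.34 points of mean Dice at roughly 3.8 times the GFLOPs.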

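The pre-trained UNetFormer models outperform the baselines without pre-training on MSD liver and liver tumor segmentation.
Pre-training gives consistent gains over randomly initialized models on both the liver and the liver tumor task.
UNetFormer generally outperforms UNetFormer+ on most liver and brain tumor tasks, while UNetFormer+ does better on large organ/tumor cases.
Both models offer a good accuracy/cost trade-off: UNetFormer+ keeps Dice scores competitive while holding down GFLOPs (in the table above, a mean Dice of 91.20 at 39.63 GFLOPs).
On BraTS 21, UNetFormer and UNetFormer+ outperform several CNN/Swin/ViT baselines in Dice score across all tumor regions.
In the ablations, moderate masking (about 40%) and a patch size of 16^3 favor downstream Dice performance.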
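
To show what a masking ratio of about 40% and a 16^3 patch size mean in practice, here is a minimal NumPy sketch of masked volumetric token reconstruction with an L1 loss restricted to the masked tokens. The 96^3 input size, the function names, and the random seed are hypothetical choices for illustration; the encoder and lightweight decoder are left out, so `pred` stands in for whatever the network would output.

```python
import numpy as np

def patchify(volume, patch):
    """Split a (D, H, W) volume into non-overlapping patch^3 tokens.

    Returns an array of shape (num_tokens, patch**3).
    """
    D, H, W = volume.shape
    d, h, w = D // patch, H // patch, W // patch
    x = volume.reshape(d, patch, h, patch, w, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # gather the three patch-grid axes first
    return x.reshape(d * h * w, patch ** 3)

def random_mask(num_tokens, ratio, rng):
    """Boolean mask with round(num_tokens * ratio) randomly chosen True entries."""
    num_masked = int(round(num_tokens * ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

def masked_l1(pred, target, mask):
    """Mean absolute error computed over the masked tokens only."""
    return np.abs(pred[mask] - target[mask]).mean()

rng = np.random.default_rng(0)
volume = rng.standard_normal((96, 96, 96))  # hypothetical input size
tokens = patchify(volume, patch=16)         # 6 * 6 * 6 = 216 tokens of 4096 voxels
mask = random_mask(len(tokens), ratio=0.4, rng=rng)

# Stand-in for the network output: here, a prediction that is off by 1 everywhere.
pred = tokens + 1.0
loss = masked_l1(pred, tokens, mask)
```

Because the loss only looks at masked tokens, the network gets no credit for copying the visible tokens and has to infer the hidden content from the surrounding context.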

This review was generated by AI and checked by a human editor.