QUICK REVIEW

[Paper Review] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Jieneng Chen, Yongyi Lu | arXiv (Cornell University) | Feb 8, 2021

Advanced Neural Network Applications | References: 16 | Citations: 3,794

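One-Line Summary

TransUNet combines high-resolution CNN features with the global context of a Transformer to achieve state-of-the-art medical image segmentation, outperforming both pure-CNN and pure-Transformer baselines on multiple datasets.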

ABSTRACT

Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.

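Motivation and Objectives

Explain why CNNs (U-Net) struggle to model long-range dependencies in medical image segmentation.
Propose a hybrid CNN-Transformer encoder that exploits both high-resolution detail and global context.
Design a cascaded upsampling decoder with skip connections to recover fine spatial detail.
Show empirical gains over CNN-based and Transformer-based baselines on multiple medical imaging tasks.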

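Proposed Method

Tokenize image patches and capture global context with a Transformer.
Use a CNN feature map as the source of the patches fed to the Transformer embedding, so that high-resolution features are retained (hybrid encoder).
Upsample the Transformer features with a cascaded upsampler (CUP) and fuse them with CNN features through U-Net-style skip connections.
Train with standard SGD on a pre-trained backbone; the defaults are a 224x224 input and a patch size of 16.
Compare the "None" (naive upsampling) decoder against the CUP decoder, and compare different encoder choices.
Provide ablations on skip connections, input resolution, patch size, and model scale.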
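
The tokenization step above can be illustrated with a small sketch. It is not the authors' implementation: the helper names `sequence_length` and `patchify` are hypothetical, the image is a plain nested list rather than a tensor, and patches are cut directly from the input image. In the hybrid encoder the patches come from a CNN feature map instead, which this sketch does not model.

```python
def sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens when a square image is cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened patches in row-major order.

    Hypothetical helper for illustration; each returned token is a flat list
    of patch_size * patch_size pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens


if __name__ == "__main__":
    # Default setting from the review: 224x224 input, patch size 16.
    print(sequence_length(224, 16))  # 196

    # Toy 4x4 image split into 2x2 patches gives 4 tokens of 4 values each.
    toy = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    print(patchify(toy, 2)[0])  # [0, 1, 4, 5]
```

With the default setting this yields 196 tokens. Halving the patch size quadruples the token count, which is why smaller patches cost more in self-attention.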

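Experimental Results

Research Questions

RQ1: Can a Transformer, when complemented by CNN-based fine detail, serve as a strong encoder for medical image segmentation?
RQ2: Do the hybrid CNN-Transformer encoder and the cascaded upsampling decoder outperform pure-Transformer or pure-CNN baselines on medical segmentation tasks?
RQ3: How do skip connections, input resolution, patch size, and model scale affect segmentation quality?
RQ4: Does TransUNet generalize across CT multi-organ segmentation and cardiac MRI segmentation datasets?

Key Findings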

Framework	Encoder	Decoder	Mean DSC (%) ↑	HD (mm) ↓	Aorta	Gallbladder	Kidney (L)	Kidney (R)	Liver	Pancreas	Spleen	Stomach
V-Net	V-Net	-	68.81	-	75.34	51.87	77.10	80.75	87.84	40.05	80.56	56.98
DARR	DARR	-	69.77	-	74.74	53.77	72.31	73.24	94.08	54.18	89.90	45.96
R50-U-Net	R50	U-Net	74.68	36.87	84.18	62.84	79.19	71.29	93.35	48.23	84.41	73.92
R50-AttnUNet	R50	AttnUNet	75.57	36.97	55.92	63.91	79.20	72.71	93.56	49.37	87.19	74.95
ViT	ViT	None	61.50	39.61	44.38	39.59	67.46	62.94	89.21	43.14	75.45	69.78
ViT	ViT	CUP	67.86	36.11	70.19	45.10	74.70	67.40	91.32	42.00	81.75	70.44
R50-ViT	R50-ViT	CUP	71.29	32.87	73.73	55.13	75.80	72.20	91.51	45.99	81.99	73.95
TransUNet	R50-ViT	CUP	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
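
The DSC columns in the table report the Dice similarity coefficient, 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. The following is a minimal sketch of that formula for a single pair of 2D binary masks, not the evaluation code used in the paper. The function name `dice_coefficient` is hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks given as lists of rows."""
    flat_pred = [int(bool(value)) for row in pred for value in row]
    flat_target = [int(bool(value)) for row in target for value in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels labelled foreground in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    target = [[1, 0, 0], [0, 1, 1]]
    # 2 overlapping pixels, 3 foreground pixels in each mask: 2*2 / (3+3).
    print(round(dice_coefficient(pred, target), 4))  # 0.6667
```

Multiplying the result by 100 gives a percentage comparable in form to the table entries. The table values themselves are averages over test cases that this toy example does not reproduce.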

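TransUNet achieves a mean Dice score (DSC) of 77.48% on the Synapse multi-organ CT dataset with the R50-ViT encoder and CUP decoder, and reaches 89.71% DSC on the ACDC cardiac MRI dataset (see Table 5).
Ablations show that adding skip connections at more CUP resolutions improves performance, with the best results obtained when skips are used at the 1/2, 1/4, and 1/8 scales.
The hybrid encoder (CNN + ViT) outperforms both the pure-ViT and pure-CNN baselines, showing the benefit of combining high-resolution CNN features with global Transformer context.
The CUP decoder improves markedly over naive upsampling, and the larger model performs better (Base vs. Large comparison).
Raising the input resolution to 512x512 improves mean DSC by about 6.88% at the expense of higher computational cost; a patch size of 16 (sequence length 196) performs better than larger patches.
Qualitatively, TransUNet produces fewer false positives and preserves organ boundaries more accurately than CNN-only and other Transformer-based models.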
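
The skip-connection scales in the findings above can be related to the decoder's resolution stages with a short sketch. It assumes that each CUP stage doubles the spatial resolution and that decoding starts from a feature map of side `image_size / patch_size`. Neither detail is stated in this review, and the function name `cup_schedule` is hypothetical.

```python
def cup_schedule(image_size=224, patch_size=16, skip_scales=(8, 4, 2)):
    """List (feature_map_side, skip_used) for each decoder stage.

    Assumes x2 upsampling per stage; skip_scales holds the downsampling
    factors (8 means 1/8 resolution) at which a skip connection is fused.
    """
    if patch_size <= 0 or patch_size & (patch_size - 1) != 0:
        raise ValueError("patch_size must be a power of two for x2 stages")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    side = image_size // patch_size
    stages = [(side, False)]  # encoder output reshaped to a 2D grid
    while side < image_size:
        side *= 2
        downscale = image_size // side
        stages.append((side, downscale in skip_scales))
    return stages


if __name__ == "__main__":
    for side, skip_used in cup_schedule():
        print(side, "skip" if skip_used else "no skip")
```

Under these assumptions the default configuration passes through sides 14, 28, 56, 112, and 224, with skips fused at 28 (1/8), 56 (1/4), and 112 (1/2).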


This review was generated by AI and checked by a human editor.