QUICK REVIEW

[Paper Digest] UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation

Yunhe Gao, Mu Zhou|arXiv (Cornell University)|Jul 2, 2021

Radiomics and Machine Learning in Medical Imaging | References: 26 | Cited by: 49

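One-Sentence Summary

UTNet integrates self-attention into a CNN-based U-Net to capture global context at multiple scales, using an efficient attention mechanism and relative position encoding to achieve superior cardiac MRI segmentation performance and cross-vendor robustness without pre-training.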

ABSTRACT

Transformer architecture has emerged to be successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation. UTNet applies self-attention modules in both encoder and decoder for capturing long-range dependency at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism along with relative position encoding that reduces the complexity of self-attention operation significantly from $O(n^2)$ to approximate $O(n)$. A new self-attention decoder is also proposed to recover fine-grained details from the skipped connections in the encoder. Our approach addresses the dilemma that Transformer requires huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of Transformer into convolutional networks without a need of pre-training. We have evaluated UTNet on the multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness against the state-of-the-art approaches, holding the promise to generalize well on other medical image segmentations.

Motivation and Objectives

Motivate the need for long-range context in medical image segmentation beyond conventional CNNs.
Propose a U-shaped hybrid Transformer network (UTNet) that injects efficient self-attention at multiple encoder/decoder levels.
Enable Transformer integration without pretraining through convolutional inductive bias.
Achieve accurate boundary-centric segmentation on high-resolution medical images while maintaining computational efficiency.

Proposed Method

Introduce an efficient self-attention mechanism that reduces complexity from O(n^2) to approximately O(n) by projecting keys and values onto a much smaller set of k tokens (k << n), so that each of the n queries attends to only k positions.
Apply self-attention at multiple levels (encoder and decoder) in a U-Net like architecture to capture multi-scale global context.
Incorporate 2D relative positional encoding to model content-position relationships in medical images.
Use a pre-activation residual block and Transformer block as building blocks within UTNet with identity mappings for skip connections.
Train from scratch without pretraining using a combination of Dice and cross-entropy losses.
Compare UTNet against UNet, ResUNet, CBAM, and Dual-Attention networks on multi-label, multi-vendor cardiac MRI data.
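
The efficient self-attention idea in the list above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: average pooling stands in for the key/value projection, the learned query/key/value projections and the relative positional encoding are omitted, and the names `avg_pool_tokens` and `efficient_attention` are hypothetical. The point is only that the attention map has shape n x k rather than n x n, so the cost grows linearly in n when k is fixed.

```python
import math

def avg_pool_tokens(tokens, height, width, reduced):
    """Average-pool an H x W grid of d-dim tokens down to reduced x reduced tokens.

    Assumes height and width are divisible by `reduced`.
    """
    d = len(tokens[0])
    bh, bw = height // reduced, width // reduced  # pooling window size
    pooled = []
    for i in range(reduced):
        for j in range(reduced):
            cell = [tokens[r * width + c]
                    for r in range(i * bh, (i + 1) * bh)
                    for c in range(j * bw, (j + 1) * bw)]
            pooled.append([sum(t[x] for t in cell) / len(cell) for x in range(d)])
    return pooled

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def efficient_attention(tokens, height, width, reduced):
    """Let all n = H*W queries attend to only k = reduced**2 pooled keys/values."""
    d = len(tokens[0])
    keys = avg_pool_tokens(tokens, height, width, reduced)
    values = keys  # assumption: no learned projections, so keys and values coincide
    scale = 1.0 / math.sqrt(d)
    outputs, attn = [], []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)  # one row of the n x k attention map
        attn.append(w)
        outputs.append([sum(wi * v[x] for wi, v in zip(w, values)) for x in range(d)])
    return outputs, attn

# Toy 8 x 8 feature map with 4 channels, keys/values reduced to a 2 x 2 grid.
H, W, D, R = 8, 8, 4, 2
tokens = [[math.sin(i + j) for j in range(D)] for i in range(H * W)]
out, attn = efficient_attention(tokens, H, W, R)
print(len(attn), len(attn[0]))  # attention map is n x k: 64 4
```

Full self-attention on the same toy input would build a 64 x 64 map; here it is 64 x 4. In a real network the pooling step would typically be a learned or interpolated down-sampling inside a tensor library, which this sketch does not attempt to reproduce.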

Experimental Results

Research Questions

RQ1: Can a hybrid CNN-Transformer architecture improve boundary-focused segmentation in high-resolution medical images without large-scale pretraining?
RQ2: Does multi-level self-attention with relative positional encoding enhance segmentation robustness across different vendors?
RQ3: What is the impact of efficient self-attention and its placement within the network on segmentation performance and computational efficiency?
RQ4: How does UTNet perform compared to state-of-the-art CNN-based segmentation models in a multi-vendor cardiac MRI dataset?

Key Findings

UTNet achieves the highest Dice scores across LV, MYO, and RV on vendor A data (LV 93.1, MYO 83.5, RV 88.2; Average Dice 88.3).
UTNet's parameter count and inference time are comparable to or lower than those of some attention-based baselines (Params 9.53M; Inference Time 0.145 s).
Ablation studies show that placing self-attention at higher encoder/decoder levels and using a reduced projection size of 8 yield the best performance, and that relative positional encoding is essential.
UTNet demonstrates superior robustness to cross-vendor evaluation, maintaining competitive segmentation performance on unseen vendors C and D where other models degrade more.
Compared to Dual-Attention (quadratic complexity), UTNet has lower memory and faster runtime while achieving better segmentation accuracy.
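
For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of a per-class Dice score on flat label lists. The digest does not say how the scores were aggregated (per slice, per volume, or otherwise), so the label ids, the toy data, and the convention of returning 1.0 when a class is absent from both masks are assumptions for illustration only.

```python
def dice_score(pred, target, label):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for one class over flat label lists."""
    p = [x == label for x in pred]
    t = [x == label for x in target]
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    # Assumption: a class missing from both masks counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

# Hypothetical label ids: 0 = background, 1 = LV, 2 = MYO, 3 = RV.
pred   = [0, 1, 1, 2, 2, 3, 3, 3]
target = [0, 1, 1, 2, 3, 3, 3, 3]
per_class = {name: dice_score(pred, target, label)
             for name, label in [("LV", 1), ("MYO", 2), ("RV", 3)]}
print(per_class)

# The reported average is the plain mean of the three class scores.
print(round((93.1 + 83.5 + 88.2) / 3, 1))  # 88.3
```

The last line confirms that the Average Dice of 88.3 quoted above is consistent with an unweighted mean of the LV, MYO, and RV scores.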

This digest was generated by AI and reviewed by a human editor.