QUICK REVIEW

[Paper Review] ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration

Junyu Chen, Yufan He|arXiv (Cornell University)|Apr 13, 2021

Advanced Neural Network Applications | References: 20 | Cited by: 139

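One-Sentence Summary

ViT-V-Net introduces a hybrid ConvNet-Transformer architecture for unsupervised volumetric medical image registration and achieves better Dice performance on brain MRI data than several top-performing methods.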

ABSTRACT

In the last decade, convolutional neural networks (ConvNets) have dominated and achieved state-of-the-art performances in a variety of medical imaging applications. However, the performances of ConvNets are still limited by lacking the understanding of long-range spatial relations in an image. The recently proposed Vision Transformer (ViT) for image classification uses a purely self-attention-based model that learns long-range spatial relations to focus on the relevant parts of an image. Nevertheless, ViT emphasizes the low-resolution features because of the consecutive downsamplings, resulting in a lack of detailed localization information, making it unsuitable for image registration. Recently, several ViT-based image segmentation methods have been combined with ConvNets to improve the recovery of detailed localization information. Inspired by them, we present ViT-V-Net, which bridges ViT and ConvNet to provide volumetric medical image registration. The experimental results presented here demonstrate that the proposed architecture achieves superior performance to several top-performing registration methods.

Motivation and Objectives

Motivate deformable image registration (DIR) and address limitations of convolutional networks in modeling long-range spatial relations.
Propose a hybrid ViT-ConvNet architecture to enable long-range feature learning for 3D image registration.
Demonstrate that ViT-V-Net improves registration accuracy (Dice) and maintains localization via long skip connections.
Evaluate against state-of-the-art registration methods on a brain MRI dataset and provide implementation details.

Proposed Method

Encode high-level features from fixed and moving 3D images through ConvNet blocks and pooling to reduce resolution.
Partition high-level features into N patches and apply a Vision Transformer to learn long-range relations.
Embed patches with a linear projection and add learnable position embeddings for spatial information.
Pass Transformer outputs through a V-Net–style decoder with long skip connections to preserve localization details.
Predict a dense displacement field u, warp the moving image with a spatial transformer, and optimize a loss combining MSE similarity and a diffusion regularizer.
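
The objective in the last step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. The function names, the forward-difference approximation of the spatial gradient, and the weight `lam` are assumptions made for illustration; the summary only states that the loss combines an MSE similarity term with a diffusion regularizer on the displacement field u. The warped image is assumed to have been produced already by the spatial transformer.

```python
import numpy as np


def mse_similarity(warped, fixed):
    """Mean squared intensity difference between the warped moving image and the fixed image."""
    return float(np.mean((warped - fixed) ** 2))


def diffusion_regularizer(u):
    """Mean squared forward difference of a displacement field.

    u has shape (3, D, H, W): one displacement component per spatial axis.
    Forward differences are an assumed discretization of the spatial gradient.
    """
    dz = np.diff(u, axis=1)  # change along depth
    dy = np.diff(u, axis=2)  # change along height
    dx = np.diff(u, axis=3)  # change along width
    return float((np.mean(dz ** 2) + np.mean(dy ** 2) + np.mean(dx ** 2)) / 3.0)


def registration_loss(warped, fixed, u, lam=1.0):
    """Similarity term plus lam times the smoothness term (lam is a placeholder weight)."""
    return mse_similarity(warped, fixed) + lam * diffusion_regularizer(u)


# Toy volumes: a spatially constant displacement field has zero smoothness penalty,
# so the loss reduces to the MSE term alone.
fixed = np.zeros((4, 4, 4))
warped = np.ones((4, 4, 4))
u_constant = np.zeros((3, 4, 4, 4))
loss = registration_loss(warped, fixed, u_constant, lam=1.0)
```

The regularizer penalizes only spatial variation in u, so a pure translation costs nothing while abrupt local changes in the field are discouraged; `lam` trades image similarity against smoothness.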

Experimental Results

Research Questions

RQ1: Can a hybrid ConvNet-Transformer architecture improve unsupervised 3D image registration compared to purely ConvNet-based registration methods?
RQ2: Do Vision Transformer–based encodings enhance long-range spatial relationships critical for accurate volumetric alignment?
RQ3: Is the ViT-V-Net architecture able to achieve higher Dice scores than leading DIR methods on brain MRI data?

Key Findings

Method	Affine	NiftyReg	SyN	VoxelMorph-1	VoxelMorph-2	ViT-V-Net
Dice	0.569 ± 0.171	0.713 ± 0.134	0.688 ± 0.140	0.707 ± 0.137	0.711 ± 0.135	0.726 ± 0.130

ViT-V-Net achieves higher Dice scores than several top registration methods in the tested setup.
Reported Dice on the main comparison table: ViT-V-Net 0.726 ± 0.130 vs. others (Affine 0.569 ± 0.171, NiftyReg 0.713 ± 0.134, SyN 0.688 ± 0.140, VoxelMorph-1 0.707 ± 0.137, VoxelMorph-2 0.711 ± 0.135).
ViT-V-Net trained with long skip connections retains localization information and shows lower training loss and higher validation Dice.
Paired t-tests show that ViT-V-Net performs significantly better than several competitors (p-values are listed in the paper).
The paper reports GPU run times for the method, indicating that the approach is feasible for practical use.
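
For readers unfamiliar with the metric used in these findings, the Dice score between two segmentations can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the paper's evaluation code: the function name `dice_score`, the use of integer label maps, and the choice to return NaN when a label is absent from both maps are all assumptions.

```python
import numpy as np


def dice_score(seg_a, seg_b, label):
    """Dice overlap of one anatomical label between two integer label maps.

    Dice = 2 * |A ∩ B| / (|A| + |B|), where A and B are the voxel sets
    carrying `label` in each map.
    """
    a = seg_a == label
    b = seg_b == label
    denom = int(a.sum()) + int(b.sum())
    if denom == 0:
        return float("nan")  # label absent from both maps: overlap is undefined
    return 2.0 * int(np.logical_and(a, b).sum()) / denom


# Toy 2D label maps standing in for a warped moving segmentation and a fixed one.
warped_seg = np.array([[1, 1, 0],
                       [2, 2, 0]])
fixed_seg = np.array([[1, 0, 0],
                      [2, 2, 2]])

dice_label_1 = dice_score(warped_seg, fixed_seg, label=1)  # 2*1 / (2+1)
dice_label_2 = dice_score(warped_seg, fixed_seg, label=2)  # 2*2 / (2+3)
```

A score of 1 means the two label regions coincide exactly and 0 means they do not overlap; registration studies typically average this over anatomical structures and subjects.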

This review was generated by AI and reviewed by a human editor.