[Paper Review] ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration

Junyu Chen, Yufan He | arXiv (Cornell University) | Apr 13, 2021
Advanced Neural Network Applications | References: 20 | Cited by: 139

One-Sentence Summary

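ViT-V-Net introduces a hybrid ConvNet-Transformer architecture for unsupervised volumetric medical image registration and achieves higher Dice scores than top-performing methods on brain MRI data.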

ABSTRACT

In the last decade, convolutional neural networks (ConvNets) have dominated and achieved state-of-the-art performances in a variety of medical imaging applications. However, the performances of ConvNets are still limited by lacking the understanding of long-range spatial relations in an image. The recently proposed Vision Transformer (ViT) for image classification uses a purely self-attention-based model that learns long-range spatial relations to focus on the relevant parts of an image. Nevertheless, ViT emphasizes the low-resolution features because of the consecutive downsamplings, resulting in a lack of detailed localization information, making it unsuitable for image registration. Recently, several ViT-based image segmentation methods have been combined with ConvNets to improve the recovery of detailed localization information. Inspired by them, we present ViT-V-Net, which bridges ViT and ConvNet to provide volumetric medical image registration. The experimental results presented here demonstrate that the proposed architecture achieves superior performance to several top-performing registration methods.

Research Motivation and Objectives

  • Motivate deformable image registration (DIR) and address limitations of convolutional networks in modeling long-range spatial relations.
  • Propose a hybrid ViT-ConvNet architecture to enable long-range feature learning for 3D image registration.
  • Demonstrate that ViT-V-Net improves registration accuracy (Dice) and maintains localization via long skip connections.
  • Evaluate against state-of-the-art registration methods on a brain MRI dataset and provide implementation details.

Proposed Method

  • Encode high-level features from fixed and moving 3D images through ConvNet blocks and pooling to reduce resolution.
  • Partition high-level features into N patches and apply a Vision Transformer to learn long-range relations.
  • Embed patches with a linear projection and add learnable position embeddings for spatial information.
  • Pass Transformer outputs through a V-Net–style decoder with long skip connections to preserve localization details.
  • Predict a dense displacement field u, warp the moving image with a spatial transformer, and optimize a loss combining MSE similarity and a diffusion regularizer.
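The training objective in the last step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes volumes are nested lists indexed `[z][y][x]`, a displacement field indexed `[component][z][y][x]`, forward differences with unit voxel spacing for the diffusion term, and a weight `lam` whose value is left to the caller. The function names are hypothetical, and plain Python lists are used only to keep the sketch dependency-free; a real pipeline would operate on GPU tensors.

```python
from itertools import product


def mse(warped, fixed):
    """Mean squared error between two equally shaped volumes indexed [z][y][x]."""
    a = [v for plane in warped for row in plane for v in row]
    b = [v for plane in fixed for row in plane for v in row]
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def diffusion_regularizer(u):
    """Mean squared forward difference of the displacement field u.

    u[c][z][y][x] is component c (0, 1, 2) of the displacement at voxel (z, y, x).
    Forward differences with unit spacing are an assumption of this sketch.
    """
    total, count = 0.0, 0
    for comp in u:
        depth, height, width = len(comp), len(comp[0]), len(comp[0][0])
        for z, y, x in product(range(depth), range(height), range(width)):
            here = comp[z][y][x]
            if z + 1 < depth:
                total += (comp[z + 1][y][x] - here) ** 2
                count += 1
            if y + 1 < height:
                total += (comp[z][y + 1][x] - here) ** 2
                count += 1
            if x + 1 < width:
                total += (comp[z][y][x + 1] - here) ** 2
                count += 1
    return total / count if count else 0.0


def registration_loss(warped, fixed, u, lam):
    """Similarity term plus lam times the smoothness penalty on u."""
    return mse(warped, fixed) + lam * diffusion_regularizer(u)
```

A spatially constant displacement field incurs no penalty under this regularizer, so only non-uniform deformations are discouraged; larger `lam` values trade image similarity for a smoother field.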

Experimental Results

Research Questions

  • RQ1: Can a hybrid ConvNet-Transformer architecture improve unsupervised 3D image registration compared to fully ConvNet-based registration methods?
  • RQ2: Do Vision Transformer–based encodings enhance long-range spatial relationships critical for accurate volumetric alignment?
  • RQ3: Is the ViT-V-Net architecture able to achieve higher Dice scores than leading DIR methods on brain MRI data?

Key Findings

Method | Dice
Affine | 0.569 ± 0.171
NiftyReg | 0.713 ± 0.134
SyN | 0.688 ± 0.140
VoxelMorph-1 | 0.707 ± 0.137
VoxelMorph-2 | 0.711 ± 0.135
ViT-V-Net | 0.726 ± 0.130

  • ViT-V-Net achieves higher Dice scores than several top registration methods in the tested setup.
  • Reported Dice in the main comparison table: ViT-V-Net 0.726 ± 0.130 versus Affine 0.569 ± 0.171, NiftyReg 0.713 ± 0.134, SyN 0.688 ± 0.140, VoxelMorph-1 0.707 ± 0.137, and VoxelMorph-2 0.711 ± 0.135.
  • ViT-V-Net trained with long skip connections retains localization information and shows lower training loss and higher validation Dice.
  • Statistical tests (paired t-test) show ViT-V-Net significantly better than several competitors (p-values listed in the paper).
  • The paper reports GPU running times for the method, indicating that the approach is feasible for practical use.
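For readers unfamiliar with the metric behind these numbers, the Dice score for one anatomical label can be sketched as below. This is an illustrative computation, not the paper's evaluation code: it assumes the two segmentations are flattened into equal-length sequences of integer labels, the function name `dice` is hypothetical, and returning 1.0 when the label is absent from both inputs is a convention chosen for this sketch.

```python
def dice(seg_a, seg_b, label):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for one label.

    seg_a and seg_b are equal-length sequences of integer labels,
    e.g. a warped moving segmentation and the fixed segmentation.
    """
    in_a = [v == label for v in seg_a]
    in_b = [v == label for v in seg_b]
    intersection = sum(1 for x, y in zip(in_a, in_b) if x and y)
    size = sum(in_a) + sum(in_b)
    # Assumed convention: a label missing from both inputs counts as perfect overlap.
    return 2.0 * intersection / size if size else 1.0


warped_labels = [0, 1, 1, 2, 2, 2]
fixed_labels = [0, 1, 2, 2, 2, 0]
scores = [dice(warped_labels, fixed_labels, lab) for lab in (1, 2)]
mean_dice = sum(scores) / len(scores)
```

A score of 1 means the two label regions coincide exactly and 0 means they do not overlap; averaging over labels, as in the last two lines, is one common way to summarize a multi-structure segmentation.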

This review was generated by AI and reviewed by a human editor.