QUICK REVIEW

[Paper Review] VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging

Yufan He, Pengfei Guo|arXiv (Cornell University)|Jun 7, 2024

Medical Imaging Techniques and Applications | Cited by 5

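One-Sentence Summary

VISTA3D proposes a unified foundation model for 3D CT segmentation with an automatic branch and an interactive branch. Built on 11,454 CT scans covering 127 classes, it reports state-of-the-art automatic and interactive segmentation, top zero-shot performance, and strong transfer learning.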

ABSTRACT

Foundation models for interactive segmentation in 2D natural images and videos have sparked significant interest in building 3D foundation models for medical imaging. However, the domain gaps and clinical use cases for 3D medical imaging require a dedicated model that diverges from existing 2D solutions. Specifically, such foundation models should support a full workflow that can actually reduce human effort. Treating 3D medical images as sequences of 2D slices and reusing interactive 2D foundation models seems straightforward, but 2D annotation is too time-consuming for 3D tasks. Moreover, for large cohort analysis, it's the highly accurate automatic segmentation models that reduce the most human effort. However, these models lack support for interactive corrections and lack zero-shot ability for novel structures, which is a key feature of "foundation". While reusing pre-trained 2D backbones in 3D enhances zero-shot potential, their performance on complex 3D structures still lags behind leading 3D models. To address these issues, we present VISTA3D, Versatile Imaging SegmenTation and Annotation model, that targets to solve all these challenges and requirements with one unified foundation model. VISTA3D is built on top of the well-established 3D segmentation pipeline, and it is the first model to achieve state-of-the-art performance in both 3D automatic (supporting 127 classes) and 3D interactive segmentation, even when compared with top 3D expert models on large and diverse benchmarks. Additionally, VISTA3D's 3D interactive design allows efficient human correction, and a novel 3D supervoxel method that distills 2D pretrained backbones grants VISTA3D top 3D zero-shot performance. We believe the model, recipe, and insights represent a promising step towards a clinically useful 3D foundation model. Code and weights are publicly available at https://github.com/Project-MONAI/VISTA.

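Research Motivation and Objectives

Define a foundation model for 3D CT segmentation that covers common anatomical structures and is accurate out of the box.
Enable interactive refinement to improve segmentation results.
Develop zero-shot capability to segment unseen structures with minimal annotation.
Provide a training recipe and architecture that support fast adaptation to new classes.
Evaluate on large-scale CT datasets and compare against task-specific models.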

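Proposed Method

A two-branch architecture shares a SegResNet-based image encoder: an automatic head for out-of-the-box segmentation of 127 classes, and an interactive head that accepts 3D point prompts.
The automatic branch uses learnable class embeddings and a post-mapping layer to produce class-specific segmentation logits.
The interactive branch adapts SAM's point prompt encoder to 3D, applying 3D downsampling in the point head and using class-aware point embeddings.
Training follows a four-stage recipe that uses manual labels, pseudo-labels from TotalSegmentator, and 3D supervoxels derived from SAM to enable zero-shot segmentation.
Automatic and interactive outputs are merged with a component-level refinement strategy that preserves correct regions (Alg. 1).
The dataset comprises 11,454 CT volumes covering 127 classes, augmented with pseudo-labels and supervoxels; training uses 3D patches and inference uses a sliding window.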
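
The component-level merge can be illustrated with a small sketch. This is not the paper's Alg. 1: it assumes that an edit proposed by the interactive branch is accepted only when the changed connected component contains a user click, and that everything else in the automatic result is left untouched. Masks are represented as sets of (z, y, x) voxels, and the names `connected_components` and `merge_auto_and_point` are hypothetical.

```python
from collections import deque

# 6-connectivity offsets for (z, y, x) voxels.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def connected_components(voxels):
    """Split a set of (z, y, x) voxels into 6-connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBORS:
                neighbor = (z + dz, y + dy, x + dx)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def merge_auto_and_point(auto_mask, point_mask, positive_clicks, negative_clicks):
    """Edit the automatic mask using only the clicked components of the interactive mask.

    Assumption (not taken from the paper): an added component is accepted only
    if it contains a positive click, and a removed component only if it
    contains a negative click. All other voxels keep the automatic result.
    """
    merged = set(auto_mask)
    # Regions the interactive branch wants to add.
    for component in connected_components(point_mask - auto_mask):
        if component & positive_clicks:
            merged |= component
    # Regions the interactive branch wants to remove.
    for component in connected_components(auto_mask - point_mask):
        if component & negative_clicks:
            merged -= component
    return merged


auto = {(0, 0, 0), (0, 0, 1)}
point = {(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 5)}
merged = merge_auto_and_point(auto, point, positive_clicks={(0, 0, 2)}, negative_clicks=set())
print(sorted(merged))  # [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
```

In this toy case the isolated voxel at (0, 0, 5) is discarded because no click touches it, which is the sense in which a component-level rule can keep correct regions of the automatic output from being overwritten.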

Figure 1: The VISTA3D model contains two branches that share the same image encoder. The top auto-branch performs out-of-the-box automatic segmentation for 127 supported classes. The bottom interactive branch accepts user clicks and performs interactive segmentation on both supported classes and nov
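
To make the automatic branch more concrete, here is a minimal sketch of how a per-class embedding could select a class-specific output from shared encoder features. It assumes the logit for a class at a voxel is the dot product of that class's embedding with the voxel's feature vector; the post-mapping layer is omitted, the function names are hypothetical, and the actual VISTA3D head may differ.

```python
def class_logits(voxel_features, class_embeddings):
    """Return logits[c][v]: dot product of class embedding c with voxel feature v."""
    return [
        [sum(e * f for e, f in zip(embedding, feature)) for feature in voxel_features]
        for embedding in class_embeddings
    ]


def binary_masks(logits, threshold=0.0):
    """Threshold each class's logits independently into a binary mask."""
    return [[1 if value > threshold else 0 for value in row] for row in logits]


# Three voxels with 2-dimensional features and two class embeddings (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = [[1.0, -1.0], [-1.0, 1.0]]

logits = class_logits(features, embeddings)
masks = binary_masks(logits)
print(logits)  # [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]
print(masks)   # [[1, 0, 0], [0, 1, 0]]
```

Because each class is scored independently against the same features, supporting another class in this sketch only means adding another embedding row rather than changing the encoder.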

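Experimental Results

Research Questions

RQ1: Can a single 3D CT segmentation model achieve accurate out-of-the-box performance across diverse organs and lesions?
RQ2: Can a unified interactive branch achieve effective zero-shot segmentation of unseen classes in 3D CT data?
RQ3: Can a two-branch architecture trained with pseudo-labels and supervoxels match the performance of task-specific models trained on manual annotations?
RQ4: How does VISTA3D perform under few-shot fine-tuning when adapting to new datasets or abnormalities?

Key Findings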

Dataset	Auto3dSeg	nnUNet	TotalSegmentator	VISTA3D auto	VISTA3D point	VISTA3D auto+point
MSD03 Hepatic tumor [3]	0.616	0.617	-	0.588	0.701	0.687
MSD06 Lung tumor [3]	0.562	0.554	-	0.613	0.682	0.719
MSD07 Pancreatic tumor [3]	0.485	0.488	-	0.324	0.603	0.638
MSD08 Hepatic tumor [3]	0.683	0.659	-	0.682	0.733	0.757
MSD09 Spleen [3]	0.965	0.967	0.966	0.952	0.938	0.954
MSD10 Colon tumor [3]	0.475	0.473	-	0.439	0.609	0.633
Airway [43]	0.896	0.899	-	0.852	0.819	0.867
Bone Lesion	0.343	0.396	-	0.491	0.536	0.585
BTCV-Abdomen [37]	0.807	0.825	0.846	0.849	0.815	0.859
BTCV-Cervix [38]	0.598	0.640	0.611	0.672	0.736	0.775
VerSe [40]	0.786	0.828	0.832	0.825	0.896	0.906
AbdomenCT-1K [5]	0.934	0.939	0.921	0.935	0.903	0.940
AMOS22 [21]	0.854	0.854	0.824	0.841	0.785	0.856
TotalSegV2 [7]	0.882	*0.906	*0.942	0.893	0.884	0.918
Average	0.706	0.718	-	0.711	0.760	0.792
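
The Average row can be checked by recomputing the column means from the 14 dataset rows. The sketch below copies the scores from the table, drops the asterisk on the nnUNet TotalSegV2 entry for the arithmetic, and skips the TotalSegmentator column because it has missing entries. The variable and function names are illustrative.

```python
# Per-dataset scores copied from the table, in row order (MSD03 ... TotalSegV2).
scores = {
    "Auto3dSeg": [0.616, 0.562, 0.485, 0.683, 0.965, 0.475, 0.896,
                  0.343, 0.807, 0.598, 0.786, 0.934, 0.854, 0.882],
    "nnUNet": [0.617, 0.554, 0.488, 0.659, 0.967, 0.473, 0.899,
               0.396, 0.825, 0.640, 0.828, 0.939, 0.854, 0.906],
    "VISTA3D auto": [0.588, 0.613, 0.324, 0.682, 0.952, 0.439, 0.852,
                     0.491, 0.849, 0.672, 0.825, 0.935, 0.841, 0.893],
    "VISTA3D point": [0.701, 0.682, 0.603, 0.733, 0.938, 0.609, 0.819,
                      0.536, 0.815, 0.736, 0.896, 0.903, 0.785, 0.884],
    "VISTA3D auto+point": [0.687, 0.719, 0.638, 0.757, 0.954, 0.633, 0.867,
                           0.585, 0.859, 0.775, 0.906, 0.940, 0.856, 0.918],
}

# Values from the table's Average row.
reported = {
    "Auto3dSeg": 0.706,
    "nnUNet": 0.718,
    "VISTA3D auto": 0.711,
    "VISTA3D point": 0.760,
    "VISTA3D auto+point": 0.792,
}


def column_means(table):
    """Unweighted mean of each column across datasets."""
    return {name: sum(values) / len(values) for name, values in table.items()}


means = column_means(scores)
for name, mean in means.items():
    print(f"{name}: recomputed {mean:.4f}, reported {reported[name]:.3f}")

best = max(means, key=means.get)
print("Highest average:", best)
```

The recomputed means agree with the reported averages to within rounding, and VISTA3D auto+point has the highest average of the five complete columns.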

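VISTA3D achieves competitive out-of-the-box automatic segmentation across 127 classes on the test data.
The interactive branch makes zero-shot segmentation of unseen classes more effective, and performance improves with iterative point prompts.
On several datasets, VISTA3D auto+point outperforms the baselines in an evaluation that uses only a single click, indicating strong interactive refinement.
When fine-tuned on a small number of cases (1 to 10), VISTA3D achieves larger gains than the baselines on the mouse micro-CT and WORD datasets.
Zero-shot results show improved segmentation on external datasets (mouse organs and adrenal/liver tumors) when the interactive zero-shot approach is used.
Introducing synthetic data in the automatic branch further improves robustness to diverse structures.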

Figure 2: Generated supervoxels from Alg. 2, showing examples in axial, sagittal, and coronal views. Different colours represent different supervoxels.


This review was generated by AI and reviewed by human editors.