QUICK REVIEW

[Paper Review] Large-Vocabulary Segmentation for Medical Images with Text Prompts

Ziheng Zhao, Yao Zhang|arXiv (Cornell University)|Dec 28, 2023

Multimodal Machine Learning Applications | Cited by 15

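One-Line Summary

SAT is a general-purpose medical image segmentation model driven by text prompts; it performs 3D segmentation of 362 classes across 31 datasets spanning multiple imaging modalities, and SAT-Nano, with only 107M parameters, performs on par with specialist nnU-Nets.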

ABSTRACT

This paper aims to build a model that can Segment Anything in 3D medical images, driven by medical terminologies as Text prompts, termed as SAT. Our main contributions are three-fold: (i) We construct the first multimodal knowledge tree on human anatomy, including 6502 anatomical terminologies; Then, we build the largest and most comprehensive segmentation dataset for training, collecting over 22K 3D scans from 72 datasets, across 497 classes, with careful standardization on both image and label space; (ii) We propose to inject medical knowledge into a text encoder via contrastive learning and formulate a large-vocabulary segmentation model that can be prompted by medical terminologies in text form; (iii) We train SAT-Nano (110M parameters) and SAT-Pro (447M parameters). SAT-Pro achieves comparable performance to 72 nnU-Nets -- the strongest specialist models trained on each dataset (over 2.2B parameters combined) -- over 497 categories. Compared with the interactive approach MedSAM, SAT-Pro consistently outperforms across all 7 human body regions with +7.1% average Dice Similarity Coefficient (DSC) improvement, while showing enhanced scalability and robustness. On 2 external (cross-center) datasets, SAT-Pro achieves higher performance than all baselines (+3.7% average DSC), demonstrating superior generalization ability.
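
The abstract reports its results as Dice Similarity Coefficient (DSC). As a minimal illustration of that metric (not the authors' evaluation code), the sketch below computes DSC = 2|A∩B| / (|A| + |B|) for two flattened binary masks. The function name `dice_coefficient`, the toy masks, and the convention that two empty masks score 1.0 are assumptions made for this example.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled 1 in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy example: a 6-voxel volume flattened to a list (hypothetical values).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(dice_coefficient(pred, target))
```

Here the masks share 2 foreground voxels and each contains 3, so the score is 2·2 / (3 + 3), about 0.67.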

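Motivation and Objectives

Build a large-scale, multi-dataset medical segmentation corpus with labels unified across datasets.
Inject multimodal medical domain knowledge into the text encoder to guide segmentation.
Develop a universal segmentation model that uses text prompts to segment diverse targets across modalities and body regions.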

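Proposed Method

Construct a multimodal medical knowledge tree from e-Anatomy, UMLS, and the segmentation datasets.
Pre-train the text and visual encoders with knowledge-enhanced contrastive learning to align anatomical text with atlas-based visual concepts.
Train SAT-Nano using a 3D U-Net backbone guided by text prompts, together with a transformer-based query module and a mask generator.
Use a two-stage vision-language training pipeline that freezes the text encoder in the later stage.
Implement dataset preprocessing and sampling strategies to balance the 31 datasets and 362 classes.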
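
To make the idea of text-prompted mask generation concrete, here is a toy sketch of one common pattern in query-based segmentation: each prompt is turned into a query vector, and a voxel is assigned to the prompt's mask when the dot product between the query and that voxel's feature vector is positive enough. This review does not specify how SAT's mask generator combines queries and features, so the dot-product-plus-sigmoid rule, the function `masks_from_prompts`, and all vectors below are assumptions for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def masks_from_prompts(voxel_features, prompt_queries, threshold=0.5):
    """Return one flat binary mask per text prompt.

    voxel_features: list of per-voxel feature vectors (a flattened volume),
        standing in for the output of a segmentation backbone.
    prompt_queries: dict mapping prompt text to a query vector of the same
        dimension, standing in for the output of a text encoder and query module.
    """
    masks = {}
    for prompt, query in prompt_queries.items():
        # Similarity between the prompt query and every voxel feature.
        logits = [sum(f * q for f, q in zip(feat, query)) for feat in voxel_features]
        masks[prompt] = [1 if sigmoid(z) > threshold else 0 for z in logits]
    return masks


# Hypothetical 4-voxel volume with 2-dimensional features.
voxel_features = [[2.0, -1.0], [-1.5, 2.5], [0.5, 0.5], [-2.0, -2.0]]
# Hypothetical query vectors for two prompts.
prompt_queries = {"liver": [1.0, 0.0], "spleen": [0.0, 1.0]}

masks = masks_from_prompts(voxel_features, prompt_queries)
print(masks)
```

Because every prompt shares the same voxel features, adding a new class only requires a new query vector, which is what lets a single model cover a large vocabulary.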

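Experimental Results

Research Questions

RQ1: Can a single universal model segment a wide range of anatomical structures and lesions across multiple modalities using only text prompts?
RQ2: How large should the training corpus be, and what kind of knowledge integration improves generalization across datasets?
RQ3: Does the compact SAT-Nano match the performance of task-specific nnU-Nets across the 31 datasets?
RQ4: How does knowledge-enhanced representation learning affect text-image alignment for segmentation prompts?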

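Key Findings

SAT-Nano (107M parameters) can segment 362 categories across 31 datasets using text prompts.
The model achieves performance comparable to 36 specialist nnU-Nets, each trained on a single dataset or subset.
Training uses 11K 3D scans from 31 datasets, demonstrating effective cross-dataset generalization across body regions.
Two-stage vision-language training with knowledge injection improves the alignment between medical concepts in text and atlas-based visual features.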
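
The alignment between text concepts and visual features mentioned above is typically trained with a contrastive objective. The review does not give the exact loss used for SAT, so the sketch below assumes a symmetric InfoNCE-style loss (as popularized by CLIP-like models): matching text/visual pairs sit on the diagonal of a similarity matrix and are pulled together, while mismatched pairs are pushed apart. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def _normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(sim):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(s - m) for s in row))
        loss += log_sum - row[i]
    return loss / len(sim)


def contrastive_loss(text_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired text and visual embeddings."""
    t = [_normalize(v) for v in text_emb]
    v = [_normalize(u) for u in visual_emb]
    # Cosine similarity between every text and every visual embedding.
    sim = [[sum(a * b for a, b in zip(ti, vj)) / temperature for vj in v] for ti in t]
    sim_t = [list(col) for col in zip(*sim)]
    # Average the text-to-visual and visual-to-text directions.
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))


# Toy embeddings: in `aligned`, the i-th text matches the i-th visual concept.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is near zero when each term's text embedding points the same way as its paired visual embedding and grows large when the pairs are mismatched, which is the behavior a better-aligned encoder is rewarded for.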

This review was generated by AI and checked by a human editor.