QUICK REVIEW

[Paper Review] Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset

Louis Blankemeier, Ashwin Kumar|arXiv (Cornell University)|Jun 10, 2024

Medical Image Segmentation Techniques | Citations: 9

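One-Sentence Summary

Merlin is a 3D vision-language foundation model trained on paired CT scans, EHR codes, and radiology reports. It handles both zero-shot and adapted tasks, including 3D segmentation and radiology report generation, and its training strategy runs on a single GPU.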

ABSTRACT

The large volume of abdominal computed tomography (CT) scans coupled with the shortage of radiologists have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports. However, current medical VLMs are generally limited to 2D images and short reports. Here to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data and radiology reports. This approach is enabled by a multistage pretraining framework that does not require additional manual annotations. We trained Merlin using a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes) and radiology reports (>6 million tokens). We comprehensively evaluated Merlin on 6 task types and 752 individual tasks that covered diagnostic, prognostic and quality-related tasks. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes) and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated high generalization across institutions and anatomies. Merlin outperformed 2D VLMs, CT foundation models and off-the-shelf radiology models. We also release our trained models, code, and dataset, available at: https://github.com/StanfordMIMI/Merlin.

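Motivation and Objectives

Develop a 3D vision-language model for abdominal CT interpretation, motivated by the workload on radiologists and intended to ease it.
Use structured EHR data and unstructured radiology reports to supervise a CT-based VLM without additional manual annotation.
Evaluate Merlin on a broader set of tasks to demonstrate its zero-shot and adapted capabilities, and analyze its data requirements.
Show that Merlin can be trained on a single GPU and that it generalizes to external datasets and public CT datasets.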

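Proposed Method

Train Merlin on paired CT scans (more than 6 million images from 15,331 CT scans), EHR diagnosis codes (more than 1.8 million codes), and radiology reports (more than 6 million tokens).
Supervise with diagnosis codes through a binary cross-entropy loss and with radiology reports through an InfoNCE loss, training in either a staged or a multi-task fashion.
Process the entire 3D CT volume end to end using a 3D backbone with I3D-based initialization.
Evaluate zero-shot findings classification, phenotype classification, and zero-shot cross-modal retrieval without adaptation, and evaluate adapted tasks including 5-year chronic disease prediction, radiology report generation, and 3D semantic segmentation.
Run ablations over backbone choice, data type (EHR vs. reports), report splitting, and training strategy (staged vs. multi-task).
Derive data scaling laws that relate pretraining dataset size to performance, and demonstrate training on a single GPU.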
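
The two training signals listed above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a CLIP-style symmetric InfoNCE, a temperature of 0.07, and plain Python lists as embeddings, and the function names `binary_cross_entropy` and `info_nce` are hypothetical.

```python
import math


def binary_cross_entropy(logits, labels):
    """Mean BCE over multi-label targets (one logit per diagnosis code)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax_at(scores, idx):
    # log-softmax of scores[idx], using the max-shift trick for stability
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - lse


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/report embeddings.

    Pair k is (image_embs[k], text_embs[k]); all other pairings in the
    batch act as negatives. The symmetric form and the temperature value
    are assumptions of this sketch.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every image and every text
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    loss_i2t = -sum(_log_softmax_at(sim[k], k) for k in range(n)) / n
    loss_t2i = -sum(
        _log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Under these assumptions, a multi-task setup would optimize the sum of both losses in each step, while a staged setup would optimize one loss first and the other afterwards; the review does not state how the two are weighted.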

Figure 1: Overview of Merlin training and evaluation. (a) Merlin training strategy. Diagnosis codes from the EHR are used as labels for Merlin training, with a binary cross entropy loss. Radiology reports are also used for training, with an InfoNCE loss (Oord et al., 2018). Training with diagnosi

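Experimental Results

Research Questions

RQ1: Can a 3D vision-language model trained on both structured EHR data and unstructured radiology reports outperform task-specific baselines in abdominal CT interpretation?
RQ2: How do data type (EHR codes vs. radiology reports), model architecture, and training strategy affect the zero-shot and adapted task performance of a 3D medical VLM?
RQ3: How much pretraining data does Merlin need to reach a target downstream performance, and how does single-GPU training affect practicality and generalization?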

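Key Findings

Merlin achieves strong zero-shot F1 across 31 findings, averaging 0.741 on internal data and 0.647 on external data, and outperforms OpenCLIP and BioMedCLIP (p < 0.001).
Merlin achieves a macro-AUROC of 0.812 across 692 phenotypes, with AUROC above 0.85 for 258 phenotypes and above 0.9 for 102 phenotypes.
Zero-shot cross-modal retrieval (image-to-findings and image-to-impression) outperforms baselines on internal data and generalizes to external data 5 to 7 times better than the external baselines.
For 5-year multi-disease prediction, Merlin achieves an AUROC of 0.757 with 100% of downstream labels and 0.708 with 10% of labels, outperforming ImageNet-pretrained models.
Merlin improves 3D semantic segmentation, particularly for small and complex organs, and remains robust in data-scarce settings (10% of the data).
Radiology report generation with Merlin outperforms RadFM on multiple metrics and produces structured reports aligned with anatomical sections.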
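
For readers unfamiliar with the macro-AUROC figure quoted above, the following sketch shows one common way to compute it: AUROC is calculated separately for each phenotype and the values are averaged with equal weight. The data are toy values, the function names are hypothetical, and the paper's exact evaluation code may differ.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as 0.5 (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(scores_by_class, labels_by_class):
    """Unweighted mean of per-class AUROC; also returns the per-class values."""
    per_class = [auroc(s, y) for s, y in zip(scores_by_class, labels_by_class)]
    return sum(per_class) / len(per_class), per_class


def count_above(per_class, threshold):
    """Number of classes whose AUROC strictly exceeds the threshold."""
    return sum(1 for a in per_class if a > threshold)


# Toy example with two phenotypes and four scans each (made-up numbers)
toy_scores = [[0.9, 0.8, 0.2, 0.1], [0.6, 0.4, 0.7, 0.3]]
toy_labels = [[1, 1, 0, 0], [1, 0, 0, 1]]
macro, per_class = macro_auroc(toy_scores, toy_labels)
```

Counts such as "above 0.85 for 258 phenotypes" correspond to applying `count_above` to the per-class list at the given threshold.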

Figure 2: Zero-shot findings classification. (a) Depicts how zero-shot classification is performed, where text embeddings from disease presence prompts and disease absence prompts are compared to the image embedding. (b) We compare the performance of OpenCLIP (Cherti et al., 2022), BioMedCLIP Zhang
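
Panel (a) of Figure 2 describes comparing presence-prompt and absence-prompt text embeddings against the image embedding. A minimal sketch of that decision rule follows, assuming cosine similarity and pre-computed embeddings; the vectors are toy values and `zero_shot_finding` is a hypothetical name, not part of the released Merlin code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_finding(image_emb, presence_emb, absence_emb):
    """Predict a finding as present when the image embedding is closer to
    the presence prompt than to the absence prompt.

    Returns (predicted_present, margin), where margin is the similarity
    difference and can serve as a ranking score.
    """
    s_present = cosine(image_emb, presence_emb)
    s_absent = cosine(image_emb, absence_emb)
    return s_present > s_absent, s_present - s_absent


# Toy 2-D embeddings (made-up values standing in for encoder outputs)
image = [1.0, 0.2]
presence_prompt = [0.9, 0.1]   # stands in for a "finding present" prompt
absence_prompt = [0.1, 0.9]    # stands in for a "finding absent" prompt
predicted, margin = zero_shot_finding(image, presence_prompt, absence_prompt)
```

No task-specific training is involved in this rule: only the prompts change from one finding to the next, which is what makes the classification zero-shot.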

This review was generated by AI and checked by a human editor.