QUICK REVIEW

[Paper Review] M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Fan Bai, Yuxin Du|arXiv (Cornell University)|Mar 31, 2024

Medical Imaging and Analysis | Citations: 15

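One-Sentence Summary

This paper builds M3D-Data, a large-scale 3D multi-modal medical dataset; introduces M3D-LaMed, a versatile 3D MLLM; and proposes M3D-Bench, a benchmark covering eight tasks. It reports strong results on 3D image-text retrieval, report generation, VQA, positioning, and segmentation.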

ABSTRACT

Medical image analysis is essential to clinical diagnosis and treatment, which is increasingly supported by multi-modal large language models (MLLMs). However, previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information. This paper aims to advance 3D medical image analysis with MLLMs. To this end, we present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs specifically tailored for various 3D medical tasks, such as image-text retrieval, report generation, visual question answering, positioning, and segmentation. Additionally, we propose M3D-LaMed, a versatile multi-modal large language model for 3D medical image analysis. Furthermore, we introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks. Through comprehensive evaluation, our method proves to be a robust model for 3D medical image analysis, outperforming existing solutions. All code, data, and models are publicly available at: https://github.com/BAAI-DCAI/M3D.

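Motivation and Objectives

Advance 3D medical image analysis with multi-modal large language models (MLLMs).
Build a large-scale 3D medical dataset and a matching benchmark that support robust multi-modal tasks.
Develop a general-purpose 3D MLLM that handles retrieval, report generation, VQA, positioning, and segmentation.
Enable automatic evaluation through an LLM-based benchmark.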

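Proposed Method

Pretrain the 3D vision encoder from scratch with a cross-modal (image-text) loss on M3D-Cap.
Introduce a 3D spatial pooling perceiver that reduces the number of visual tokens and aligns their embeddings with the LLM.
Connect the pretrained LLaMA-2-7B LLM through the 3D perceiver and fine-tune the model end to end.
Incorporate a promptable segmentation module (SegVol) to enable 3D vision-language segmentation.
Use LoRA for parameter-efficient fine-tuning so that prior knowledge is preserved.
Evaluate eight tasks through M3D-Bench, including image-text retrieval, report generation, VQA, positioning, and segmentation.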
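
To make the token-reduction step concrete, the following is a minimal sketch of spatial pooling over a 3D grid of visual tokens. It is not the M3D-LaMed implementation: the function name `spatial_pool_3d`, the 4 x 4 x 4 grid, the kernel size of 2, and the choice of average pooling are all assumptions made for illustration, and the projection that aligns the pooled tokens with the LLM is omitted.

```python
from typing import List, Tuple

Token = List[float]


def spatial_pool_3d(tokens: List[Token], grid: Tuple[int, int, int],
                    kernel: int = 2) -> List[Token]:
    """Average-pool a flattened D x H x W grid of token vectors.

    Hypothetical helper: each non-overlapping kernel^3 block of
    neighbouring tokens is replaced by its element-wise mean.
    """
    d, h, w = grid
    if len(tokens) != d * h * w:
        raise ValueError("token count must equal D * H * W")
    if d % kernel or h % kernel or w % kernel:
        raise ValueError("each grid dimension must be divisible by the kernel")

    dim = len(tokens[0])
    pooled: List[Token] = []
    for z0 in range(0, d, kernel):
        for y0 in range(0, h, kernel):
            for x0 in range(0, w, kernel):
                acc = [0.0] * dim
                # Sum every token inside the current block.
                for z in range(z0, z0 + kernel):
                    for y in range(y0, y0 + kernel):
                        for x in range(x0, x0 + kernel):
                            tok = tokens[(z * h + y) * w + x]
                            for i, v in enumerate(tok):
                                acc[i] += v
                pooled.append([v / kernel ** 3 for v in acc])
    return pooled


# Toy input (assumed sizes): 64 tokens of dimension 2 on a 4 x 4 x 4 grid.
grid = (4, 4, 4)
tokens = [[float(i), 1.0] for i in range(64)]
pooled = spatial_pool_3d(tokens, grid, kernel=2)
print(len(tokens), "->", len(pooled))  # token count shrinks by kernel^3
```

With a kernel of 2 the sequence passed on to the LLM is 8 times shorter, and each pooled token still summarizes a contiguous sub-volume, so the spatial layout of the scan is preserved.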

Figure 1 : The generation pipelines for M3D-Data. (a) In the VQA data generation pipeline, we employ LLM to generate five types of questions from medical reports using a prompt-based method. Subsequently, we eliminate dirty data through self-filtering and check the test set by LLM and experts, achie

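Experimental Results

Research Questions

RQ1: Can 3D medical images be analyzed effectively by a multi-modal LLM that uses a 3D vision encoder and a 3D perceiver?
RQ2: How does the large-scale 3D multi-modal medical dataset (M3D-Data) support diverse tasks (retrieval, report generation, VQA, positioning, segmentation)?
RQ3: How well does M3D-LaMed perform on the eight tasks compared with existing baselines?
RQ4: Does the promptable segmentation module enable referring expression segmentation on 3D medical images?
RQ5: Can LLMs provide automatic evaluation of 3D tasks through the LLM-based benchmark (M3D-Bench)?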

Key Findings

Methods	Test samples	IR R@1	IR R@5	IR R@10	TR R@1	TR R@5	TR R@10
PMC-CLIP	100	9.00	28.00	45.00	18.00	47.00	59.00
PMC-CLIP	500	4.40	12.80	18.80	7.60	20.20	31.00
PMC-CLIP	1000	1.90	7.60	12.10	4.60	13.00	19.80
PMC-CLIP	2000	1.15	4.35	7.60	3.15	8.55	13.55
Our	100	64.00	95.00	99.00	70.00	95.00	98.00
Our	500	39.60	76.20	87.20	40.40	74.20	87.00
Our	1000	27.30	61.10	76.10	26.60	61.80	75.30
Our	2000	19.10	47.45	62.25	18.45	47.30	62.15
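
The R@1, R@5, and R@10 columns are recall-at-K scores: the share of queries whose correct match appears among the K highest-ranked candidates. The sketch below shows one common way to compute such a score from a similarity matrix. It assumes that query i matches candidate i and that scores are reported as percentages; the function name `recall_at_k` and the toy similarity values are made up for illustration and are not taken from the paper or its code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose matching item ranks in the top k.

    Assumes similarity[q][j] scores query q against candidate j and that
    the correct candidate for query q has the same index q.
    """
    hits = 0
    for q, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 4 x 4 similarity matrix (made-up values).
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.3],
    [0.5, 0.6, 0.7, 0.2],
]
for k in (1, 2, 4):
    print(f"R@{k} = {recall_at_k(sim, k):.2f}")
```

Because every query competes against the whole candidate pool, recall at a fixed K tends to fall as the pool grows, which matches the drop in the table from 100 to 2000 test samples.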

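M3D-Data contains 120K 3D image-text pairs and 662K instruction-response pairs and supports eight tasks.
M3D-LaMed outperforms previous 3D MLLMs across a broad range of tasks, including retrieval, VQA, positioning, and segmentation.
3D image-text retrieval substantially outperforms the 2D baseline (PMC-CLIP) at every test-set size; for example, with 2000 test samples IR R@1 rises from 1.15 to 19.10 and TR R@1 from 3.15 to 18.45.
Report generation outperforms RadFM on BLEU, ROUGE, METEOR, BERT-Score, and LLM-based evaluation.
VQA (closed-ended and open-ended) and positioning show large improvements, and ablations indicate the importance of visual pretraining, spatial pooling, the MLP design, and an unfrozen vision encoder.
Segmentation (semantic and referring expression) outperforms previous methods and achieves referring expression segmentation in 3D.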
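
Report-generation metrics such as BLEU and ROUGE compare the words of a generated report with those of a reference report. As a rough illustration of the idea, the sketch below computes a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is an assumption-laden toy: it splits on whitespace, applies no stemming, and is not the scoring code used for M3D-Bench; the function name `unigram_f1` and the example sentences are invented.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 between two texts (ROUGE-1 style)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences, not drawn from M3D-Data.
reference = "no focal lesion in the liver"
candidate = "no lesion in the liver"
print(round(unigram_f1(candidate, reference), 3))
```

Word-overlap scores reward matching vocabulary but can miss clinically important differences in meaning, which is one reason to pair them with embedding-based or LLM-based evaluation as the review describes.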

Figure 2 : The data statistics of M3D-VQA on five question types. What, which, and where are 3 typical questions. Samples of 5 topics are displayed in word clouds.

This review was generated by AI and checked by a human editor.