QUICK REVIEW

[Paper Review] Multi-Modal Foundation Models for Computational Pathology: A Survey

Dong Li, Guihong Wan|ArXiv.org|Mar 12, 2025

AI in cancer detection | Citations: 3

One-Line Summary

This survey reviews 32 multi-modal foundation models for computational pathology, classifies them into three paradigms (vision-language, vision-knowledge graph, and vision-gene expression), and analyzes 28 related datasets.

ABSTRACT

Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (H&E) stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.

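Motivation and Objectives

Survey and organize the landscape of multi-modal foundation models for computational pathology (MMFM4CPath).
Classify the models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression.
Distinguish non-LLM-based from LLM-based approaches among vision-language models.
Catalog pathology-specific multi-modal datasets and outline their characteristics and uses.
Provide a taxonomy of downstream tasks, training strategies, and evaluation considerations.
Highlight the challenges and future directions for scalable, interpretable pathology AI.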

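Proposed Method

A systematic classification of 32 MMFM4CPath models into three paradigms: Vision-Language (V-L), Vision-Knowledge Graph (V-KG), and Vision-Gene Expression (V-GE).
A subdivision of vision-language models into non-LLM-based and LLM-based approaches.
A comprehensive analysis of pretraining objectives and strategies (SSL, CLIP/CoCa, CMA, PKE, NWP, instruction tuning, RL).
A collection of 28 pathology-specific multi-modal datasets, grouped into image-text pairs, instruction datasets, and image-other modality pairs.
A taxonomy of six downstream task categories (classification, retrieval, generation, segmentation, prediction, VQA) together with evaluation considerations.
A discussion of open challenges and future research directions, including spatial omics integration and standardized benchmarks.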
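
The list above names CLIP/CoCa among the pretraining objectives. As a rough illustration of what a CLIP-style objective computes, the following is a minimal pure-Python sketch of a symmetric contrastive loss over paired image and text embeddings. It is not taken from the survey: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions made for illustration, and real models compute this on large batches of learned embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image_embs[i] and text_embs[i] are assumed to be a matched pair.

    The temperature value is an illustrative assumption, not a figure from the survey.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by the temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: each row should put its mass on the diagonal entry
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same criterion applied to the columns
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy batch: two hypothetical tile embeddings and their caption embeddings
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired captions yield a much smaller loss than the swapped captions, which is the signal that pulls matching image and text embeddings together during pretraining.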

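Results

Research Questions

RQ1: What MMFM4CPath architectures exist, and how do they differ across the vision-language, vision-knowledge graph, and vision-gene expression paradigms?
RQ2: How are pathology-specific datasets organized for multi-modal learning, and which training and evaluation strategies are adopted?
RQ3: What pretraining objectives and fine-tuning approaches, including LLM-based extensions, do MMFM4CPath models use?
RQ4: What opportunities and challenges lie ahead for developing and benchmarking MMFM4CPath?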

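Key Findings

The survey comprehensively organizes 32 state-of-the-art MMFM4CPath models across three paradigms: vision-language, vision-knowledge graph, and vision-gene expression.
It analyzes 28 pathology datasets, grouping them into image-text pairs, instruction datasets, and image-other modality pairs.
Vision-language models split into non-LLM-based and LLM-based approaches and adopt pretraining objectives such as CLIP/CoCa and instruction tuning.
LLM-based vision-language models gain generative and conversational pathology capabilities through fine-tuning and alignment with large language models.
Vision-knowledge graph and vision-gene expression models integrate structured domain knowledge and molecular data to improve interpretability and biological insight.
The survey provides a taxonomy of downstream tasks, training and evaluation strategies, and future research directions.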

This review was generated by AI and checked by a human editor.