QUICK REVIEW

[Paper Review] Multi-Modal Foundation Models for Computational Pathology: A Survey

Dong Li, Guihong Wan | ArXiv.org | Mar 12, 2025

AI in cancer detection | Cited by 3

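One-Sentence Summary

This survey reviews 32 multi-modal foundation models for computational pathology, groups them into three paradigms (vision-language, vision-knowledge graph, and vision-gene expression), and analyzes 28 related datasets.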

ABSTRACT

Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (H&E) stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.

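Motivation and Objectives

Systematically survey and organize the landscape of multi-modal foundation models for computational pathology (MMFM4CPath).
Group the models into three main paradigms: vision-language, vision-knowledge graph, and vision-gene expression.
Distinguish non-LLM-based from LLM-based approaches within vision-language models.
Catalog pathology-specific multi-modal datasets and outline their characteristics and uses.
Provide a taxonomy of downstream tasks, training strategies, and evaluation considerations.
Highlight the challenges and future directions for scalable, interpretable pathology AI.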

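Proposed Approach

Systematically classifies 32 MMFM4CPath models into three paradigms: vision-language (V-L), vision-knowledge graph (V-KG), and vision-gene expression (V-GE).
Subdivides vision-language models into non-LLM-based and LLM-based categories.
Analyzes pre-training objectives and strategies across the models (SSL, CLIP/CoCa, CMA, PKE, NWP, instruction tuning, RL).
Compiles and categorizes 28 pathology-specific multi-modal datasets (image-text pairs, instruction datasets, and image-other modality pairs).
Organizes downstream tasks into six categories (classification, retrieval, generation, segmentation, prediction, and VQA) and discusses evaluation considerations.
Discusses challenges and future research directions, including spatial-omics integration and standardized benchmarks.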
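
Of the pre-training objectives listed above, CLIP-style contrastive learning is the easiest to illustrate compactly. The block below is a minimal NumPy sketch of a symmetric contrastive loss over a batch of paired image and text embeddings. The function names, the temperature of 0.07, and the random stand-in embeddings are illustrative assumptions; they are not taken from the survey or from any specific model it covers.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair
    (for example, a tile and its caption); all other rows act as negatives.
    The temperature value is an assumed default, not a surveyed setting.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature                  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


# Hypothetical data: random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
tiles = rng.normal(size=(8, 32))      # stand-in for tile embeddings
captions = rng.normal(size=(8, 32))   # stand-in for unrelated text embeddings

aligned_loss = clip_contrastive_loss(tiles, tiles)     # perfectly matched pairs
random_loss = clip_contrastive_loss(tiles, captions)   # unrelated pairs
print(f"aligned: {aligned_loss:.4f}  random: {random_loss:.4f}")
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are unrelated, which is the signal a contrastive objective uses to pull matched image-text pairs together.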

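Results

Research Questions

RQ1: What MMFM4CPath architectures exist, and how do they differ across the vision-language, vision-knowledge graph, and vision-gene expression paradigms?
RQ2: How are pathology-specific datasets organized for multi-modal learning, and which training and evaluation strategies are used?
RQ3: What are the current pre-training objectives and fine-tuning methods for MMFM4CPath, including LLM extensions?
RQ4: What opportunities and challenges exist for the future development and benchmarking of MMFM4CPath?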

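Key Findings

The survey catalogs 32 recent MMFM4CPath models across three paradigms: vision-language, vision-knowledge graph, and vision-gene expression.
It analyzes 28 available pathology datasets and groups them into image-text pairs, instruction datasets, and image-other modality pairs.
Vision-language models fall into non-LLM-based and LLM-based categories and use a range of pre-training objectives, such as CLIP/CoCa, along with instruction tuning.
LLM-based vision-language models are aligned with large language models through fine-tuning, which gives them pathology generation and conversational capabilities.
Vision-knowledge graph and vision-gene expression models integrate structured domain knowledge or molecular data to improve interpretability and biological insight.
The survey provides a taxonomy of downstream tasks and discusses training and evaluation strategies as well as future research directions.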

This review was generated by AI and reviewed by human editors.