QUICK REVIEW

[Paper Review] Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images

Ming Y. Lu, Bowen Chen|arXiv (Cornell University)|Jun 13, 2023

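AI in cancer detection | Cited by 10

ONE-SENTENCE SUMMARY

MI-Zero aligns visual and language encoders and applies multiple instance learning to achieve zero-shot transfer on gigapixel whole slide histopathology images, reaching an average median zero-shot accuracy of 70.2% across three cancer subtyping tasks.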

ABSTRACT

Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and have been designed to perform downstream tasks involving only small to medium sized-images, neither of which are applicable to the emerging field of computational pathology where there are limited publicly available paired image-text datasets and each image can span up to 100,000 x 100,000 pixels. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pre-train our text encoder. By effectively leveraging strong pre-trained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://github.com/mahmoodlab/MI-Zero.

RESEARCH MOTIVATION AND OBJECTIVES

Address the lack of large paired image-text data in pathology for zero-shot transfer.
Leverage contrastively aligned image-text encoders to operate on gigapixel WSIs.
Formulate zero-shot WSI classification via a multiple instance learning framework.
Demonstrate performance across multiple cancer subtyping tasks using in-domain text data.

PROPOSED METHOD

Pretrain a domain-specific text encoder (HistPathGPT) on over 550k pathology reports and PubMed abstracts.
Use a state-of-the-art histopathology image encoder (CTP) or alternatives pre-trained on histology patches.
Align image and text embeddings with a cross-modal contrastive loss (i2t and t2i directions) in a 512-dimensional latent space.
Split WSIs into patches (instances), compute patch embeddings, and compute class scores via cosine similarity to prompt embeddings.
Aggregate patch scores using permutation-invariant pooling (mean or topK) or spatially smoothed graph-based pooling to obtain slide-level predictions.
For zero-shot classification, embed a text prompt for each class and predict the class whose prompt has the highest pooled image-text similarity.
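
As a rough illustration of the alignment and inference steps listed above, the following pure-Python sketch shows a symmetric (i2t + t2i) contrastive loss and zero-shot slide classification by pooling patch-to-prompt cosine similarities. It is not the authors' implementation: the function names, the temperature of 0.07, the equal weighting of the two loss directions, the reading of topK as "average the k highest patch scores per class", and the toy 2-D embeddings are all assumptions made for illustration (the latent space described above is 512-dimensional). The graph-based spatial smoothing variant is left out.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(rows, cols):
    """Cosine similarity between every vector in rows and every vector in cols."""
    rows = [l2_normalize(r) for r in rows]
    cols = [l2_normalize(c) for c in cols]
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]

def symmetric_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Mean of image-to-text and text-to-image cross-entropy; pair i matches pair i.
    The temperature and the 50/50 weighting are assumed values."""
    sims = cosine_matrix(img_embs, txt_embs)
    logits = [[s / temperature for s in row] for row in sims]

    def cross_entropy(mat):
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]  # negative log-probability of the matched pair
        return total / len(mat)

    i2t = cross_entropy(logits)                               # image -> text
    t2i = cross_entropy([list(col) for col in zip(*logits)])  # text -> image
    return 0.5 * (i2t + t2i)

def pool_scores(scores, method="topk", k=1):
    """Aggregate (n_patches x n_classes) patch scores into one score per class."""
    pooled = []
    for j in range(len(scores[0])):
        column = sorted((row[j] for row in scores), reverse=True)
        if method == "topk":
            column = column[:min(k, len(column))]  # keep the k highest patch scores
        elif method != "mean":
            raise ValueError(f"unknown pooling method: {method}")
        pooled.append(sum(column) / len(column))
    return pooled

def predict_slide(patch_embs, prompt_embs, method="topk", k=1):
    """Return (predicted class index, pooled per-class scores) for one slide."""
    scores = cosine_matrix(patch_embs, prompt_embs)
    pooled = pool_scores(scores, method=method, k=k)
    return max(range(len(pooled)), key=pooled.__getitem__), pooled

# Toy slide: three background-like patches and one patch that matches class 1.
background = [12.0, 5.0]
class1_like = [0.0, 1.0]
patches = [background, background, background, class1_like]
prompts = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical prompt embeddings for classes 0 and 1

topk_pred, topk_scores = predict_slide(patches, prompts, method="topk", k=1)
mean_pred, mean_scores = predict_slide(patches, prompts, method="mean")
print("topK:", topk_pred, [round(s, 3) for s in topk_scores])
print("mean:", mean_pred, [round(s, 3) for s in mean_scores])
```

On this toy slide, top-1 pooling predicts class 1 because a single strongly matching patch decides the score, while mean pooling predicts class 0 because the background patches dominate the average. This only illustrates how the two pooling rules can disagree; it says nothing about which one performs better on real slides.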

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can zero-shot transfer be effectively applied to gigapixel histopathology WSIs using MIL-based aggregation?
RQ2: Does domain-specific text pretraining (HistPathGPT) improve zero-shot WSI classification compared to non-domain text models?
RQ3: What is the impact of pooling strategy (mean vs. topK) and spatial smoothing on zero-shot WSI performance?
RQ4: How do pretraining data scale and modality pairing affect zero-shot accuracy across the BRCA, NSCLC, and RCC subtyping tasks?

KEY FINDINGS

Model	Text encoder (pretraining)	Spatial smoothing (SS)	Pooling	BRCA	NSCLC	RCC	Average
ABMIL (1% Data)	None	✗	attention	0.510	0.709	0.557	0.592
ABMIL (100% Data)	None	✗	attention	0.843	0.893	0.855	0.864
MI-Zero (Ours)	HistPathGPT (None)	✗	topK	0.625	0.680	0.653	0.653
MI-Zero (Ours)	HistPathGPT (In-domain)	✗	topK	0.673	0.700	0.733	0.702
MI-Zero (Ours)	PubMedBert (Out-of-domain)	✗	topK	0.570	0.693	0.777	0.680
MI-Zero (Ours)	BioclinicalBert (Out-of-domain)	✗	topK	0.660	0.742	0.697	0.700
MI-Zero (Ours)	HistPathGPT (None)	✓	topK	0.623	0.700	0.653	0.659
MI-Zero (Ours)	HistPathGPT (In-domain)	✓	topK	0.615	0.705	0.733	0.684
MI-Zero (Ours)	PubMedBert (Out-of-domain)	✓	topK	0.577	0.725	0.760	0.688
MI-Zero (Ours)	BioclinicalBert (Out-of-domain)	✓	topK	0.660	0.770	0.663	0.698
MI-Zero (Ours)	HistPathGPT (None)	✗	mean	0.655	0.593	0.577	0.608
MI-Zero (Ours)	HistPathGPT (In-domain)	✗	mean	0.620	0.590	0.633	0.614
MI-Zero (Ours)	PubMedBert (Out-of-domain)	✗	mean	0.585	0.650	0.727	0.654
MI-Zero (Ours)	BioclinicalBert (Out-of-domain)	✗	mean	0.672	0.680	0.543	0.632

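MI-Zero with the in-domain HistPathGPT text encoder and topK pooling achieves the best average median zero-shot accuracy, 70.2% (0.702), across the three subtyping tasks.
TopK pooling generally outperforms mean pooling for zero-shot WSI classification: without spatial smoothing, the average is higher with topK for all four text encoders, although mean pooling is ahead on BRCA for three of them.
In-domain pretraining of the text encoder improves the average over the same encoder without pretraining (0.702 vs. 0.653 with topK pooling), but the margin over the out-of-domain encoders is small (BioclinicalBert reaches 0.700).
Pairing a pretrained image encoder (CTP) with a pretrained text encoder yields the best overall performance among the Table 1 configurations.
Zero-shot MI-Zero is competitive with the supervised ABMIL baseline trained on 1% of the labeled data and exceeds it on average (0.702 vs. 0.592), while ABMIL trained on 100% of the data remains well ahead (0.864).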
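
The pooling and baseline comparisons above come down to simple arithmetic on the table values. The short sketch below recomputes them; it only reuses numbers from the table (rows without spatial smoothing), and the variable names are illustrative.

```python
# Average accuracy per text encoder, copied from the results table
# (MI-Zero rows without spatial smoothing).
topk_avg = {
    "HistPathGPT (None)": 0.653,
    "HistPathGPT (In-domain)": 0.702,
    "PubMedBert (Out-of-domain)": 0.680,
    "BioclinicalBert (Out-of-domain)": 0.700,
}
mean_avg = {
    "HistPathGPT (None)": 0.608,
    "HistPathGPT (In-domain)": 0.614,
    "PubMedBert (Out-of-domain)": 0.654,
    "BioclinicalBert (Out-of-domain)": 0.632,
}

# Gap between topK and mean pooling for each text encoder.
gaps = {name: round(topk_avg[name] - mean_avg[name], 3) for name in topk_avg}
for name, gap in gaps.items():
    print(f"{name}: topK - mean = {gap:+.3f}")

# The Average column is the mean of the three per-task scores (BRCA, NSCLC, RCC).
best_tasks = [0.673, 0.700, 0.733]  # HistPathGPT (In-domain), topK, no smoothing
best_avg = sum(best_tasks) / len(best_tasks)

# Supervised ABMIL baselines from the same table.
abmil_1pct_avg = 0.592
abmil_full_avg = 0.864
print(f"best zero-shot average: {best_avg:.3f}")
print(f"margin over ABMIL (1% data): {best_avg - abmil_1pct_avg:+.3f}")
print(f"gap to ABMIL (100% data): {best_avg - abmil_full_avg:+.3f}")
```

Because the table reports rounded values, a recomputed average can differ from the printed Average column in the third decimal place.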

This review was generated by AI and reviewed by a human editor.