QUICK REVIEW

[Paper Review] Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology

Eric Zimmermann, Julian Viret|arXiv (Cornell University)|Feb 25, 2026

AI in cancer detection | Cited by 0

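One-Sentence Summary

This paper proposes a region-level mixed-magnification encoder that fuses tile embeddings from multiple magnifications and is pretrained with masked embedding modeling (MEM), optionally extended with CMEM. On biomarker prediction across cancer types, it improves over single-magnification baselines.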

ABSTRACT

In recent years, a standard computational pathology workflow has emerged where whole slide images are cropped into tiles, these tiles are processed using a foundation model, and task-specific models are built using the resulting representations. At least 15 different foundation models have been proposed, and the vast majority are trained exclusively with tiles using the 20$\times$ magnification. However, it is well known that certain histologic features can only be discerned with larger context windows and require a pathologist to zoom in and out when analyzing a whole slide image. Furthermore, creating 224$\times$224 pixel crops at 20$\times$ leads to a large number of tiles per slide, which can be gigapixel in size. To more accurately capture multi-resolution features and investigate the possibility of reducing the number of representations per slide, we propose a region-level mixing encoder. Our approach jointly fuses image tile representations of a mixed magnification foundation model using a masked embedding modeling pretraining step. We explore a design space for pretraining the proposed mixed-magnification region aggregators and evaluate our models on transfer to biomarker prediction tasks representing various cancer types. Results demonstrate cancer dependent improvements in predictive performance, highlighting the importance of spatial context and understanding.

Research Motivation and Objectives

Motivate the use of region-level mixed magnification representations to capture multi-scale histologic features beyond a fixed magnification.
Develop a region mixing encoder that aggregates embeddings from multiple magnifications into region-level representations.
Investigate self-supervised pretraining strategies (masked embedding modeling and optional contrastive alignment) to enhance transfer to biomarker prediction tasks.
Evaluate different aggregation strategies (contextualized vs compressed region embeddings) with AB-MIL on seven biomarker tasks across cancer types.

Proposed Method

Define a region mixing encoder that consumes an ordered sequence of tile embeddings from multiple magnifications within a spatial region.
Pretrain using masked embedding modeling (MEM) to reconstruct masked region embeddings with a region-aware weighting across magnifications.
Optionally extend MEM with a contrastive alignment (CMEM) to encourage invariance across context augmentations.
Aggregate region embeddings with an attention-based MIL (AB-MIL) to produce slide-level predictions.
Compare contextualized region embeddings (all tokens) with compressed region embeddings (class tokens) for downstream tasks.
Use AUROC to evaluate fine-tuned models on seven MSK-IMPACT biomarker prediction tasks.
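
The masking step behind MEM can be illustrated with a small sketch. This is not the authors' implementation: the zero mask vector, the plain mean-squared-error loss, the uniform weighting across tokens, and the names `mask_tokens` and `masked_mse` are all assumptions made for illustration, and the encoder is replaced by a stand-in that echoes its input.

```python
import random

def mask_tokens(tokens, removal_ratio, rng):
    """Replace a random subset of tokens with a zero 'mask' vector.

    Returns the corrupted sequence and the sorted masked indices.
    The zero mask vector is an assumption; a learned mask token is also common.
    """
    n_masked = int(len(tokens) * removal_ratio)
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    masked_set = set(masked_idx)
    dim = len(tokens[0])
    corrupted = [
        [0.0] * dim if i in masked_set else list(tok)
        for i, tok in enumerate(tokens)
    ]
    return corrupted, masked_idx

def masked_mse(predicted, target, masked_idx):
    """Mean squared error averaged over the masked positions only."""
    if not masked_idx:
        return 0.0
    total = 0.0
    for i in masked_idx:
        sq = sum((p - t) ** 2 for p, t in zip(predicted[i], target[i]))
        total += sq / len(target[i])
    return total / len(masked_idx)

rng = random.Random(0)
# Hypothetical region: 8 tile embeddings of dimension 4.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
corrupted, masked_idx = mask_tokens(tokens, removal_ratio=0.5, rng=rng)

# Stand-in for the region encoder: it echoes the corrupted input, so the
# loss measures how far the zero mask vectors are from the true embeddings.
predicted = corrupted
loss = masked_mse(predicted, tokens, masked_idx)
print(len(masked_idx), loss)
```

In an actual pretraining loop the stand-in would be replaced by the region mixing encoder, and the loss would be minimized so that masked embeddings are reconstructed from the surrounding multi-magnification context.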

Experimental Results

Research Questions

RQ1: Does region-level mixed magnification representation learning improve biomarker prediction across diverse tissue types compared to single-magnification baselines?
RQ2: What is the impact of MEM vs CMEM pretraining on region-level embeddings for downstream biomarker tasks?
RQ3: How do contextualized (patch) versus compressed (CLS) region embeddings fare when integrated with AB-MIL for WSI-level predictions?
RQ4: What are the effects of removal ratio and source context size on pretraining effectiveness?
RQ5: Can mixed magnification representations reduce sequence length while maintaining or improving performance?

Key Findings

Pretraining with MEM or MEM+CMEM improves AUROC on average over baselines and randomly initialized models.
Contextualized region embeddings (patch tokens) generally outperform compressed embeddings (CLS tokens) in AUROC.
MEM-based pretraining yields the strongest average gains across biomarkers and magnifications, with MEM at a 50% removal ratio particularly recommended.
CMEM shows less consistent gains and can underperform, especially with CLS token representations.
Across tasks, no single setting is universally best, but MEM consistently improves over AB-MIL at 20x and other baselines, and MEM with 50% masking provides notable gains.
Avoiding overly long sequences via region-based mixing reduces computational burden while preserving accuracy.
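
To make the comparison between contextualized and compressed embeddings concrete, the sketch below pools both kinds of bag with a generic attention-based MIL layer of the form a_k ∝ exp(wᵀ tanh(V h_k)). It is an illustrative sketch, not the paper's code: the layout of a region output as one class token followed by patch tokens, the random weights `V` and `w`, and all sizes are assumptions.

```python
import math
import random

def abmil_pool(bag, V, w):
    """Attention pooling: softmax over scores w^T tanh(V h_k), then a weighted sum."""
    scores = []
    for h in bag:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    attn = [e / norm for e in exps]
    dim = len(bag[0])
    pooled = [sum(a * h[j] for a, h in zip(attn, bag)) for j in range(dim)]
    return pooled, attn

rng = random.Random(0)
dim, hidden_dim = 4, 3
n_regions, patches_per_region = 5, 6  # hypothetical sizes

def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Assumed layout of each region's output: [class token, patch tokens...].
regions = [[rand_vec(dim) for _ in range(1 + patches_per_region)]
           for _ in range(n_regions)]

# Contextualized bag: every patch token from every region.
contextualized = [tok for region in regions for tok in region[1:]]
# Compressed bag: one class token per region.
compressed = [region[0] for region in regions]

V = [rand_vec(dim) for _ in range(hidden_dim)]
w = rand_vec(hidden_dim)

slide_ctx, attn_ctx = abmil_pool(contextualized, V, w)
slide_cls, attn_cls = abmil_pool(compressed, V, w)
print(len(contextualized), len(compressed))  # 30 5
```

Under these assumed sizes the compressed bag is six times shorter than the contextualized one, which is the kind of sequence-length trade-off the findings weigh against the AUROC advantage of patch tokens.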


This review was generated by AI and reviewed by human editors.