QUICK REVIEW

[Paper Review] Equi-ViT: Rotational Equivariant Vision Transformer for Robust Histopathology Analysis

Fuyao Chen, Yuexi Du|arXiv (Cornell University)|Jan 14, 2026

AI in cancer detection | Cited by 0

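One-Sentence Summary

Equi-ViT introduces Gaussian Mixture Ring convolution (GMR-Conv) into the ViT patch embedding to obtain rotation-equivariant representations, and on a colorectal cancer dataset it is more robust to rotation than both the standard ViT and E(2) ViT.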

ABSTRACT

Vision Transformers (ViTs) have gained rapid adoption in computational pathology for their ability to model long-range dependencies through self-attention, addressing the limitations of convolutional neural networks that excel at local pattern capture but struggle with global contextual reasoning. Recent pathology-specific foundation models have further advanced performance by leveraging large-scale pretraining. However, standard ViTs remain inherently non-equivariant to transformations such as rotations and reflections, which are ubiquitous variations in histopathology imaging. To address this limitation, we propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture, imparting built-in rotational equivariance to learned representations. Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations. Our results on a public colorectal cancer dataset demonstrate that incorporating equivariant patch embedding enhances data efficiency and robustness, suggesting that equivariant transformers could potentially serve as more generalizable backbones for the application of ViT in histopathology, such as digital pathology foundation models.

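Motivation and Objectives

Improve robustness, motivated by the fact that pathology images can appear in arbitrary orientations.
Develop a ViT backbone whose patch embedding has built-in rotation and reflection equivariance.
Evaluate rotation robustness and data efficiency on a public colorectal cancer dataset.
Compare against non-equivariant ViTs and state-of-the-art equivariant patch embedding methods.
Assess the computational efficiency and parameter count of the proposed embedding.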

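Proposed Method

Replace the ViT patch embedding with a two-stage GMR-Conv-based embedding to obtain rotation and reflection equivariance.
Use the Hugging Face ViT-Base model as the Transformer backbone for classification.
Train with AdamW, cosine annealing, a learning rate of 5e-5, 10 epochs, a batch size of 64, and cross-entropy loss.
Evaluate on the original test set and on rotated test sets (0–90° increments), reporting the mean and standard deviation.
Compare against non-equivariant ViTs and other equivariant methods (E(2) ViT, GMR-R18, and others).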
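The training recipe above names a cosine-annealed learning rate starting at 5e-5 over 10 epochs. The sketch below works out what such a schedule could look like using the standard closed-form cosine annealing formula. The review does not state the step granularity, the minimum rate, or whether warmup is used, so the per-epoch update, `ETA_MIN = 0`, and the absence of warmup are assumptions, and `cosine_annealing_lr` is an illustrative helper rather than code from the paper.

```python
import math

# Hyperparameters quoted in the review.
BASE_LR = 5e-5
EPOCHS = 10
BATCH_SIZE = 64

# Assumptions (not stated in the review): the rate is updated once per
# epoch, decays towards ETA_MIN = 0, and there is no warmup phase.
ETA_MIN = 0.0


def cosine_annealing_lr(epoch: int, base_lr: float = BASE_LR,
                        total_epochs: int = EPOCHS,
                        eta_min: float = ETA_MIN) -> float:
    """Closed form: eta_min + (base_lr - eta_min) * (1 + cos(pi * t / T)) / 2."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Learning rate used during each of the 10 training epochs.
schedule = [cosine_annealing_lr(epoch) for epoch in range(EPOCHS)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

Under these assumptions the rate starts at 5e-5, passes through half that value at the midpoint, and keeps shrinking until the last epoch. In a PyTorch training loop the same shape would typically come from pairing the optimizer with a cosine annealing scheduler instead of computing it by hand.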

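Experimental Results

Research Questions

RQ1 Can a GMR-Conv-based patch embedding give ViT features rotation and reflection equivariance from the tokenization stage onward?
RQ2 Does Equi-ViT improve rotation-consistent classification performance over the standard ViT and existing equivariant ViTs?
RQ3 What is the trade-off between Equi-ViT's model size, memory footprint, and rotation robustness?
RQ4 How does the equivariance of Equi-ViT's patch embedding affect token-level alignment under image rotation?
RQ5 How data-efficient is the method on pathology tasks compared with CNN-based equivariant models?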
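RQ1 and RQ4 both depend on what equivariance means for a convolutional patch embedding: rotating the input and then embedding it should give the same result as embedding it and then rotating the output. The toy sketch below illustrates that property for 90° rotations and horizontal flips only. It is not GMR-Conv: `SYMMETRIC_KERNEL` is simply a hand-written 3×3 kernel that is unchanged by 90° rotations and mirror flips, and all names here are hypothetical.

```python
def rot90(grid):
    """Rotate a square 2D list by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def flip_lr(grid):
    """Mirror a 2D list left to right."""
    return [list(reversed(row)) for row in grid]


def correlate_same(image, kernel):
    """Zero-padded 2D cross-correlation with same-size output."""
    n, k = len(image), len(kernel)
    pad = k // 2
    out = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            total = 0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < n and 0 <= cc < n:
                        total += kernel[i][j] * image[rr][cc]
            out[r][c] = total
    return out


def is_rotation_equivariant(kernel, image):
    """Check conv(rotate(x)) == rotate(conv(x)) for a 90-degree rotation."""
    return correlate_same(rot90(image), kernel) == rot90(correlate_same(image, kernel))


def is_reflection_equivariant(kernel, image):
    """Check conv(flip(x)) == flip(conv(x)) for a horizontal flip."""
    return correlate_same(flip_lr(image), kernel) == flip_lr(correlate_same(image, kernel))


# Toy kernel that is invariant to 90-degree rotations and mirror flips.
SYMMETRIC_KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
# Kernel with no such symmetry, standing in for an unconstrained convolution.
GENERIC_KERNEL = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Small integer test image so comparisons are exact.
IMAGE = [[4 * r + c for c in range(4)] for r in range(4)]

for name, kernel in [("symmetric", SYMMETRIC_KERNEL), ("generic", GENERIC_KERNEL)]:
    print(name,
          "rotation:", is_rotation_equivariant(kernel, IMAGE),
          "reflection:", is_reflection_equivariant(kernel, IMAGE))
```

The symmetric kernel passes both checks and the generic kernel fails both, which is the kind of token-level consistency that RQ4 asks about. Handling arbitrary rotation angles, as the paper targets, requires more than this 90° toy case.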

Key Findings

Architecture	Model	Params	Memory	Original accuracy	Rotated accuracy (mean ± std)
CNN	R18	11.2M	3.4G	93.7	87.3 ± 5.1
CNN	E(2)-WRN16	10.8M	20.9G	93.8	92.5 ± 3.5
CNN	GMR-R18	3.9M	6.2G	95.6	95.2 ± 0.2
ViT	ViT	85M	10.8G	88.2	83.1 ± 6.9
ViT	Conv ViT	87M	11.0G	84.8	77.6 ± 7.3
ViT	E(2) ViT	94M	28.4G	85.5	74.5 ± 5.1
ViT	Equi-ViT	86M	10.9G	87.0	86.8 ± 0.6
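One way to read the table is to look at how much accuracy each model loses when moving from the original test set to the rotated ones. The sketch below computes that gap from the numbers reported above. The `rotation_gap` helper is illustrative, and treating the accuracies as percentages (so the gap is in percentage points) is an assumption, since the table does not state units.

```python
# (model, original accuracy, rotated mean, rotated std) copied from the table.
RESULTS = [
    ("R18", 93.7, 87.3, 5.1),
    ("E(2)-WRN16", 93.8, 92.5, 3.5),
    ("GMR-R18", 95.6, 95.2, 0.2),
    ("ViT", 88.2, 83.1, 6.9),
    ("Conv ViT", 84.8, 77.6, 7.3),
    ("E(2) ViT", 85.5, 74.5, 5.1),
    ("Equi-ViT", 87.0, 86.8, 0.6),
]


def rotation_gap(original: float, rotated_mean: float) -> float:
    """Accuracy lost under rotation, rounded to the table's one-decimal precision."""
    return round(original - rotated_mean, 1)


gaps = {name: rotation_gap(orig, rot) for name, orig, rot, _std in RESULTS}

# List models from most to least rotation-stable.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<11} gap = {gap:.1f}")
```

By this measure Equi-ViT and GMR-R18 lose the least accuracy under rotation (0.2 and 0.4), while E(2) ViT loses the most (11.0).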

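Equi-ViT reaches 86.8 ± 0.59 accuracy on the rotated test sets, outperforming the standard ViT (83.1 ± 6.93) and Conv ViT (77.6 ± 7.32).
Equi-ViT also exceeds E(2) ViT (74.5 ± 5.1) in rotation robustness on this dataset.
The GMR-Conv patch embedding achieves near-perfect token alignment across rotation angles, whereas the token features of the standard ViT do not.
Equi-ViT's embedding module contains 0.79M parameters (3.0 MB), making it more memory-efficient than the Conv ViT embedding (2.4M parameters, 9.1 MB).
An ablation study shows that the [6, 11] GMR-Conv configuration gives the best rotated-test performance (86.8 ± 0.59) compared with other kernel configurations and a plain Conv embedding.
Equi-ViT does not yet surpass CNN-based equivariant models in overall accuracy, possibly because ViTs place higher demands on data volume and parameterization.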
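The embedding sizes quoted above can be sanity-checked with simple arithmetic. The sketch below converts parameter counts into storage size, assuming the weights are 32-bit floats and that "MB" means 2**20 bytes. Neither assumption is stated in the review, and `param_memory_mib` is an illustrative helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats
MIB = 1024 ** 2        # assumption: "MB" in the review means 2**20 bytes


def param_memory_mib(num_params: float) -> float:
    """Storage needed for num_params float32 weights, in MiB."""
    return num_params * BYTES_PER_FLOAT32 / MIB


# Parameter counts quoted in the findings.
equi_vit_embed = param_memory_mib(0.79e6)
conv_vit_embed = param_memory_mib(2.4e6)

print(f"Equi-ViT embedding: {equi_vit_embed:.2f} MiB")
print(f"Conv ViT embedding: {conv_vit_embed:.2f} MiB")
print(f"Size ratio: {conv_vit_embed / equi_vit_embed:.2f}x")
```

Under these assumptions the estimates come out at roughly 3.0 and 9.2, close to the reported 3.0 MB and 9.1 MB. The small difference is what one would expect from the parameter counts being rounded.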


This review was generated by AI and reviewed by a human editor.