QUICK REVIEW

[Paper Digest] Soft Equivariance Regularization for Invariant Self-Supervised Learning

Joohyung Lee, Changhun Kim|arXiv (Cornell University)|Mar 4, 2026
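Domain Adaptation and Few-Shot Learning | Cited by 0
One-Sentence Summary

SER adds a layer-decoupled soft equivariance regularizer to invariance-based SSL backbones without any auxiliary transformation head, improving ImageNet-1k linear evaluation and robustness. It applies analytically specified group actions to an intermediate feature map while leaving the objective on the final embedding unchanged.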

ABSTRACT

Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions $\rho_g$ applied directly in feature space. SER learns/predicts no per-sample transformation codes/labels, requires no auxiliary transformation-prediction head, and adds only 1.008x training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselines improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance.
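
For reference, the two properties the abstract contrasts can be written in their standard form. This is a sketch in assumed notation rather than the paper's exact formulation: $x$ is an input image, $g \in \mathcal{G}$ a geometric transform, $f$ the full encoder producing the final embedding, and $f^{(1)}$ the early stage producing the intermediate spatial token map.

$$
\text{invariance: } f(g \cdot x) = f(x), \qquad \text{equivariance: } f^{(1)}(g \cdot x) = \rho_g\, f^{(1)}(x).
$$

"Soft" here means the second identity is encouraged through a loss term rather than guaranteed by the architecture, so it may hold only approximately after training.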

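Research Motivation and Goals

  • Motivate and quantify the trade-off that arises when invariance and equivariance are imposed on the same final representation.
  • Propose SER to decouple where invariance and equivariance are enforced in ViT-based self-supervised learning.
  • Provide a simple, scalable regularizer that operates in feature space and needs no transformation labels.
  • Show that layer decoupling improves performance across several invariance-based SSL backbones.
  • Demonstrate that the method generalizes to robustness and transfer benchmarks.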

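Proposed Method

  • Take an intermediate spatial token map from the ViT for the equivariance term, while the final embedding is still trained with the standard invariance-based SSL objective.
  • Apply analytically specified feature-space actions ρ_g (rotation, flip, scaling) to the intermediate token map.
  • Define L_equiv as a patch-level NT-Xent-style contrastive loss over spatial positions, using the relative geometric transform g = g2 g1^{-1}.
  • Split each batch into b1 (baseline invariance views) and b2 (equivariant views without cropping; photometric jitter is kept).
  • Train f = f^(2) ∘ f^(1); the CLS token is inserted after the equivariance-regularized layer so that the spatial map stays intact for the equivariance loss.
  • Combine the losses as L = L_inv1 + L_inv2 + λ L_equiv, where L_inv1 and L_inv2 are the standard SSL loss applied to b1 and b2.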
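
The patch-level contrastive term in the list above can be illustrated with a small, dependency-free sketch. It assumes the two token maps have already been aligned by the relative transform and flattened, so that index i in both lists refers to the same spatial position. The function names, the cosine similarity, the choice of negatives (all other positions of the same image), and the temperature of 0.2 are illustrative assumptions, not details confirmed by the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def patch_nt_xent(tokens_a, tokens_b, temperature=0.2):
    """NT-Xent-style loss over spatial positions (hypothetical helper).

    tokens_a[i] and tokens_b[i] are features at the same position after
    alignment (the positive pair); every other position in tokens_b is
    treated as a negative for tokens_a[i].
    """
    assert len(tokens_a) == len(tokens_b)
    total = 0.0
    for i, anchor in enumerate(tokens_a):
        logits = [cosine(anchor, other) / temperature for other in tokens_b]
        # Numerically stable log-sum-exp over all candidate positions.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(tokens_a)

# Toy check: three orthogonal tokens, first aligned, then cyclically shifted.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = tokens[1:] + tokens[:1]
loss_aligned = patch_nt_xent(tokens, tokens)
loss_shifted = patch_nt_xent(tokens, shifted)
print(loss_aligned < loss_shifted)
```

The loss is small when each position matches its counterpart and large when the maps are spatially misaligned, which is the signal that pushes the intermediate features toward equivariance.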
Figure 1: Overview of SER. For each image in $b_{2}$ , we sample two views from the equivariant-view policy $\mathcal{T}_{\mathrm{eq}}$ . We decompose each sampled transform into a geometric component $g\in\mathcal{G}$ and a photometric component (e.g., color jitter), and denote by $g_{1},g_{2}\in\m
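
To make the geometric components g1 and g2 concrete, here is a minimal sketch of an analytically specified action on a token grid, together with the relative transform g = g2 g1^{-1} used in the method. It covers only 90-degree rotations and horizontal flips (the dihedral group D4) and treats the action as a pure permutation of spatial positions; scaling is omitted, and the `(k, f)` encoding and all function names are assumptions made for illustration.

```python
def rot90(grid):
    """Rotate an H x W grid of tokens by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]

def hflip(grid):
    """Mirror the grid left-to-right."""
    return [list(row[::-1]) for row in grid]

def act(grid, g):
    """Apply g = (k, f): a horizontal flip if f == 1, then k quarter turns."""
    k, f = g
    out = hflip(grid) if f else [list(row) for row in grid]
    for _ in range(k % 4):
        out = rot90(out)
    return out

def compose(g, h):
    """Return g * h, the element that applies h first and then g."""
    (a, b), (c, d) = g, h
    sign = -1 if b else 1  # moving a flip past a rotation inverts the rotation
    return ((a + sign * c) % 4, (b + d) % 2)

def inverse(g):
    """Inverse element: reflections are self-inverse, rotations negate."""
    k, f = g
    return (k % 4, 1) if f else ((-k) % 4, 0)

def relative(g1, g2):
    """Relative transform g2 * g1^{-1} that maps view 1's grid onto view 2's."""
    return compose(g2, inverse(g1))

# Toy check on a 3 x 3 grid whose entries stand in for token features.
grid = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
g1, g2 = (1, 0), (2, 1)
view1, view2 = act(grid, g1), act(grid, g2)
print(act(view1, relative(g1, g2)) == view2)
```

Because only the relative transform is needed, the two views can be aligned without ever recovering the untransformed image.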

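Experimental Results

Research Questions

  • RQ1: Does decoupling invariance and equivariance across layers improve ImageNet-1k linear evaluation compared with end-to-end equivariance methods?
  • RQ2: Does layer-decoupled SER improve robustness and spatial transfer (e.g., ImageNet-C/P, frozen-backbone COCO detection) while maintaining or improving accuracy?
  • RQ3: Is equivariance regularization beneficial across several invariance-based SSL backbones (MoCo-v3, DINO, Barlow Twins) without an extra transformation head?
  • RQ4: Where in the network should equivariance be introduced to get the best trade-off between equivariance and discriminative power?
  • RQ5: Does the layer-decoupling strategy generalize to improving other invariance+equivariance baselines?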

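Key Findings

  • Under a strictly matched two-view setting, SER improves ImageNet-1k linear accuracy over strong invariance-based SSL baselines (e.g., +0.84 Top-1 for MoCo-v3).
  • Enforcing equivariance on an intermediate spatial representation improves robustness (ImageNet-C/P: +1.11/+1.22 Top-1) and frozen-backbone COCO detection (+1.7 mAP).
  • Moving the equivariance objective to an intermediate layer, i.e. applying the same layer decoupling, also improves existing invariance+equivariance methods (e.g., EquiMod, AugSelf).
  • The gains from +SER hold across SSL backbones (MoCo-v3, DINO, Barlow Twins), require no architectural changes, and add little compute (1.008x training FLOPs).
  • There is a sweet spot for where the equivariance loss is applied and where [CLS] is inserted; pushing the equivariance loss too deep lowers linear-evaluation accuracy.
  • Layer decoupling is proposed as a general design principle for combining invariance and equivariance in SSL.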
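
The placement question in the findings above is easier to picture with a toy forward pass. The sketch below only illustrates the control flow of f = f^(2) ∘ f^(1) with late CLS insertion; the two block functions are hypothetical stand-ins for transformer blocks, and nothing here reflects the paper's actual layer indices or implementation.

```python
def scale_block(tokens):
    """Stand-in for a per-token block: doubles every feature."""
    return [[2.0 * v for v in t] for t in tokens]

def mean_mix_block(tokens):
    """Stand-in for a global mixing block: adds the mean token to each token."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[v + m for v, m in zip(t, mean)] for t in tokens]

def forward(patch_tokens, early_blocks, late_blocks, cls_token):
    """Toy f = f2(f1(x)) with the CLS token inserted between the stages."""
    x = patch_tokens
    for block in early_blocks:   # f1: operates on spatial tokens only
        x = block(x)
    spatial_map = x              # where the equivariance loss would attach
    x = [cls_token] + x          # CLS joins only after the regularized layer
    for block in late_blocks:    # f2: operates on CLS plus spatial tokens
        x = block(x)
    embedding = x[0]             # CLS output, used by the base SSL loss
    return spatial_map, embedding

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
spatial_map, embedding = forward(patches, [scale_block], [mean_mix_block], [0.0, 0.0])
print(len(spatial_map), embedding)
```

Moving the split point deeper means more blocks are shaped by the equivariance term and fewer are left to build the invariant embedding, which is one way to read the reported drop in linear evaluation when the loss is pushed too deep.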
Figure 2: An overview of the training pipeline. The mini-batch is split into $b_{1}$ and $b_{2}$ : $b_{1}$ uses the baseline SSL augmentation policy $\mathcal{T}$ (including cropping), while $b_{2}$ uses an equivariant-view policy $\mathcal{T}_{\mathrm{eq}}$ that disables cropping and adds discrete
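
The batch split and the combined objective L = L_inv1 + L_inv2 + λ L_equiv can be sketched in a few lines. The split fraction, the default λ, and both function names are hypothetical; the source does not state the values used.

```python
def split_batch(batch, eq_fraction=0.5):
    """Split a mini-batch into b1 (baseline views) and b2 (equivariant views).

    eq_fraction is an assumed knob for the share of samples sent to b2.
    """
    n_eq = int(len(batch) * eq_fraction)
    cut = len(batch) - n_eq
    return batch[:cut], batch[cut:]

def total_loss(l_inv1, l_inv2, l_equiv, lam=1.0):
    """Combine the base SSL losses on b1 and b2 with the weighted equivariance term."""
    return l_inv1 + l_inv2 + lam * l_equiv

# b1 would be augmented with the baseline policy (cropping included),
# b2 with the equivariant-view policy (no cropping, photometric jitter kept).
b1, b2 = split_batch(list(range(8)), eq_fraction=0.25)
print(len(b1), len(b2), total_loss(1.0, 2.0, 0.5, lam=0.5))
```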


This digest was generated by AI and reviewed by a human editor.