QUICK REVIEW

[Paper Review] Soft Equivariance Regularization for Invariant Self-Supervised Learning

Joohyung Lee, Changhun Kim|arXiv (Cornell University)|Mar 4, 2026

Domain Adaptation and Few-Shot Learning | Cited by 0

One-Sentence Summary

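SER adds layer-decoupled soft equivariance regularization to invariance-based SSL backbones without any extra transformation head, improving ImageNet-1k linear evaluation and robustness. It applies analytic group actions to an intermediate feature map while leaving the objective on the final embedding unchanged.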

ABSTRACT

Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions $\rho_g$ applied directly in feature space. SER learns/predicts no per-sample transformation codes/labels, requires no auxiliary transformation-prediction head, and adds only 1.008x training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselines improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance.

Motivation and Objectives

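Quantify the trade-off that arises when invariance and equivariance are imposed on the same final representation.
Propose SER to decouple where invariance and equivariance are enforced in ViT-based self-supervised learning.
Provide a simple, scalable regularizer that acts in feature space and needs no transformation labels.
Show that layer decoupling improves performance across several invariance-based SSL backbones.
Show that the method generalizes to robustness and transfer benchmarks.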

Proposed Method

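Expose an intermediate spatial token map inside the ViT, and train the final embedding with the standard invariance-based SSL objective.
Apply analytically specified feature-space actions $\rho_g$ (rotations, flips, scaling) to the intermediate token map.
Define $\mathcal{L}_{\mathrm{equiv}}$ as a patch-level, NT-Xent-style contrastive loss over spatial positions, using the relative geometric transform $g = g_2 g_1^{-1}$.
Split each mini-batch into $b_1$ (baseline invariance views) and $b_2$ (equivariant views with cropping disabled and photometric jitter retained).
Train $f = f^{(2)} \circ f^{(1)}$; insert the CLS token after the equivariance-regularized layer so that the spatial map stays intact for equivariance learning.
Combine the losses as $\mathcal{L} = \mathcal{L}_{\mathrm{inv1}} + \mathcal{L}_{\mathrm{inv2}} + \lambda \mathcal{L}_{\mathrm{equiv}}$, where $\mathcal{L}_{\mathrm{inv1}}$ and $\mathcal{L}_{\mathrm{inv2}}$ are the standard SSL loss applied to $b_1$ and $b_2$, respectively.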
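
The group action and the patch-level loss in the steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the group is restricted to 90-degree rotations and horizontal flips (scaling is omitted), the `(k, flip)` encoding and the names `act`, `inverse`, `compose`, and `patch_nt_xent` are hypothetical, negatives are taken to be the other positions of the same token map, and the temperature of 0.2 is a placeholder.

```python
import numpy as np

# Hypothetical encoding of a group element: (k, flip) means
# "flip horizontally if flip == 1, then rotate by k * 90 degrees".

def act(g, fmap):
    """Apply rho_g to a square token map of shape (H, W, C) by permuting positions."""
    k, flip = g
    if flip:
        fmap = fmap[:, ::-1, :]
    return np.rot90(fmap, k=k, axes=(0, 1))

def inverse(g):
    """Inverse element: reflections are their own inverse, rotations negate k."""
    k, flip = g
    return (k, 1) if flip else ((-k) % 4, 0)

def compose(g2, g1):
    """Element equivalent to applying g1 first, then g2."""
    (k2, f2), (k1, f1) = g2, g1
    if f2:
        # A flip conjugates a rotation into its inverse.
        return ((k2 - k1) % 4, 1 - f1)
    return ((k2 + k1) % 4, f1)

def patch_nt_xent(z1, z2, g, temperature=0.2):
    """InfoNCE over spatial positions: rho_g(z1)[p] should match z2[p]."""
    a = act(g, z1).reshape(-1, z1.shape[-1])       # (N, C), N = H * W
    b = z2.reshape(-1, z2.shape[-1])               # (N, C)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives sit on the diagonal

# Toy usage: pretend the encoder is exactly equivariant, so the two
# feature maps are the transformed versions of one underlying map.
features = np.random.default_rng(0).normal(size=(4, 4, 32))
g1, g2 = (1, 0), (2, 1)
g_rel = compose(g2, inverse(g1))                   # g = g2 g1^{-1}
loss = patch_nt_xent(act(g1, features), act(g2, features), g_rel)
```

Because the action is only a permutation of token positions, it needs no learned parameters, which is consistent with the summary's point that SER requires no transformation-prediction head.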

Figure 1: Overview of SER. For each image in $b_{2}$ , we sample two views from the equivariant-view policy $\mathcal{T}_{\mathrm{eq}}$ . We decompose each sampled transform into a geometric component $g\in\mathcal{G}$ and a photometric component (e.g., color jitter), and denote by $g_{1},g_{2}\in\m

Experimental Results

Research Questions

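RQ1: Does decoupling invariance and equivariance across layers improve ImageNet-1k linear evaluation relative to end-to-end equivariance methods?
RQ2: Can layer-decoupled SER improve robustness and spatial transfer (e.g., ImageNet-C/P, frozen-backbone COCO detection) while maintaining or improving accuracy?
RQ3: Does equivariance regularization help across several invariance-based SSL backbones (MoCo-v3, DINO, Barlow Twins) without an extra transformation head?
RQ4: Where in the network should equivariance be introduced to get the best trade-off between equivariance and discriminative power?
RQ5: Does the layer-decoupling strategy generalize to improving other invariance+equivariance baselines?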

Key Findings

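Under a strictly matched two-view setting, SER consistently improves ImageNet-1k linear accuracy over strong invariance-based SSL baselines (e.g., +0.84 Top-1 for MoCo-v3).
Enforcing equivariance on an intermediate spatial representation improves robustness (+1.11/+1.22 Top-1 on ImageNet-C/P) and frozen-backbone COCO detection (+1.7 mAP).
Moving the equivariance objective to an intermediate layer, i.e., applying layer decoupling, improves existing invariance+equivariance methods (e.g., EquiMod, AugSelf).
The gains from +SER hold across several backbones (MoCo-v3, DINO, Barlow Twins), require no architectural modification, and add little compute (1.008x training FLOPs).
There is a sweet spot for where the equivariance loss is applied and where [CLS] is inserted; pushing the equivariance loss too deep degrades linear evaluation.
Layer decoupling is proposed as a general design principle for combining invariance and equivariance in SSL.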
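
The placement question in the findings above comes down to one hyperparameter: how many blocks run before the equivariance loss is attached and the CLS token is inserted. The sketch below is a toy illustration only. `make_block` is a hypothetical stand-in for a transformer block (a residual per-token map with mean mixing, not real attention), and `forward` and `split` are names introduced here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim):
    """Hypothetical stand-in for a transformer block (no real attention)."""
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    # Adding the token mean lets information flow between tokens.
    return lambda tokens: tokens + np.tanh((tokens + tokens.mean(axis=0)) @ w)

def forward(patch_tokens, blocks, cls_token, split):
    """Run f2 after f1 and expose the spatial map where L_equiv would attach.

    patch_tokens: (N, C) patch embeddings with N a perfect square.
    split: number of blocks in f1, i.e. the depth of the equivariance loss.
    """
    x = patch_tokens
    for blk in blocks[:split]:                    # f1 sees patch tokens only
        x = blk(x)
    side = int(np.sqrt(x.shape[0]))
    spatial_map = x.reshape(side, side, -1)       # input to the equivariance loss
    x = np.concatenate([cls_token[None, :], x], axis=0)  # CLS joins after f1
    for blk in blocks[split:]:                    # f2 sees CLS plus patches
        x = blk(x)
    return spatial_map, x[0]                      # (H, W, C) map, CLS embedding

# Toy usage with 4 blocks, 16 patch tokens, and the split after block 2.
blocks = [make_block(8) for _ in range(4)]
tokens = rng.normal(size=(16, 8))
spatial_map, embedding = forward(tokens, blocks, np.zeros(8), split=2)
```

Since the CLS token is appended only after `f1`, the intermediate map is a pure grid of patch tokens, so a position-permuting action can be applied to it directly; a larger `split` pushes the equivariance loss deeper, which the findings report as harmful to linear evaluation past a certain depth.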

Figure 2: An overview of the training pipeline. The mini-batch is split into $b_{1}$ and $b_{2}$ : $b_{1}$ uses the baseline SSL augmentation policy $\mathcal{T}$ (including cropping), while $b_{2}$ uses an equivariant-view policy $\mathcal{T}_{\mathrm{eq}}$ that disables cropping and adds discrete

This review was generated by AI and checked by a human editor.