QUICK REVIEW

[Paper Review] Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

Scott Lowe, Anthony Fuller | arXiv (Cornell University) | Mar 16, 2026

Domain Adaptation and Few-Shot Learning | Cited by 0

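One-Sentence Summary

Bootleg trains a ViT by self-distilling multiple hidden layers of a teacher network, yielding representations that outperform MAE and I-JEPA and improving downstream task performance. It uses masked patches and targets drawn from several teacher layers to encourage multiple levels of abstraction.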

ABSTRACT

The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.

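Research Motivation and Goals

Bridge the gap between generative (pixel reconstruction) and predictive (embedding distillation) self-supervised learning methods.
Introduce a multi-layer self-distillation objective that uses hidden-layer targets from a teacher network.
Stabilize self-supervised training by grounding the targets in less-processed representations that span early to deep layers.
Demonstrate improved downstream performance on image classification and semantic segmentation.
Explore how the choice of target layers and masking strategy affects stability and performance.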

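Proposed Method

Bootleg uses a ViT-based encoder-predictor architecture with an EMA teacher, following the I-JEPA framework.
A subset of image patches is masked with four rectangular regions to create the learning targets.
Embeddings are extracted from multiple hidden layers of the EMA teacher, spread across the encoder depth, and z-score standardized to form the targets.
The student encoder is trained, through a dedicated predictor module, to predict the concatenated latent targets at the masked positions.
Latent embeddings from multiple blocks are concatenated into the distillation target to maximize the diversity of abstraction levels.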
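
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, rectangle size range, EMA momentum, per-token z-scoring axis, and the MSE loss are all assumptions chosen for illustration, and random arrays stand in for real teacher activations.

```python
import numpy as np

def rect_mask(grid_h, grid_w, num_rects=4, rng=None, min_side=2, max_side=6):
    """Boolean (grid_h, grid_w) mask, True where a patch is masked.
    The mask is the union of num_rects random rectangles (sizes are assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for _ in range(num_rects):
        h = int(rng.integers(min_side, max_side + 1))
        w = int(rng.integers(min_side, max_side + 1))
        top = int(rng.integers(0, grid_h - h + 1))
        left = int(rng.integers(0, grid_w - w + 1))
        mask[top:top + h, left:left + w] = True
    return mask

def zscore(x, eps=1e-6):
    """Standardize each token embedding over its feature axis (assumed axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def build_targets(teacher_hiddens, layer_ids, mask):
    """Concatenate z-scored teacher hidden states at the masked positions.
    teacher_hiddens: list of (num_patches, dim) arrays, one per block."""
    flat = mask.reshape(-1)
    chosen = [zscore(teacher_hiddens[i])[flat] for i in layer_ids]
    return np.concatenate(chosen, axis=-1)

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the student (momentum is assumed)."""
    for name in teacher:
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student[name]

# Toy usage: a 14x14 patch grid, 12 blocks, 8-dimensional embeddings.
rng = np.random.default_rng(0)
mask = rect_mask(14, 14, rng=rng)
hiddens = [rng.normal(size=(14 * 14, 8)) for _ in range(12)]
targets = build_targets(hiddens, layer_ids=[3, 7, 11], mask=mask)
predictions = np.zeros_like(targets)  # stand-in for the predictor output
loss = float(np.mean((predictions - targets) ** 2))  # MSE is an assumption
```

Because the targets are concatenated along the feature axis, the predictor's output width grows with the number of selected teacher layers, while the number of predicted rows equals the number of masked patches.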

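Experimental Results

Research Questions

RQ1: Can self-distillation from the teacher's hidden layers produce better SSL representations than final-layer targets?
RQ2: When targeting multiple hidden layers, which layers and which masking strategies give the best performance?
RQ3: Under similar compute constraints, does Bootleg improve downstream tasks (classification, segmentation) relative to MAE and I-JEPA?
RQ4: How do target-construction choices (which layers, how many, and how they are combined) affect stability and performance?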

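Key Findings

Bootleg outperforms the baseline methods (e.g., about +10% over I-JEPA on ImageNet-1K and iNaturalist-21 classification).
Distilling targets from multiple hidden layers yields stronger representations than using only input pixels or only final-layer embeddings.
Combining four rectangular masks with targets distributed across multiple layers gives stable training and outperforms both MAE's uniform random masking and single-target I-JEPA variants.
Targeting every fourth block across the depth and concatenating the resulting hidden-layer representations consistently improves frozen-probe accuracy and segmentation metrics.
Bootleg improves linear, CLS, and X-Blk probe performance on IN-1k, iNat21, ADE20K, and Cityscapes, with especially pronounced gains at smaller model sizes.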
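
To make the "every fourth block" choice concrete, the short sketch below computes which blocks would be selected and how wide the concatenated target becomes for standard ViT sizes. The helper names are hypothetical, and the indexing convention (1-based, starting at block 4 and including the final block when the depth is a multiple of 4) is an assumption, since this summary does not state it.

```python
def every_kth_block(depth, k=4):
    """1-based indices k, 2k, ... up to depth (indexing convention assumed)."""
    return list(range(k, depth + 1, k))

def target_dim(depth, width, k=4):
    """Width of the concatenated target: one embedding per selected block."""
    return len(every_kth_block(depth, k)) * width

# Standard ViT-B (12 blocks, width 768) and ViT-L (24 blocks, width 1024).
vit_b_blocks = every_kth_block(12)   # [4, 8, 12]
vit_b_dim = target_dim(12, 768)      # 3 * 768 = 2304
vit_l_dim = target_dim(24, 1024)     # 6 * 1024 = 6144
```

Under this convention the target width scales with depth, so a deeper encoder asks the predictor for a wider output per masked patch.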


This review was generated by AI and checked by a human editor.