QUICK REVIEW

[Paper Digest] What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Moritz Pawlowsky, Antonis Vamvakeros|arXiv (Cornell University)|Mar 17, 2026

Advanced Electron Microscopy Techniques and Applications | Cited by 0

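One-Sentence Summary

The paper shows that DINOv2-style ViTs carry a strong positional bias in their features, and introduces ALiBi-based finetuning (ALiBi-Dv2) to produce homogeneous features that reduce this bias while preserving semantics, which benefits segmentation and weakly supervised tasks.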

ABSTRACT

Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.

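Motivation and Objectives

Identify and quantify positional bias in the ViT features of self-supervised models.
Show that ALiBi-based finetuning reduces positional bias while preserving semantic content.
Show that ALiBi-Dv2 maintains or improves segmentation performance on standard benchmarks (VOC, ADE20K).
Demonstrate the benefit of homogeneous features for trainable segmentation of materials microscopy images.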

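Proposed Method

Train linear probes that map ViT features to 2D ramp functions, and quantify the positional bias of each channel.
Finetune DINOv2 checkpoints to use a 2D-aware ALiBi positional encoding with cylindrical boundaries and length normalisation, using the frozen original embeddings as the training target, to enable length generalisation.
Compare ALiBi-Dv2 against NoPE and other baselines on segmentation benchmarks (VOC07, VOC12, ADE20K) and on trainable segmentation of microscopy images.
Evaluate feature homogeneity via PCA visualisation, cosine similarity, and K-means decomposition across multiple datasets.
Apply ALiBi-Dv2 features to weakly supervised tasks (K-means clustering) and to trainable segmentation to assess practical impact.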
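
To make the second step more concrete, here is a minimal sketch of a 2D ALiBi-style attention bias over a patch grid. It is not the authors' implementation. The function names (`alibi_slopes`, `alibi_bias_2d`) are hypothetical, and three choices are assumptions made for illustration only: Euclidean distance between patches, a horizontal axis that wraps around (one possible reading of "cylindrical boundaries"), and division by the longer grid side (one possible reading of "length normalisation"). The per-head slopes follow the geometric sequence from the original ALiBi recipe.

```python
import math


def alibi_slopes(num_heads):
    """Geometric per-head slopes as in the original ALiBi recipe: 2^(-8*(h+1)/H)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias_2d(height, width, num_heads, wrap_x=True, normalise=True):
    """Return bias[h][i][j] = -slope_h * dist(patch i, patch j), patches in row-major order.

    ASSUMPTIONS (illustrative, not taken from the paper):
      * Euclidean distance between patch coordinates,
      * the x axis wraps around when wrap_x is True (a cylinder),
      * distances are divided by max(height, width) when normalise is True.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    scale = float(max(height, width)) if normalise else 1.0

    dist = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            dy = abs(r1 - r2)
            dx = abs(c1 - c2)
            if wrap_x:
                # On a cylinder the shorter way round may cross the seam.
                dx = min(dx, width - dx)
            row.append(math.hypot(dx, dy) / scale)
        dist.append(row)

    # One bias matrix per head; larger distances get a more negative bias.
    return [[[-s * d for d in row] for row in dist] for s in alibi_slopes(num_heads)]


bias = alibi_bias_2d(height=4, width=4, num_heads=2)
```

In a standard ALiBi setup, a bias of this kind is added to the attention logits before the softmax, so attention depends only on relative patch distance and not on absolute position. Whether ALiBi-Dv2 uses exactly this distance, wrap-around, or normalisation is not stated in this digest.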

Figure 2: Linear probe analysis of DINOv2-S features. (a) We train linear probes to map from image features (or individual channels) to randomly sampled (red squares) ramp functions, reporting $R^{2}$ scores on holdout regions. Per-channel scores and predictions (which use all channels) are both ave

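Experimental Results

Research Questions

RQ1: Do ViT features contain a linear, easily decodable positional bias across architectures and self-supervised objectives?
RQ2: Can ALiBi positional encoding produce homogeneous features without sacrificing semantic content?
RQ3: Do ALiBi-enhanced features maintain or improve segmentation performance on standard benchmarks, and do they improve weakly supervised segmentation of materials images?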

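Key Findings

Positional bias is widespread in ViT features: it appears as linear ramps in many channels, across layers, and across models (including DINOv2 and DINOv3), but is weaker in supervised models.
ALiBi-Dv2 substantially reduces channel-level and layer-level positional bias while preserving semantic structure, yielding a more homogeneous feature space.
With frozen features and a linear probe, ALiBi-Dv2 achieves comparable or higher mean IoU than DINOv2 and NoPE on VOC07, VOC12 and ADE20K.
Qualitative feature visualisations (PCA) show that ALiBi-Dv2 preserves object decomposition with reduced positional gradients, improving homogeneity on microstructure images.
By reducing quality problems in segmentation results that stem from positional bias (such as the pore-back effect), ALiBi-Dv2 improves trainable segmentation of materials microscopy images on challenging data.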
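
The per-channel positional bias reported in these findings can be illustrated with a small, simplified probe. The sketch below fits a one-variable least-squares line from each feature channel to a left-right ramp and reports its $R^{2}$. All names (`r2_single_channel`, `left_right_ramp`, `positional_fingerprint`) are hypothetical, the features are synthetic, and the score is computed in-sample. The paper instead uses randomly sampled ramp targets and reports $R^{2}$ on holdout regions, so treat this only as an outline of the idea.

```python
import random


def r2_single_channel(x, y):
    """R^2 of the least-squares line y ~ a*x + b (equal to the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant channel or constant target explains nothing
    return (sxy * sxy) / (sxx * syy)


def left_right_ramp(height, width):
    """Target ramp: 0 in the left column, 1 in the right column, flattened row-major."""
    return [c / (width - 1) for _ in range(height) for c in range(width)]


def positional_fingerprint(features, height, width):
    """features[p][k] is channel k at patch p (row-major). Returns one R^2 per channel."""
    ramp = left_right_ramp(height, width)
    n_channels = len(features[0])
    return [
        r2_single_channel([f[k] for f in features], ramp)
        for k in range(n_channels)
    ]


# Synthetic demo (made-up data): channel 0 leaks horizontal position, channel 1 is noise.
rng = random.Random(0)
H, W = 8, 8
ramp = left_right_ramp(H, W)
features = [[2.0 * t + 0.01 * rng.gauss(0, 1), rng.gauss(0, 1)] for t in ramp]
scores = positional_fingerprint(features, H, W)
```

A channel that encodes horizontal position scores close to 1, while a position-free channel scores close to 0. Repeating this for every layer would give a table of per-channel, per-layer scores similar in spirit to the "positional fingerprint" described in Figure 3.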

Figure 3: Per-channel per-layer ‘positional fingerprint’ of $R^{2}$ scores for DINOv2, DINOv3 and ALiBi-Dv2 for a left-right target ramp. DINOv2 begins with positional information spread across channels (its learned PE is added at the start of the network), which later decreases, whereas for DINOv3


This digest was generated by AI and reviewed by a human editor.