QUICK REVIEW

[Paper Review] Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang|ArXiv.org|Oct 8, 2025

Face Recognition and Perception | Cited by 3

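One-Sentence Summary

This study uses a stable sparse autoencoder to build a large overcomplete dictionary of 32,000 visual concepts from DINOv2, analyzes how downstream tasks recruit these concepts, and proposes the Minkowski Representation Hypothesis to describe the convex, archetype-based geometry of the activations.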

ABSTRACT

DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.

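Motivation and Objectives

Motivate interpretability for vision transformers and operationalize the Linear Representation Hypothesis (LRH).
Build a large, stable dictionary of visual concepts (32,000 atoms) from DINOv2 activations using a sparse autoencoder.
Characterize how downstream tasks (classification, segmentation, depth estimation) selectively recruit concepts.
Characterize the geometry, sparsity, and coherence of the concept dictionary, looking beyond strict sparsity.
Propose the Minkowski Representation Hypothesis (MRH), which describes tokens as formed from convex mixtures of archetypes.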
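To make the last objective concrete, here is a minimal sketch of what "a sum of convex mixtures of archetypes" means numerically. It is an illustration under stated assumptions, not the paper's code: the archetype sets, their sizes, and the names `archetypes`, `alpha`, and `per_head` are hypothetical, and the sketch ignores output projections, residual connections, and MLP layers.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    """Row-wise softmax: outputs are non-negative and sum to 1."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Toy sizes, chosen only for illustration.
n_heads, n_archetypes, d = 3, 5, 8

# Hypothetical archetypes: one small point set per head
# (think "animals", "colors", "textures" from the abstract's example).
archetypes = rng.normal(size=(n_heads, n_archetypes, d))

# Convex mixing weights per head, as a softmax attention pattern would produce.
alpha = softmax(rng.normal(size=(n_heads, n_archetypes)))

# Each head contributes a point inside the convex hull of its own archetypes.
per_head = np.einsum("hj,hjd->hd", alpha, archetypes)

# The token is the sum of those points, i.e. an element of the
# Minkowski sum of the per-head convex hulls.
token = per_head.sum(axis=0)

print("mixing weights sum to 1 per head:", alpha.sum(axis=1))
print("token shape:", token.shape)
```

Under this reading, a token is located by its mixing weights inside each hull, not by a sparse set of independent directions.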

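Proposed Method

Operationalize the LRH with a stable sparse autoencoder that factorizes DINOv2 activations into non-negative codes Z and a dictionary D, with D constrained to conv(A) for stability.
Use a dictionary of c = 32,000 atoms, enforce k = 8 active codes per token, and approximate conv(A) with 128,000 centroids computed from 1.4M ImageNet images.
Train for 50 epochs with Adam, reaching a reconstruction fidelity of R^2 > 88%.
Analyze downstream-task alignment by computing the expected concept importance E(Z W'), used as a measure of concept-task relevance.
Visualize and cluster concept activations to identify task-specific subspaces and archetype-like structure.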
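The factorization described above can be sketched as a single forward pass. This is a toy illustration under assumptions, not the authors' implementation: the softmax parameterization of the convex weights, the linear encoder `W_enc`, the random stand-in data, and the helper names `dictionary`, `encode_topk`, and `r_squared` are all hypothetical, and the sizes are far smaller than the reported c = 32,000 atoms and 128,000 centroids. No training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; only k = 8 matches the value reported in the review.
d, n_centroids, c, k = 16, 64, 32, 8

C = rng.normal(size=(n_centroids, d))         # stand-in for centroids approximating conv(A)
logits = rng.normal(size=(c, n_centroids))    # free parameters behind the dictionary
W_enc = rng.normal(size=(d, c)) / np.sqrt(d)  # hypothetical linear encoder


def dictionary(logits, C):
    """Build atoms as convex combinations of centroids, so D stays in conv(C)."""
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)         # rows are non-negative and sum to 1
    return w @ C, w


def encode_topk(X, W_enc, k):
    """Return non-negative codes with at most k active entries per token."""
    pre = np.maximum(X @ W_enc, 0.0)          # ReLU keeps codes non-negative
    drop = np.argsort(pre, axis=1)[:, :-k]    # indices of all but the k largest
    Z = pre.copy()
    np.put_along_axis(Z, drop, 0.0, axis=1)
    return Z


def r_squared(X, X_hat):
    """Fraction of variance explained by the reconstruction."""
    resid = ((X - X_hat) ** 2).sum()
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total


X = rng.normal(size=(100, d))                 # stand-in for DINOv2 token activations
D, weights = dictionary(logits, C)
Z = encode_topk(X, W_enc, k)
X_hat = Z @ D                                 # reconstruction from codes and dictionary

print("max active codes per token:", int((Z > 0).sum(axis=1).max()))
print("untrained R^2:", r_squared(X, X_hat))
```

Because every atom is a convex combination of centroids, the dictionary cannot drift away from the data, which is one plausible reading of the stability the review mentions. The untrained R^2 printed here is meaningless as a quality number; it only shows how the fidelity metric would be computed.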

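Experimental Results

Research Questions

RQ1: What internal features (concepts) does DINOv2 encode, and how are they organized geometrically?
RQ2: How do downstream tasks (classification, segmentation, depth estimation) recruit different subsets of the learned concepts?
RQ3: Do concepts form functional subspaces or convex archetypes, structures more general than strictly orthogonal directions?
RQ4: What role do token types (cls, reg, spatial) play in concept activation patterns?
RQ5: What are the empirical signatures of the Minkowski Representation Hypothesis in vision transformers?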

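Key Findings

Downstream tasks recruit distinct subsets of concepts: classification draws on a broader set of concepts, while segmentation and depth estimation rely on more localized, low-dimensional subspaces.
Concepts are partly dense and coherent: their inner products show a heavier-tailed distribution than an orthogonal model would predict, and task subspaces are low-dimensional and more aligned than random subsets.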
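The top head-aligned concepts are similar to one another within a task, indicating functional subspaces rather than purely orthogonal directions.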
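Classification gives rise to "Elsewhere" concepts that, conditioned on an object being present, fire on regions away from the object, suggesting structured inference with a form of negation.
Segmentation relies on boundary concepts localized along object contours; these form tight clusters, indicating specialization for edge detection.
Depth concepts cluster into three families: projective-geometry cues, shadow cues, and local frequency changes, reflecting monocular depth cues learned from 2D data.
Register tokens capture global scene properties: concepts that fire only on register tokens (including illumination, motion blur, and camera effects) encode global, non-local features.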
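The coherence finding above can be checked on any dictionary by looking at the distribution of pairwise cosine similarities between atoms. The sketch below is an illustration under assumptions, not the paper's analysis: `clustered_atoms` is a synthetic stand-in for a learned dictionary, and the helper names are hypothetical. The Welch bound is the standard lower bound on the largest pairwise cosine of c unit vectors in d dimensions, which Grassmannian frames aim to approach.

```python
import numpy as np


def pairwise_abs_cosines(D):
    """Absolute cosine similarity for every distinct pair of atoms (rows of D)."""
    U = D / np.linalg.norm(D, axis=1, keepdims=True)
    G = np.abs(U @ U.T)
    iu = np.triu_indices(len(D), k=1)         # strictly upper triangle: each pair once
    return G[iu]


def welch_bound(c, d):
    """Lower bound on the mutual coherence of c unit vectors in d dimensions (c > d)."""
    return np.sqrt((c - d) / (d * (c - 1)))


rng = np.random.default_rng(0)
c, d = 256, 32

# Baseline: isotropic random directions.
random_atoms = rng.normal(size=(c, d))

# Synthetic stand-in for a coherent dictionary: atoms scattered around 8 shared directions.
centers = rng.normal(size=(8, d))
clustered_atoms = centers[rng.integers(0, 8, size=c)] + 0.5 * rng.normal(size=(c, d))

for name, D in [("random", random_atoms), ("clustered", clustered_atoms)]:
    cos = pairwise_abs_cosines(D)
    print(name, "coherence:", cos.max(), "99th percentile:", np.quantile(cos, 0.99))
print("Welch bound:", welch_bound(c, d))
```

A dictionary whose upper quantiles sit well above the random baseline has groups of closely aligned atoms, which is the kind of heavy tail the finding describes.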

This review was generated by AI and reviewed by a human editor.