QUICK REVIEW

[Paper Review] DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet | arXiv (Cornell University) | Apr 14, 2023

Multimodal Machine Learning Applications | References: 131 | Citations: 1,011

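ONE-SENTENCE SUMMARY

DINOv2 trains large-scale self-supervised vision transformers on a curated, diverse dataset, producing transferable, ready-to-use visual features that rival weakly supervised models on image-level and pixel-level tasks.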

ABSTRACT

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

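MOTIVATION AND GOALS

Bring task-agnostic visual representations, analogous to foundation models in natural language processing, to computer vision.
Show that self-supervised pretraining on curated, diverse data yields transferable features without finetuning.
Develop scalable training techniques and a data pipeline that make self-supervision practical for large models.
Show that distillation transfers knowledge from a large model to smaller ones while preserving quality.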

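PROPOSED APPROACH

Combine the DINO and iBOT losses, together with Sinkhorn-Knopp centering and the KoLeo regularizer, to train a discriminative self-supervised model.
Use image-level and patch-level objectives to learn features from a ViT backbone.
Build LVD-142M, a curated and diverse pretraining dataset, with a retrieval-based pipeline that augments curated sources with similar images from uncurated data, without relying on text or metadata.
Adopt efficiency improvements for training (sequence packing, memory-efficient attention, FSDP, efficient stochastic depth) to scale to a ViT with about 1B parameters.
Apply distillation: train smaller models from a frozen large teacher to improve their performance.
Briefly increase the image resolution at the end of training to improve pixel-level performance without paying for full high-resolution training.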
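
Two of the ingredients named above are compact enough to sketch directly. The block below is a minimal, dependency-free illustration, not the authors' implementation: `koleo_loss` follows the usual form of the KoLeo regularizer (negative mean log distance from each l2-normalized feature to its nearest neighbor in the batch), and `sinkhorn_knopp` balances teacher scores across prototypes in the style of SwAV. The function names, the `eps` guard, and the temperature value are assumptions made for illustration; the three iterations follow the setting reported in the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def koleo_loss(features, eps=1e-8):
    """KoLeo regularizer: -(1/n) * sum_i log(distance to nearest other point).

    Features are l2-normalized first; eps is an assumed guard against log(0).
    A lower value means the batch is spread more uniformly.
    """
    points = [l2_normalize(f) for f in features]
    total = 0.0
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += math.log(nearest + eps)
    return -total / len(points)


def sinkhorn_knopp(scores, temperature=0.1, n_iters=3):
    """Turn teacher scores (n samples x k prototypes) into balanced soft targets.

    Alternately equalizes the mass per prototype (columns) and per sample
    (rows), so no single prototype absorbs the whole batch.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Give every prototype (column) the same total mass, 1/k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Give every sample (row) the same total mass, 1/n.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in row] for i, row in enumerate(q)]
    # Rescale so each row is a probability distribution over prototypes.
    return [[v * n for v in row] for row in q]
```

In a training loop, the Sinkhorn-Knopp output would stand in for the teacher's softmax-centered targets, and the KoLeo term would be added to the loss with a small weight; both details are left out of this sketch.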

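EXPERIMENTAL RESULTS

Research Questions

RQ1: Can self-supervised pretraining on a large curated image dataset produce general-purpose visual features that work out of the box across diverse tasks?
RQ2: How do data curation, model scale, and training optimizations affect the quality and transferability of self-supervised visual features?
RQ3: Does distillation from a large self-supervised teacher improve the performance of smaller models on vision benchmarks?
RQ4: What effect does high-resolution training late in pretraining have on pixel-level tasks?
RQ5: How do image-level and patch-level objectives interact to support global and local vision tasks?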

Key Findings

ImageNet-1k top-1 accuracy (%) with frozen features:

Method	Arch.	Data	Text sup.	kNN (val)	Linear (val)	ReaL	V2
OpenCLIP	ViT-G/14	LAION-2B	✓	83.2	86.2	89.4	77.2
EVA-CLIP	ViT-g/14	custom*	✓	83.5	86.4	89.3	77.4
DINOv2	ViT-S/14	LVD-142M	×	79.0	81.1	86.6	70.9
DINOv2	ViT-B/14	LVD-142M	×	82.1	84.5	88.3	75.1
DINOv2	ViT-L/14	LVD-142M	×	83.5	86.3	89.5	78.0
DINOv2	ViT-g/14	LVD-142M	×	83.5	86.5	89.6	78.4
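
The kNN column measures how well frozen features separate classes without any training on top. A minimal sketch of that kind of evaluation follows, assuming the features have already been extracted by a frozen backbone. The function names are hypothetical, and the plain majority vote over cosine-similarity neighbors is a simplifying assumption, since this review does not state the paper's exact protocol (choice of k, vote weighting).

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_feats, train_labels, k=3):
    """Label a query by majority vote among its k most similar training features."""
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: cosine_similarity(query, train_feats[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_feats, test_labels, train_feats, train_labels, k=3):
    """Top-1 accuracy (%) of the kNN classifier on frozen features."""
    correct = sum(
        knn_predict(f, train_feats, train_labels, k) == y
        for f, y in zip(test_feats, test_labels)
    )
    return 100.0 * correct / len(test_labels)
```

Because the backbone is never updated, a score from this procedure reflects the features themselves rather than any task-specific finetuning.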

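DINOv2 clearly outperforms prior self-supervised methods with frozen features across many benchmarks.
A ViT-g (about 1B parameters) trained on LVD-142M matches the ImageNet-1k performance of a model trained on ImageNet-22k, while performing better on other benchmarks.
Distilling from a large DINOv2 teacher gives a ViT-L that outperforms the same architecture trained from scratch.
Training on the curated dataset (LVD-142M) transfers to diverse domains better than training on uncurated data, and generally beats an ImageNet-22k baseline outside the ImageNet domain.
High-resolution training at the end of pretraining captures most of the gains on pixel-level tasks at a fraction of the cost of full high-resolution training.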
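
The curated-data finding depends on retrieving, from an uncurated pool, images that resemble those in curated sources, using image embeddings only. The sketch below illustrates that retrieval step under stated assumptions: embeddings are precomputed by some pretrained image encoder, similarity is cosine similarity, and `retrieve_for_curation` and `per_query` are hypothetical names. The full pipeline described in the paper also involves deduplication and operates at a scale that needs an approximate nearest-neighbor index, neither of which is shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_for_curation(curated, uncurated, per_query=2):
    """Select uncurated images that are close to curated ones.

    curated:   list of embeddings from curated sources (the queries)
    uncurated: dict mapping image id -> embedding for the uncurated pool
    Returns the sorted ids of every uncurated image that is among the
    per_query nearest neighbors of at least one curated embedding.
    """
    selected = set()
    for query in curated:
        ranked = sorted(
            uncurated,
            key=lambda img_id: cosine_similarity(query, uncurated[img_id]),
            reverse=True,
        )
        # A set keeps each image once even if several queries retrieve it.
        selected.update(ranked[:per_query])
    return sorted(selected)
```

No captions, tags, or other metadata enter the selection, which matches the text-free curation described above.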

This review was generated by AI and checked by a human editor.