QUICK REVIEW

[Paper Review] DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet|arXiv (Cornell University)|Apr 14, 2023

Multimodal Machine Learning Applications|References: 131|Citations: 1,011

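One-Sentence Summary

DINOv2 trains large self-supervised vision transformers on a curated, diverse dataset, producing transferable visual features that work out of the box and are competitive with weakly supervised models on image-level and pixel-level tasks.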

ABSTRACT

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

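Motivation and Objectives

Advance task-agnostic visual representations for computer vision, analogous to foundation models in NLP.
Show that self-supervised pretraining on curated, diverse data can produce features that transfer without finetuning.
Develop scalable training techniques and a data pipeline that make self-supervised training of large models feasible.
Show that distillation can transfer knowledge from a large model to smaller models while preserving quality.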

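Proposed Method

Train a discriminative self-supervised model by combining the DINO and iBOT losses with Sinkhorn-Knopp centering and KoLeo regularization.
Use both image-level and patch-level objectives to learn ViT features.
Build LVD-142M, a curated and diverse pretraining dataset, through a retrieval-based pipeline that augments curated data without relying on text or metadata.
Apply efficiency-oriented training improvements (sequence packing, efficient attention, FSDP, stochastic depth) to scale to a 1B-parameter ViT model.
Apply distillation: train smaller models from a frozen large teacher model to improve their performance.
Briefly raise the image resolution at the end of training to improve pixel-level tasks without the cost of full high-resolution training.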
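
To make two of the ingredients listed above more concrete, here is a minimal sketch of KoLeo regularization and Sinkhorn-Knopp centering. It is an illustration under assumptions, not the authors' implementation: the function names `koleo_loss` and `sinkhorn_knopp`, the temperature, the iteration count, and the toy inputs are all hypothetical, and the code uses only the Python standard library where a real training loop would use a tensor library. KoLeo penalizes features that sit close to their nearest neighbor in the batch, which encourages features to spread out; Sinkhorn-Knopp rescales teacher scores so that the prototypes receive roughly equal total mass across the batch.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def koleo_loss(features, eps=1e-8):
    """KoLeo regularizer: -(1/n) * sum_i log(distance from i to its nearest neighbor)."""
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(feats[i], feats[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n


def sinkhorn_knopp(scores, temperature=0.05, n_iters=3):
    """Turn teacher scores (samples x prototypes) into balanced soft assignments."""
    n_samples, n_protos = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: give every prototype the same total mass (1 / n_protos).
        col_sums = [sum(q[i][k] for i in range(n_samples)) for k in range(n_protos)]
        q = [[q[i][k] / (col_sums[k] * n_protos) for k in range(n_protos)]
             for i in range(n_samples)]
        # Row step: give every sample the same total mass (1 / n_samples).
        normalized = []
        for row in q:
            row_sum = sum(row)
            normalized.append([v / (row_sum * n_samples) for v in row])
        q = normalized
    # Rescale so that each sample's assignment sums to 1.
    return [[v * n_samples for v in row] for row in q]


# Toy usage with made-up numbers.
clustered = [[1.0, 0.0], [0.999, 0.01], [0.998, 0.02]]
spread = [[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]]
print("KoLeo (clustered):", koleo_loss(clustered))
print("KoLeo (spread):   ", koleo_loss(spread))

toy_scores = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.1, 0.3], [0.9, 0.0, 0.2]]
for row in sinkhorn_knopp(toy_scores):
    print([round(v, 3) for v in row])
```

In this toy run the clustered features receive a higher KoLeo penalty than the spread ones, and every row returned by `sinkhorn_knopp` is a probability distribution over prototypes.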

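Experimental Results

Research Questions

RQ1 Can self-supervised pretraining on a large, curated image dataset produce general-purpose visual features that work out of the box across diverse tasks?
RQ2 How do data curation, model scaling, and training optimizations affect the quality and transferability of self-supervised visual features?
RQ3 Does distillation from a large self-supervised teacher improve the performance of smaller models on vision benchmarks?
RQ4 How much does a late high-resolution training phase affect pixel-level tasks?
RQ5 How do the image-level and patch-level objectives support both global and local visual tasks?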

Key Findings

Method	Arch.	Data	Text sup.	kNN val	linear val	ReaL	V2
OpenCLIP	ViT-G/14	LAION-2B	✓	83.2	86.2	89.4	77.2
EVA-CLIP	ViT-g/14	custom ∗	✓	83.5	86.4	89.3	77.4
DINOv2	ViT-S/14	LVD-142M	×	79.0	81.1	86.6	70.9
DINOv2	ViT-B/14	LVD-142M	×	82.1	84.5	88.3	75.1
DINOv2	ViT-L/14	LVD-142M	×	83.5	86.3	89.5	78.0
DINOv2	ViT-g/14	LVD-142M	×	83.5	86.5	89.6	78.4
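
The kNN column in the table above scores a nearest-neighbor classifier that runs directly on frozen features, with no finetuning of the backbone. The sketch below shows the general idea under simplifying assumptions: it uses cosine similarity with a plain majority vote, the function names (`knn_predict`, `knn_accuracy`) are hypothetical, and the two-dimensional "features" and labels are made up for illustration. The evaluation protocol behind the reported numbers may differ in details such as the value of k and how votes are weighted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_feats, train_labels, k=3):
    """Label a query by majority vote among its k most similar training features."""
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: cosine(query, train_feats[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_feats, test_labels, train_feats, train_labels, k=3):
    """Fraction of test features whose predicted label matches the true label."""
    correct = sum(
        knn_predict(feat, train_feats, train_labels, k) == label
        for feat, label in zip(test_feats, test_labels)
    )
    return correct / len(test_labels)


# Made-up stand-ins for features that a frozen backbone would produce.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
test_feats = [[0.95, 0.05], [0.05, 0.95]]
test_labels = ["cat", "dog"]

print("kNN accuracy:", knn_accuracy(test_feats, test_labels, train_feats, train_labels))
```

Because the classifier has no trainable parameters, a score of this kind reflects how well the frozen features alone separate the classes.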

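DINOv2 improves markedly over previous self-supervised methods on multiple frozen-feature benchmarks.
The billion-parameter ViT-g model trained on LVD-142M matches models trained on ImageNet-22k on ImageNet-1k and outperforms them on other benchmarks.
Knowledge distillation from a large DINOv2 teacher gives a ViT-L better performance than training it from scratch.
Using the curated dataset (LVD-142M) transfers better to non-ImageNet domains and often surpasses ImageNet-22k baselines.
High-resolution training at the end of pretraining delivers most of the gains on pixel-level tasks at a fraction of the cost of full high-resolution training.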
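
The distillation finding above can be pictured as a student that is trained to match the output distribution of a frozen teacher. The sketch below assumes a DINO-style cross-entropy between temperature-softened teacher and student scores; the function names, the temperatures, and the toy logits are hypothetical and chosen only for illustration, and the full training recipe involves more than this single loss term.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    log_norm = top + math.log(sum(math.exp(z - top) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy of the student distribution against the frozen teacher's."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(pt * ls for pt, ls in zip(teacher_probs, student_log_probs))


# Made-up logits: one student agrees with the teacher, the other does not.
teacher = [2.0, 0.5, -1.0]
aligned_student = [2.0, 0.5, -1.0]
misaligned_student = [-1.0, 0.5, 2.0]

print("loss (aligned):   ", distillation_loss(aligned_student, teacher))
print("loss (misaligned):", distillation_loss(misaligned_student, teacher))
```

The loss is small when the student ranks the outputs as the teacher does and grows when it disagrees, which is the signal that pushes a small model toward the large model's behavior.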


This review was generated by AI and checked by a human editor.