[Paper Review] Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

Priya Goyal, Quentin Duval|arXiv (Cornell University)|Feb 16, 2022

Domain Adaptation and Few-Shot Learning|Citations: 48

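One-Line Summary

tldr: This paper trains a 10-billion-parameter vision model with self-supervised learning on billions of uncurated internet images, achieving better robustness and fairness, and broader semantic representations, than supervised or ImageNet-based pretraining.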

ABSTRACT

Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recover salient information that helps differentiate between the images. Applied to ImageNet, this leads to object centric features that perform on par with supervised features on most object-centric downstream tasks. In this work, we question if using this ability, we can learn any salient and more representative information present in diverse unbounded set of images from across the globe. To do so, we train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn. We scale our model size to dense 10 billion parameters to avoid underfitting on a large data size. We extensively study and validate our model performance on over 50 benchmarks including fairness, robustness to distribution shift, geographical diversity, fine grained recognition, image copy detection and many image classification datasets. The resulting model, not only captures well semantic information, it also captures information about artistic style and learns salient information such as geolocations and multilingual word embeddings based on visual content only. More importantly, we discover that such model is more robust, more fair, less harmful and less biased than supervised models or models trained on object centric datasets such as ImageNet.

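Motivation and Objectives

Investigate the properties of self-supervised vision models trained on diverse, uncurated image data from around the world.
Evaluate how scale (up to 10B parameters) affects robustness, fairness, and generalization to out-of-domain tasks.
Quantify fairness and bias on downstream tasks with respect to gender, skin tone, geography, and age.
Probe whether the model encodes non-object-centric information, such as geographic information and multilingual content, from visual signals alone.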

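Proposed Method

Train a 10B-parameter RegNet-Y architecture with SwAV self-supervised learning on 1B uncurated Instagram images, with no data preprocessing.
To make training at this scale feasible, use Fully Sharded Data Parallel (FSDP) across 496 GPUs and apply dynamic activation checkpointing to manage memory.
Train SwAV with 16,000 prototypes, a temperature of 0.1, and 10 Sinkhorn iterations to derive the prototype assignments.
Evaluate the pretrained model on more than 50 benchmarks covering fairness, robustness, geographic diversity, fine-grained recognition, and image copy detection.
Compare SEER (self-supervised, uncurated data) with supervised ImageNet pretraining and self-supervised ImageNet pretraining across multiple downstream tasks.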
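
The Sinkhorn step in the SwAV configuration listed above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it shows how 10 Sinkhorn-Knopp iterations turn a batch-by-prototype score matrix into soft assignments that spread the batch roughly evenly over the prototypes. The function name `sinkhorn`, the regularization value `epsilon=0.05`, and the toy sizes (8 samples, 4 prototypes, 16-dimensional features) are assumptions made for illustration; the review itself only specifies 16,000 prototypes, a temperature of 0.1, and 10 iterations.

```python
import numpy as np


def sinkhorn(scores, epsilon=0.05, n_iters=10):
    """Turn a (batch, prototypes) score matrix into soft assignments.

    epsilon is an assumed entropy-regularization strength; it is a
    different knob from the softmax temperature of 0.1 named in the review.
    """
    # Subtracting the global max only rescales the matrix, which the
    # normalizations below undo; it keeps exp() from overflowing.
    q = np.exp((scores - scores.max()) / epsilon).T  # (prototypes, batch)
    q /= q.sum()
    n_protos, batch = q.shape
    for _ in range(n_iters):
        # Row step: give every prototype the same total mass (1 / n_protos).
        q /= q.sum(axis=1, keepdims=True)
        q /= n_protos
        # Column step: give every sample the same total mass (1 / batch).
        q /= q.sum(axis=0, keepdims=True)
        q /= batch
    # Rescale so each sample's assignment is a probability distribution.
    return (q * batch).T  # (batch, prototypes)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical toy data: L2-normalized features and prototypes.
    feats = rng.normal(size=(8, 16))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    protos = rng.normal(size=(4, 16))
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    assignments = sinkhorn(feats @ protos.T)
    print(assignments.round(3))
    print("per-sample sums:", assignments.sum(axis=1))
```

In SwAV-style training, assignments of this kind computed from one augmented view serve as targets for the temperature-scaled softmax predictions of another view; that swapped-prediction loss is omitted here to keep the sketch short.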

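Experimental Results

Research Questions

RQ1: When a self-supervised vision model is trained on a vast collection of uncurated images from around the world, what salient information and factors of variation does it capture?
RQ2: Does training a very high-capacity model on diverse, uncurated data yield better robustness, better fairness, and reduced bias compared with training on object-centric supervised datasets?
RQ3: To what extent does such a model capture non-object-centric signals (e.g., geolocation, artistic style, multilingual cues) from visual data alone?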

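Key Findings

Self-supervised pretraining on internet images produces models that are fairer, less biased, and less harmful than supervised or object-centric pretrained models.
The larger model (10B parameters) reduces disparities across gender and skin tone in the embeddings; fairness improves as model size increases.
The model exploits dataset diversity to learn more robust features, improving out-of-distribution generalization across more than 50 benchmarks.
SEER captures non-traditional signals based on visual content alone, such as geolocation and multilingual word representations.
The geographic and demographic diversity of the training data leads to better geographic fairness and better recognition of region-specific objects.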

This review was generated by AI and checked by a human editor.