QUICK REVIEW

[Paper Review] Data Filtering Networks

Alex Chengyu Fang, Albin Madappally Jose|arXiv (Cornell University)|Sep 29, 2023

Multimodal Machine Learning Applications | Citations: 15

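One-Line Summary

The paper trains data filtering networks (DFNs) to curate large-scale image-text datasets. The resulting DFN-2B and DFN-5B datasets are high quality and yield state-of-the-art CLIP models across a range of compute budgets, including 84.4% zero-shot ImageNet accuracy with a ViT-H/14.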

ABSTRACT

Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.

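Motivation and Objectives

Focus on the filtering stage of large-scale dataset construction, motivating data-centric improvements.
Characterize the properties of DFNs that yield high-quality training datasets.
Demonstrate that DFNs can be trained from scratch on public data and still induce strong datasets for CLIP training.
Show that DFN-induced datasets achieve a better accuracy-compute trade-off.
Provide a public, reproducible DFN pipeline and dataset to advance dataset design research.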

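Proposed Method

DFNs are defined as pointwise filters applied to each element of a large candidate pool.
DFNs are evaluated on the DataComp benchmark at each pool size (medium, large, xlarge), using the corresponding CLIP model hyperparameters.
CLIP-based filtering is adopted as the main DFN backbone and compared against alternative filters, such as binary classifiers and M3AE, to assess the robustness of the filtering approach.
A DFN is trained on high-quality data and then applied to filter a large uncurated pool, inducing a training dataset.
DFNs are fine-tuned and ensembled using standard ML techniques (augmentation, initialization, training steps) to improve dataset quality.
Induced models are evaluated on the 38 zero-shot and retrieval tasks in DataComp, reporting ImageNet (IN), IN shifts, VTAB, retrieval, and average metrics.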
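The pointwise filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the embeddings are tiny hand-written stand-ins for the outputs of a CLIP-style DFN, and the names `cosine_similarity`, `pointwise_filter`, and `keep_fraction` are hypothetical. It also assumes that the threshold is set by keeping a fixed top fraction of the pool by score.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# (image embedding, text embedding); hypothetical stand-ins for DFN outputs
Pair = Tuple[Vector, Vector]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pointwise_filter(
    pool: List[Pair],
    score_fn: Callable[[Vector, Vector], float],
    keep_fraction: float,
) -> List[int]:
    """Return the indices (in pool order) of the highest-scoring examples.

    Each example is scored on its own, without looking at other examples;
    only the cut-off depends on the pool (assumption: keep a top fraction).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    scores = [score_fn(image, text) for image, text in pool]
    n_keep = max(1, int(len(pool) * keep_fraction))
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy candidate pool: two well-aligned pairs and two poorly aligned ones.
pool = [
    ([1.0, 0.0], [1.0, 0.0]),   # identical direction
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal
    ([1.0, 1.0], [1.0, 0.9]),   # nearly aligned
    ([1.0, 0.0], [-1.0, 0.0]),  # opposite direction
]
kept = pointwise_filter(pool, cosine_similarity, keep_fraction=0.5)
print(kept)
```

In a real pipeline the score function would be the trained DFN itself, and the kept indices would define the induced training set for the downstream CLIP model.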

Figure 1 : Compute scaling behavior of training CLIP models on various datasets. DFN-2B, the subset of CommonPool (DataComp-12.8B) chosen by our best performing data filtering networks, out-performs all other datasets including OpenAI’s WIT and the previous state-of-the-art CLIP training dataset Dat

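Experimental Results

Research Questions

RQ1: What properties of a DFN lead it to produce high-quality induced datasets for CLIP-style models?
RQ2: Does a filtering model's high ImageNet performance predict better filtering quality and downstream usefulness of the induced dataset?
RQ3: Can DFNs be trained from scratch on public data and induce state-of-the-art datasets without relying on proprietary filters?
RQ4: How do DFN-induced datasets perform relative to existing datasets on ImageNet, distribution shifts, VTAB, and retrieval tasks?
RQ5: Which training choices (augmentation, initialization, training steps) most effectively improve the quality of DFN-induced datasets?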

Key Findings

Dataset	Scale	IN	IN Shifts	VTAB	Retrieval	Average
DC-1B	medium	0.297	0.239	0.346	0.231	0.328
DFN-2B	medium	0.371	0.298	0.388	0.288	0.373
DC-1B	large	0.631	0.508	0.546	0.498	0.537
DFN-2B	large	0.678	0.540	0.555	0.534	0.560
LAION-2B	xlarge	0.731	0.603	0.586	0.589	0.601
OpenAI WIT-400M	xlarge	0.755	0.649	0.586	0.543	0.617
DC-1B	xlarge	0.792	0.679	0.652	0.608	0.663
DFN-2B	xlarge	0.814	0.688	0.656	0.649	0.669
LAION-2B (ViT-G/14-224px)	xlarge	0.801	0.691	0.646	0.635	0.667
DC-1B (CLIPA-v2)	xlarge	0.831	0.740	0.645	0.631	0.684
MetaCLIP	xlarge	0.805	0.700	0.640	0.652	0.667
WebLI	xlarge	0.831	0.734	0.648	0.698	0.692
DFN-5B with 224px	xlarge	0.844	0.738	0.685	0.695	0.710
DFN-5B with 378px	xlarge	0.844	0.738	0.685	0.695	0.710
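To make the comparison in the table easier to read, the short sketch below computes the gain of DFN-2B over DC-1B in the Average column at each DataComp scale. The values are copied from the table rows above; the helper name `gain_over_baseline` is hypothetical and used only for illustration.

```python
# Average-column values copied from the table above.
average_score = {
    ("DC-1B", "medium"): 0.328,
    ("DFN-2B", "medium"): 0.373,
    ("DC-1B", "large"): 0.537,
    ("DFN-2B", "large"): 0.560,
    ("DC-1B", "xlarge"): 0.663,
    ("DFN-2B", "xlarge"): 0.669,
}


def gain_over_baseline(scores, scale, dataset="DFN-2B", baseline="DC-1B"):
    """Absolute difference in average score at one scale, rounded to 3 places."""
    return round(scores[(dataset, scale)] - scores[(baseline, scale)], 3)


gains = {
    scale: gain_over_baseline(average_score, scale)
    for scale in ("medium", "large", "xlarge")
}
for scale, gain in gains.items():
    print(f"{scale}: +{gain:.3f}")
```

On these rows the absolute gain in the Average column is largest at the medium scale and smallest at xlarge.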

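A small contrastive image-text model trained on high-quality data is sufficient to construct state-of-the-art datasets.
Filtering quality does not correlate with performance on image tasks: a filtering model with higher ImageNet accuracy does not necessarily produce better filtering results.
Training DFNs on high-quality data such as HQITP yields better induced datasets on IN, IN shifts, VTAB, and retrieval.
DFN-induced datasets (DFN-2B/DFN-5B) achieve state-of-the-art results at DataComp scales, surpassing LAION-2B, DC-1B, OpenAI WIT, and similar baselines.
A ViT-L/14 trained on DFN-2B reaches 81.4% zero-shot ImageNet accuracy, and a ViT-H/14 trained on DFN-5B reaches 84.4%, outperforming competing datasets at the corresponding compute budgets.
DFN-induced datasets improve robustness and performance on vision tasks and VQA (BLIP2).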

Figure 2 : A high level overview of our pipeline for constructing datasets using DFNs


This review was generated by AI and checked by a human editor.