QUICK REVIEW

[Paper Review] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia, Yinfei Yang|arXiv (Cornell University)|Feb 11, 2021

Multimodal Machine Learning Applications|References: 75|Citations: 1,195

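One-sentence summary

ALIGN learns visual and vision-language representations from a large corpus of noisy image alt-text pairs using a dual encoder trained with a contrastive loss, and achieves state-of-the-art zero-shot and fine-tuned performance on visual and cross-modal retrieval tasks.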

ABSTRACT

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.

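Motivation and objectives

Motivate scalable visual and vision-language representation learning that does not require expensive data curation.
Propose a simple dual-encoder architecture trained with a contrastive loss on over one billion noisy image alt-text pairs.
Show that scale compensates for noise and yields strong transfer performance across visual and cross-modal tasks.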

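Proposed method

Use EfficientNet as the image encoder and BERT as the text encoder, both mapping into a shared embedding space.
Train with a normalized softmax contrastive loss in both directions, image-to-text and text-to-image.
Build a dataset of 1.8B image alt-text pairs by following the Conceptual Captions collection methodology without its heavy post-processing, applying only minimal frequency-based filtering.
Evaluate zero-shot and fine-tuned retrieval on Flickr30K and MSCOCO, and also evaluate on cross-modal benchmarks such as Crisscrossed Captions (CxC).
Demonstrate zero-shot ImageNet classification by feeding class-name prompts to the text encoder.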
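
The two-directional normalized softmax loss described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that matched image and text embeddings share the same batch index, that every other item in the batch acts as a negative, and that the two directional terms are simply summed. The function names and the fixed `temperature` value are placeholders chosen for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.05):
    """Sum of image-to-text and text-to-image softmax losses.

    Assumption: image_embs[i] and text_embs[i] form a matched pair.
    The temperature is a fixed placeholder value in this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Cosine similarities between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: softmax over the texts in each row.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image: softmax over the images in each column.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return i2t + t2i


# Toy batch: correctly paired embeddings should give a much lower loss
# than the same embeddings with the text order reversed.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(toy_images, toy_texts)
mismatched = contrastive_loss(toy_images, toy_texts[::-1])
```

Normalizing both embeddings first means the logits are cosine similarities, so the temperature alone controls how sharply the softmax separates the matched pair from the in-batch negatives.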

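Experimental results

Research questions

RQ1: Can a simple dual encoder trained on a very large, noisy image-text dataset achieve state-of-the-art cross-modal retrieval without heavy filtering?
RQ2: How do scale and data quality trade off when learning visual and vision-language representations?
RQ3: How much transfer performance can be achieved on image classification and image-text retrieval in zero-shot and fine-tuned settings?
RQ4: Does the multilingual extension generalize cross-modal retrieval to non-English data?
RQ5: What are the qualitative properties of the learned embeddings, such as compositionality and support for text + image queries?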

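Key findings

ALIGN achieves state-of-the-art image-text retrieval results on Flickr30K and MSCOCO in both zero-shot and fine-tuned settings.
For zero-shot image classification on ImageNet with class-name prompts, top-1 accuracy reaches 76.4%, comparable to CLIP.
On ImageNet, using the image encoder alone for visual classification reaches 88.64% top-1 accuracy.
On CxC, the retrieval and SITS metrics show sizable gains over prior VSE and cross-attention models, with especially large improvements in image-to-text and text-to-image recall.
The multilingual ALIGN model (ALIGN mling) is trained on more than 100 languages and outperforms several baselines on zero-shot multilingual image-text retrieval on Multi30K, demonstrating cross-lingual capability.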
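
The retrieval results above are reported as recall, which is commonly measured as Recall@K: the fraction of queries whose correct match appears among the top K ranked candidates. The sketch below shows how such a number could be computed from a similarity matrix. It is a simplified illustration, not the benchmark protocol: it assumes exactly one correct candidate per query (the one with the same index), whereas Flickr30K and MSCOCO pair each image with several captions. The names `recall_at_k` and `toy_similarity` are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate ranks in the top k.

    similarity[q][c] is the score between query q and candidate c.
    Assumption: the correct candidate for query q is candidate q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example with three queries and three candidates.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct candidate is ranked second
    [0.1, 0.2, 0.7],  # query 2: correct candidate is ranked first
]
r_at_1 = recall_at_k(toy_similarity, 1)  # 2 of 3 queries hit
r_at_2 = recall_at_k(toy_similarity, 2)  # all 3 queries hit
```

For image-to-text retrieval the rows would be images and the columns captions; transposing the matrix gives the text-to-image direction.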


This review was generated by AI and checked by a human editor.