QUICK REVIEW

[Paper Review] Scalable Pre-training of Large Autoregressive Image Models

Alaaeldin El-Nouby, Michal Klein|arXiv (Cornell University)|Jan 16, 2024

Domain Adaptation and Few-Shot Learning|Cited by 6

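One-line summary

This paper introduces AIM, a family of Vision Transformers pre-trained with an autoregressive objective. It shows that the quality of the image representations scales with both model size and data volume, and that the value of the pre-training objective tracks downstream performance.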

ABSTRACT

This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scale with both the model capacity and the quantity of data, (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7 billion parameter AIM on 2 billion images, that achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the pre-training of LLMs, and does not require any image-specific strategy to stabilize the training at scale.

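Motivation and objectives

Motivate scaling autoregressive pre-training for vision models in the same way it has been scaled for LLMs.
Show that increasing model capacity and data volume improves both the pre-training loss and downstream accuracy.
Show that the autoregressive objective correlates with downstream feature quality, which makes scalable vision models possible.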

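Proposed method

Use a ViT backbone with prefix attention, which keeps pre-training autoregressive while allowing bidirectional attention downstream.
Train with a heavily parameterized patch-level MLP head to improve feature quality.
Pre-train on DFN-2B+ (2 billion images), a mixture of high-quality and diverse data, using a pixel-level MSE loss on normalized patches.
Adopt a prefix-LM training scheme in which the initial patches form the context and the later patches are predicted autoregressively.
Evaluate the features on 15 benchmarks using attentive probing on top of a frozen trunk.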
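
The pixel-level MSE loss on normalized patches can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `eps` term, the choice of per-patch mean/variance standardization, and the one-step shift (the prediction at position k is scored against patch k+1) are all assumptions made for the example.

```python
import math

def normalize_patch(patch, eps=1e-6):
    """Standardize one flattened patch to zero mean and (near) unit variance.

    Assumption: normalization is done per patch; eps is a hypothetical
    stabilizer for constant patches.
    """
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return [(p - mean) / math.sqrt(var + eps) for p in patch]

def next_patch_mse(predictions, patches):
    """Mean squared error between predictions and normalized next patches.

    predictions: K-1 predicted patches; predictions[k] is the model output
                 at position k and is compared with patch k+1 (assumed shift).
    patches:     K flattened raw patches in raster order.
    """
    targets = [normalize_patch(p) for p in patches[1:]]
    total, count = 0.0, 0
    for pred, tgt in zip(predictions, targets):
        for a, b in zip(pred, tgt):
            total += (a - b) ** 2
            count += 1
    return total / count

# Usage: three toy patches of four pixels each.
patches = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
perfect = [normalize_patch(p) for p in patches[1:]]
zeros = [[0.0] * 4 for _ in patches[1:]]
loss_perfect = next_patch_mse(perfect, patches)
loss_zeros = next_patch_mse(zeros, patches)
```

Because the targets are standardized, a model that always predicts zero gets a loss close to 1, which gives a simple reference point for reading the loss value.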

Figure 2: AIM pre-training overview. Input images are split into non-overlapping patches and embedded linearly following Dosovitskiy et al. [29]. The patch features are fed to a transformer in which the self-attention operation is causally masked to prevent attending to subsequent positions. Afterw…

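Experimental results

Research questions

RQ1: Does an autoregressive objective scale visual representations as effectively as it does for LLMs?
RQ2: How do model size and data volume affect AIM's pre-training loss and downstream feature quality?
RQ3: Which architectural choices (prefix attention, MLP head) optimize downstream transfer?
RQ4: Does performance saturate under large-scale pre-training of autoregressive vision models?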

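Key findings

AIM's performance scales as model size grows from 600M to 7B parameters.
The pre-training loss on validation data correlates with downstream feature quality.
Pre-training on 2B+ images yields strong downstream performance, with no clear saturation observed.
DFN-2B+, the mixture of DFN-2B and IN-1k, gives the best downstream results among the pre-training datasets tested.
Under the same settings, the autoregressive objective outperforms a masking objective.
With attentive probing on a frozen trunk, AIM-7B achieves strong results across 15 benchmarks.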
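
Attentive probing on a frozen trunk can be illustrated with a small sketch: a learned query attention-pools the frozen patch features, and a linear classifier reads the pooled vector. This is a simplified single-head version written for illustration. The function names, the scaled dot-product scoring, and the absence of any normalization layer are assumptions, not details taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attentive_pool(patch_features, query):
    """Pool frozen patch features with attention weights from one query.

    patch_features: list of per-patch feature vectors (never updated).
    query:          trainable vector of the same dimension (hypothetical).
    """
    dim = len(query)
    weights = softmax([dot(query, f) / math.sqrt(dim) for f in patch_features])
    return [sum(w * f[d] for w, f in zip(weights, patch_features))
            for d in range(dim)]

def probe_logits(patch_features, query, class_weights, class_bias):
    """Linear classifier on top of the attention-pooled feature."""
    pooled = attentive_pool(patch_features, query)
    return [dot(w, pooled) + b for w, b in zip(class_weights, class_bias)]

# Usage: two patches, two feature dimensions, two classes.
features = [[1.0, 0.0], [0.0, 1.0]]
pooled = attentive_pool(features, query=[0.0, 0.0])  # zero query: uniform weights
logits = probe_logits(features, [0.0, 0.0],
                      class_weights=[[1.0, 0.0], [0.0, 2.0]],
                      class_bias=[0.0, 1.0])
```

Only `query`, `class_weights`, and `class_bias` would be trained in this setup. The patch features come from the frozen trunk and stay fixed.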

Figure 3: Prefix causal attention. During pre-training we uniformly sample a prefix length $S$. The attention for the first $S$ patches is set to be bidirectional and loss is only computed for the remaining patches in the image. During adaptation to downstream tasks, this allows us to drop the at…
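
The prefix causal attention described in this caption can be sketched as a boolean mask plus a per-patch loss mask. The code below is an illustrative sketch under stated assumptions: the function names are hypothetical, and the sampling range for the prefix length (1 to K-1 for K patches) is an assumption, since the caption only says the length is sampled uniformly.

```python
import random

def prefix_causal_mask(num_patches, prefix_len):
    """Return mask[i][j] == True when query patch i may attend to key patch j.

    Patches inside the prefix attend to each other bidirectionally;
    every other patch attends only to itself and to earlier patches.
    """
    mask = []
    for i in range(num_patches):
        row = []
        for j in range(num_patches):
            causal = j <= i
            in_prefix = i < prefix_len and j < prefix_len
            row.append(causal or in_prefix)
        mask.append(row)
    return mask

def loss_positions(num_patches, prefix_len):
    """True for patches that contribute to the loss (those outside the prefix)."""
    return [k >= prefix_len for k in range(num_patches)]

def sample_prefix_len(num_patches, rng=random):
    """Assumed range: uniform over 1 .. num_patches - 1 (both inclusive)."""
    return rng.randint(1, num_patches - 1)

# Usage: 4 patches with a prefix of length 2.
mask = prefix_causal_mask(4, 2)
in_loss = loss_positions(4, 2)
sampled = sample_prefix_len(16, random.Random(0))
```

With `prefix_len` set to the full sequence length, the mask becomes fully bidirectional, which matches the caption's remark about downstream adaptation in spirit. That reading is an inference from the truncated caption.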

This review was generated by AI and checked by a human editor.