QUICK REVIEW

[Paper Review] Florence: A New Foundation Model for Computer Vision

Lu Yuan, Dongdong Chen|arXiv (Cornell University)|Nov 22, 2021

Multimodal Machine Learning Applications | References: 55 | Citations: 340

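One-line summary

Florence is a large-scale vision-language foundation model that extends representations from scenes to objects, from images to videos, and from RGB to multiple modalities, achieving state-of-the-art transfer performance and broad task adaptability.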

ABSTRACT

Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks. Florence achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

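Motivation and objectives

Define a computer vision foundation model, consisting of a pretrained model plus adapters, that covers diverse tasks along the space, time, and modality axes.
Build a unified, web-scale image-text pretraining framework with a two-tower architecture.
Develop adapters for object-level, video, and vision-language tasks to achieve broad transferability.
Optimize the training infrastructure so that pretraining scales efficiently on large datasets.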

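Proposed method

Curate FLD-900M, a dataset of 900 million image-text pairs, through filtering, for unified image-text contrastive learning based on UniCL.
Pretrain the two-tower Florence model, with an image encoder (CoSwin, a hierarchical Vision Transformer) and a language encoder (a 12-layer Transformer), using UniCL in the image-label-description space.
Extend the representations to the object level with the Dynamic Head adapter and the FLOD-9M dataset, which is used for object detection pretraining.
Achieve fine-grained fusion with the METER adapter, and add vision-language (V+L) capability by pretraining with image-text matching (ITM) and masked language modeling (MLM) losses.
Adapt to video with the Video CoSwin adapter, which converts 2D tokens into 3D tokens and adjusts the attention and positional embeddings accordingly.
Employ scalable training techniques (ZeRO, activation checkpointing, mixed precision, gradient cache) to enable large-batch, large-scale training.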
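To make the UniCL-style objective in the list above more concrete, here is a minimal NumPy sketch of a bidirectional contrastive loss in which every image-text pair sharing a label counts as a positive. It is an illustration under stated assumptions, not the authors' implementation: the function name `unified_contrastive_loss`, the default temperature of 0.07, and the choice to average over the batch are all assumptions, and the random or identity embeddings used with it stand in for the outputs of the image and language encoders.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def unified_contrastive_loss(image_emb, text_emb, labels, temperature=0.07):
    """Bidirectional contrastive loss where pairs with equal labels are positives.

    image_emb, text_emb: arrays of shape (B, D), one row per image-text pair.
    labels: length-B sequence; pairs with the same label are treated as
            positives of each other (hypothetical stand-in for text-derived labels).
    """
    # L2-normalize both towers so the dot product is a cosine similarity.
    u = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    v = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = u @ v.T / temperature  # shape (B, B): images as rows, texts as columns

    labels = np.asarray(labels)
    positives = (labels[:, None] == labels[None, :]).astype(float)

    # Image-to-text: softmax over texts, averaged over each image's positives.
    i2t = -(positives * log_softmax(logits, axis=1)).sum(axis=1) / positives.sum(axis=1)
    # Text-to-image: softmax over images, averaged over each text's positives.
    t2i = -(positives * log_softmax(logits, axis=0)).sum(axis=0) / positives.sum(axis=0)

    # Assumption: average over the batch rather than sum.
    return float(i2t.mean() + t2i.mean())
```

When every label in the batch is unique, the positive mask is the identity matrix and the loss reduces to the familiar CLIP-style symmetric cross-entropy; repeated labels are what let image-label data and image-description data share a single objective.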

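Experimental results

Research questions

RQ1: What constitutes a true computer vision foundation model that spans space, time, and modality?
RQ2: Can a single pretrained model with lightweight adapters achieve state-of-the-art performance on diverse CV tasks (classification, retrieval, detection, VQA, captioning, and video tasks) across zero-shot, few-shot, and full fine-tuning regimes?
RQ3: How do web-scale image-text data and a unified learning objective affect transferability across vision tasks and modalities?

Key findings

Florence achieves new state-of-the-art results on the majority of 44 representative benchmarks, including ImageNet-1K zero-shot top-1 accuracy of 83.74 and top-5 accuracy of 97.18.
Fine-tuning on COCO reaches 62.4 mAP; the VQA score reaches 80.36; Kinetics-600 accuracy reaches 87.8%.
Zero-shot transfer wins on 9 of 12 classification tasks, and linear probing wins on 9 of the 11 datasets in the evaluation suite.
Zero-shot image-text retrieval on Flickr30K and MSCOCO yields competitive or superior results relative to prior zero-shot methods.
Object detection with FLOD-9M and Dynamic Head achieves strong AP on detection benchmarks such as COCO (for example, a COCO AP of 62.0 with fine-tuning).
On the CD-FSL benchmark, Florence shows strong cross-domain few-shot results, outperforming prior single-model baselines in several settings.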
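The zero-shot top-1 and top-5 figures above come from the usual two-tower evaluation protocol: embed one text prompt per class, embed each image, and rank classes by cosine similarity. The sketch below shows that protocol in NumPy under stated assumptions; the function name `zero_shot_topk_accuracy` is hypothetical, and the input arrays are placeholders for encoder outputs rather than calls into any Florence API.

```python
import numpy as np


def zero_shot_topk_accuracy(image_emb, class_text_emb, targets, k=1):
    """Fraction of images whose true class is among the k most similar class prompts.

    image_emb: (N, D) image embeddings.
    class_text_emb: (C, D) text embeddings, one per class prompt.
    targets: length-N sequence of true class indices in [0, C).
    """
    # Normalize so that the dot product equals cosine similarity.
    u = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    v = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    scores = u @ v.T  # shape (N, C)

    # Indices of the k highest-scoring classes for each image.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == np.asarray(targets)[:, None]).any(axis=1)
    return float(hits.mean())
```

Calling the function with `k=1` and `k=5` on the same embeddings gives the two numbers typically reported together; no classifier weights are trained, which is what makes the evaluation zero-shot.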


This review was generated by AI and checked by a human editor.