QUICK REVIEW

[Paper Review] VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang, Xiujun Li | arXiv (Cornell University) | Jan 2, 2021

Multimodal Machine Learning Applications | References: 42 | Citations: 60

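One-Sentence Summary

The authors develop a large-scale, object-centric visual detection model trained on multiple datasets to produce richer visual features. Integrating it with the enhanced Oscar+ VL pre-training pipeline, they achieve new state-of-the-art results on seven vision-language tasks.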

ABSTRACT

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model Oscar (Li et al., 2020), and utilize an improved approach Oscar+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.

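Motivation and Objectives

Show that richer visual features substantially affect vision-language performance.
Develop a large-scale object detection model for VL tasks that covers a diverse set of objects and attributes.
Pre-train and fine-tune a unified vision-language model (Oscar+) on the enhanced visual features to improve multiple VL benchmarks.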

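Proposed Method

Pre-train a large-scale object detector on a unified corpus that merges COCO, OpenImages, Objects365, and Visual Genome, covering 1,848 object classes and 524 attributes.
Insert an attribute branch and fine-tune on Visual Genome to strengthen object-attribute detection.
Use an efficient region feature extractor that speeds up feature extraction for VL tasks.
Pre-train Oscar+ with a 3-way contrastive loss that aligns captions/QA with image tags and regions.
Fine-tune Oscar+ on seven VL tasks, including VQA, GQA, NLVR2, image captioning, NoCaps, and image/text retrieval.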
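
The 3-way contrastive pre-training step above can be illustrated with a minimal sketch. It assumes each training example is a (text, tags, regions) triplet, that a "polluted" triplet is formed by swapping in the text or the tags of another example, and that a classifier then predicts one of three labels. The function names `make_triplet` and `three_way_cross_entropy`, the label constants, and the sampling probabilities are hypothetical choices for illustration and are not taken from the authors' code; plain strings stand in for tokenized text and region features.

```python
import math
import random

# Hypothetical label scheme for the 3-way classification target.
MATCHED, POLLUTED_TEXT, POLLUTED_TAGS = 0, 1, 2


def make_triplet(index, corpus, rng, p_matched=0.5):
    """Build one (text, tags, regions, label) training example.

    corpus is a list of dicts with keys 'text', 'tags', 'regions'.
    p_matched is an assumed sampling probability, not a reported value.
    """
    sample = corpus[index]
    text, tags, regions = sample["text"], sample["tags"], sample["regions"]
    if rng.random() < p_matched:
        return text, tags, regions, MATCHED
    # Borrow text or tags from a different example to create a mismatch.
    other = rng.choice([i for i in range(len(corpus)) if i != index])
    if rng.random() < 0.5:
        return corpus[other]["text"], tags, regions, POLLUTED_TEXT
    return text, corpus[other]["tags"], regions, POLLUTED_TAGS


def three_way_cross_entropy(logits, label):
    """Cross-entropy of a 3-class prediction, computed stably from raw logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]
```

In a real pipeline the three logits would come from a classification head on the fusion model's pooled output; here they are simply a list of floats, so the sketch only shows how samples are labeled and scored.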

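Experimental Results

Research Questions

RQ1: Does improving the quality and diversity of visual features improve performance across vision-language tasks?
RQ2: Can a larger, more diverse object-centric detector, combined with a Transformer-based VL fusion model, improve downstream VL understanding and generation tasks?
RQ3: Which design choices in data, model architecture, and pre-training objectives contribute most to the VL gains?
RQ4: How do the new visual features affect performance on both understanding tasks (VQA, GQA, NLVR2) and generation/retrieval tasks (captioning, NoCaps, retrieval)?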

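Key Findings

Replacing earlier OD features with VinVL's richer region features yields consistent, state-of-the-art gains on seven VL tasks.
The gains from VinVL are substantial: the analysis attributes roughly 95% of the overall improvement to better visual features.
The new object detector broadens coverage of semantically meaningful regions and enriches object concepts and attributes.
With VinVL, Oscar+ achieves new SOTA on VQA, GQA, NLVR2, NoCaps, and retrieval tasks, and shows competitive results on image captioning.
Efficient region feature extraction and the incorporation of attributes speed up inference without loss of accuracy.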

This review was generated by AI and reviewed by a human editor.