QUICK REVIEW

[Paper Review] MiniVLM: A Smaller and Faster Vision-Language Model

Jianfeng Wang, Xiaowei Hu | arXiv (Cornell University) | Dec 13, 2020

Multimodal Machine Learning Applications | References: 57 | Citations: 29

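Summary in Brief

MiniVLM is a compact, efficient vision-language model that retains 94–97% of the accuracy of the state-of-the-art model OSCAR${}_{\text{B}}$ while cutting model size by 73% and FLOPs by 99%. It combines a Two-stage Efficient feature Extractor (TEE) for fast visual feature extraction with a MiniLM-based transformer whose pre-training is strengthened by pseudo-labeled Open Images data and high-quality image tags.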

ABSTRACT

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great value in practice but is less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be finetuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules, a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by $95\%$, compared to a baseline model. We adopt the MiniLM structure to reduce the computation cost of the transformer module after comparing different compact BERT models. In addition, we improve the MiniVLM pre-training by adding $7M$ Open Images data, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. The large models are used offline without adding any overhead in fine-tuning and inference. With the above design choices, our MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$ while being able to retain $94-97\%$ of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of the state-of-the-art VL research for on-the-edge applications.
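The abstract notes that the large captioning and tagging models are used offline only. A minimal sketch of that idea follows, assuming the teacher models can be wrapped as plain callables. The names `PretrainExample`, `build_pretraining_corpus`, `caption_fn`, and `tag_fn` are hypothetical and do not come from the paper, and the stand-in teachers below return fixed strings rather than real model outputs.

```python
from typing import Callable, Iterable, List, NamedTuple


class PretrainExample(NamedTuple):
    """One cached pre-training record (hypothetical structure)."""
    image_id: str
    caption: str      # pseudo caption produced by the large captioning model
    tags: List[str]   # image tags produced by the large tagging model


def build_pretraining_corpus(
    image_ids: Iterable[str],
    caption_fn: Callable[[str], str],
    tag_fn: Callable[[str], List[str]],
) -> List[PretrainExample]:
    """Run the large teacher models once, offline, and cache their outputs.

    The small model later trains on the cached records, so the teachers
    are never called during fine-tuning or inference.
    """
    return [PretrainExample(i, caption_fn(i), tag_fn(i)) for i in image_ids]


# Stand-in teachers for illustration only; real ones would be large models.
def fake_captioner(image_id: str) -> str:
    return f"a photo for {image_id}"


def fake_tagger(image_id: str) -> List[str]:
    return ["object", "scene"]


corpus = build_pretraining_corpus(["img_0", "img_1"], fake_captioner, fake_tagger)
for example in corpus:
    print(example.image_id, example.caption, example.tags)
```

Because the teacher outputs are stored as data, their cost is paid once during corpus construction and does not affect the size or latency of the deployed model.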

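Motivation and Objectives

Develop a lightweight vision-language model suited to deployment on resource-constrained devices.
Reduce the computational cost of visual feature extraction without sacrificing downstream task performance.
Improve pre-training of the small model by leveraging large models and large-scale datasets.
Achieve high accuracy with minimal parameter count and inference cost, making edge deployment feasible.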

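Proposed Method

Design a Two-stage Efficient feature Extractor (TEE), inspired by EfficientDet, that cuts visual feature extraction cost by 99% relative to Faster R-CNN.
Adopt the MiniLM architecture for the vision-language transformer to minimize computation while preserving performance.
Pre-train MiniVLM on 7 million Open Images samples pseudo-labeled by a state-of-the-art captioning model.
Use high-quality image tags from a strong tagging model to strengthen cross-modal alignment during pre-training.
Keep the large models out of fine-tuning and inference, using them only offline to generate pre-training data and for distillation.
Streamline the vision module by simplifying the region head and replacing standard convolutions with depthwise and pointwise convolutions.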
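To see why swapping a standard convolution for a depthwise plus pointwise pair saves parameters, the counts can be compared directly. This is a minimal sketch of the general technique, not the TEE configuration: the 3×3 kernel and 64-channel sizes are illustrative assumptions, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes only; these are not taken from the paper.
kernel, c_in, c_out = 3, 64, 64
std = standard_conv_params(kernel, c_in, c_out)
sep = separable_conv_params(kernel, c_in, c_out)
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.2f}x")
```

For these sizes the separable form needs 4,672 weights against 36,864 for the standard layer, roughly an 8-fold saving. The multiply-accumulate count per output position shrinks by the same factor.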

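Experimental Results

Research Questions

RQ1: Can a markedly smaller and faster vision-language model match the performance of large models?
RQ2: How effective is a lightweight two-stage detector for visual feature extraction in vision-language tasks?
RQ3: How much do pre-training on pseudo-labeled data and high-quality tags contribute to the performance of a small model?
RQ4: What is the best trade-off among model size, FLOPs, and accuracy for vision-language models?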

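Key Findings

On COCO image captioning, MiniVLM reaches 97% of the CIDEr score of OSCAR${}_{\text{B}}$ (119.8 vs. 123.7) with 27% of the parameters.
Across multiple downstream tasks, it retains 94–97% of the accuracy while reducing FLOPs by 99% (to 1% of OSCAR${}_{\text{B}}$).
Using high-quality image tags during pre-training improves CIDEr by more than 2 points and VQA accuracy by more than 1 point.
TEE-0, with a backbone similar to EfficientDet-D0, is 3.7 times smaller and 99 times faster than the R101 Faster R-CNN, with comparable detection mAP on Visual Genome.
The MiniLM-based transformer outperforms other compact BERT variants on the speed-accuracy trade-off in vision-language tasks.
Randomly initializing the transformer gives results on par with text-pre-trained weights, suggesting that small models can learn effectively from self-supervised pre-training.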
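The percentages quoted in this review can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated above and in the abstract. The helper names are hypothetical, and the speedup conversion assumes that a "reduction" means a fractional cut in cost.

```python
def retained_fraction(small_score: float, large_score: float) -> float:
    """Share of the larger model's score that the smaller model keeps."""
    return small_score / large_score


def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional cost reduction (e.g. 0.94) into a speedup factor."""
    return 1.0 / (1.0 - reduction)


# CIDEr on COCO captioning: MiniVLM 119.8 vs. OSCAR_B 123.7
cider_ratio = retained_fraction(119.8, 123.7)

# A 73% reduction in model size leaves 27% of the parameters.
remaining_params = 1.0 - 0.73

# Time reductions quoted in the abstract, expressed as speedup factors.
end_to_end_speedup = speedup_from_reduction(0.94)   # whole-model inference
extractor_speedup = speedup_from_reduction(0.95)    # visual feature extraction

print(f"CIDEr retained:       {cider_ratio:.1%}")
print(f"Parameters remaining: {remaining_params:.0%}")
print(f"Inference speedup:    {end_to_end_speedup:.1f}x")
print(f"Extraction speedup:   {extractor_speedup:.1f}x")
```

The CIDEr ratio comes to about 96.8%, consistent with the rounded 97% figure. The 94% and 95% time reductions correspond to speedups of roughly 16.7 and 20 times.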


This review was generated by AI and checked by a human editor.