QUICK REVIEW

[Paper Review] Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue|arXiv (Cornell University)|Apr 29, 2022

Multimodal Machine Learning Applications | Citations: 1,238

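One-Line Summary

tldr: Flamingo is a Visual Language Model that conditions a frozen large language model on interleaved visual inputs through a Perceiver-based visual resampler and gated cross-attention. It achieves strong few-shot learning across a range of image/video and language tasks and enables open-ended generation without task-specific fine-tuning.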

ABSTRACT

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

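Motivation and Objectives

Enable rapid adaptation to novel multimodal tasks with minimal annotated data.
Bridge pretrained vision-only models and language models so that interleaved visual and textual data can be handled.
Enable open-ended language generation conditioned on images/videos, rather than fixed outputs.
Evaluate few-shot performance on diverse vision-language benchmarks and analyze design choices.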

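Proposed Method

Use a frozen large language model (Chinchilla) as the backbone and insert trainable cross-attention blocks that condition it on visual inputs.
Represent images/videos with a Perceiver Resampler, which produces a fixed number of visual tokens from variable-size feature maps.
Interleave text and visual tokens in the prompt so that the model predicts the next text token conditioned on the preceding text and the visual inputs that appear before it.
Train on a mixture of web-scraped vision-language data (interleaved HTML text and images, image-text pairs, and video-text pairs) to support in-context learning.
Adopt a tanh-gated cross-attention mechanism that fuses visual information while preserving the LM weights and training stability.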
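
The two architectural pieces above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses single-head attention, a single resampler layer, random weights, and hypothetical names (`resample`, `gated_cross_attention`, `gate`), and it assumes the gate starts at zero. It only shows that (a) a fixed set of learned latents turns any number of visual features into a fixed number of visual tokens, and (b) a tanh gate at zero leaves the language model's hidden states unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over the context rows."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def resample(features, latents, w_q, w_k, w_v):
    """Simplified resampler: learned latents query the visual features,
    so the output always has len(latents) tokens."""
    return latents + cross_attention(latents, features, w_q, w_k, w_v)

def gated_cross_attention(text_hidden, visual_tokens, w_q, w_k, w_v, gate):
    """Residual cross-attention whose contribution is scaled by tanh(gate)."""
    attended = cross_attention(text_hidden, visual_tokens, w_q, w_k, w_v)
    return text_hidden + np.tanh(gate) * attended

rng = np.random.default_rng(0)
dim, num_latents = 16, 4  # illustrative sizes, not the paper's
latents = rng.normal(size=(num_latents, dim))
rs_w = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]
xa_w = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]

image_features = rng.normal(size=(49, dim))   # e.g. one flattened 7x7 feature map
video_features = rng.normal(size=(196, dim))  # e.g. several frames, flattened
image_tokens = resample(image_features, latents, *rs_w)
video_tokens = resample(video_features, latents, *rs_w)

text_hidden = rng.normal(size=(10, dim))      # stand-in for frozen-LM hidden states
closed = gated_cross_attention(text_hidden, image_tokens, *xa_w, gate=0.0)
opened = gated_cross_attention(text_hidden, image_tokens, *xa_w, gate=1.0)
```

Both `image_tokens` and `video_tokens` have the same shape despite different input lengths, and `closed` equals `text_hidden` because tanh(0) = 0. Under the zero-initialization assumption, the model therefore starts out behaving like the frozen LM, and visual information is mixed in only as the gate is learned.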

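Experimental Results

Research Questions

RQ1: Can a visual language model perform diverse multimodal tasks in a few-shot setting without task-specific fine-tuning?
RQ2: Which architectural components are best suited to conditioning a frozen LM on interleaved visual inputs, including images/videos of varying input length?
RQ3: How does training on a mixture of interleaved vision-language data and paired data affect generalization and few-shot adaptation?
RQ4: To what extent can in-context prompting with a few examples drive open-ended tasks such as captioning and visual question answering?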

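Key Findings

Flamingo achieved new state-of-the-art few-shot performance on 16 multimodal tasks.
On 6 of these tasks, Flamingo matches or outperforms the fine-tuned SotA while using only 32 task-specific examples.
Both model scale and the number of shots improve few-shot performance, and larger models make better use of additional shots.
The architecture, with gated cross-attention and the Perceiver Resampler, conditions the frozen LM on interleaved visual information while maintaining training stability.
When fine-tuned with more annotated data, Flamingo sets a new SotA on several tasks (VQAv2, VATEX, VizWiz, MSRVTTQA, HatefulMemes).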
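
To make the few-shot setting concrete, here is a small sketch of how an interleaved prompt might be assembled from a handful of support examples plus a query image. The tag strings `<image>` and `<EOC>`, the `Output:` prefix, and the function name are illustrative assumptions rather than a confirmed Flamingo interface. The point is only that each support example contributes one image and its text, and the query contributes an image followed by an open slot for the model to complete.

```python
IMAGE_TAG = "<image>"   # hypothetical marker for where visual tokens are attended
END_OF_CHUNK = "<EOC>"  # hypothetical separator between examples

def build_few_shot_prompt(shots, query_image, prefix="Output:"):
    """Assemble an interleaved prompt from (image, text) support pairs and a query image.

    Returns the text prompt and the ordered list of images it refers to.
    """
    parts, images = [], []
    for image, text in shots:
        images.append(image)
        parts.append(f"{IMAGE_TAG}{prefix} {text}{END_OF_CHUNK}")
    # The query image comes last; its text is left open for the model to generate.
    images.append(query_image)
    parts.append(f"{IMAGE_TAG}{prefix}")
    return "".join(parts), images

shots = [("cat.jpg", "A cat on a sofa."), ("dog.jpg", "A dog in a park.")]
prompt, images = build_few_shot_prompt(shots, "bird.jpg")
```

Here the file names are placeholders for image inputs. Moving from 2 shots to 32 shots only changes the length of `shots`, with no change to model weights, which is what distinguishes this setting from fine-tuning.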

This review was generated by AI and checked by a human editor.