QUICK REVIEW

[Paper Review] Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Ivona Najdenkoska, Xiantong Zhen | arXiv (Cornell University) | Feb 28, 2023

Multimodal Machine Learning Applications | Citations: 8

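One-line summary

This paper proposes a multimodal few-shot meta-learning framework that uses a lightweight meta-mapper to bridge a frozen vision model (CLIP ViT) and a frozen language model (GPT-2), enabling data-driven task induction and rapid adaptation with only a few gradient updates. Episodic training improves both cross-domain and in-domain performance while remaining computationally efficient.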

ABSTRACT

Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models and leverage their already learned capacity. By updating the learnable parameters only of the meta-mapper, it learns to accrue shared meta-knowledge among these tasks. Thus, it can rapidly adapt to newly presented samples with only a few gradient updates. Importantly, it induces the task in a completely data-driven manner, with no need for a hand-engineered task induction. We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words and answer visual questions by observing only a limited set of labeled examples. The experimental results show that our meta-learning approach outperforms the baseline across multiple datasets and various training settings while being computationally more efficient.

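Motivation and objectives

Speed up adaptation in multimodal few-shot settings by reducing reliance on hand-engineered task prompts.
Introduce a learnable, data-driven bridge (the meta-mapper) between a frozen vision backbone and a frozen language backbone.
Show that meta-learning across related multimodal tasks yields meta-knowledge that transfers to new tasks.
Demonstrate strong performance on several multimodal benchmarks while remaining computationally efficient.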

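Proposed method

Three-part architecture: a frozen vision encoder (CLIP ViT-B/32), a frozen language model (GPT-2) with its embedding layer, and a trainable meta-mapper.
Visual prefix: the self-attention-based meta-mapper maps visual features to a small set of learnable prefix tokens that the language model consumes.
Meta-learning setup: the inner loop adapts task-specific parameters with a few gradient steps; the outer loop updates the meta-parameters shared across tasks.
Autoregressive generation: the language model generates output conditioned on the learned visual prefix and the previously generated tokens. Inference uses top-k nucleus sampling.
Training protocol: episodic meta-training over a distribution of multimodal tasks. Datasets are restructured into N-way, k-shot tasks, and meta-testing is performed on unseen tasks.
Implementation specifics: CLIP ViT and GPT-2, four learnable 768-dimensional meta-prefix tokens, five inner-loop steps, AdamW for the meta-update, and about 2 million trainable parameters in total.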
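
The inner/outer loop described above can be sketched on a toy problem. This is a minimal first-order illustration under stated assumptions, not the paper's implementation: a single scalar `w_meta` stands in for the meta-mapper parameters, the tasks are hypothetical 1-D regressions, and plain gradient descent replaces AdamW in the outer loop. The five inner steps mirror the setting listed above; all function names and learning rates are illustrative.

```python
import random


def make_task(rng, n_support=5, n_query=5):
    """Sample one toy task: fit y = slope * x with a task-specific slope."""
    slope = rng.uniform(1.0, 3.0)

    def draw(n):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        return [(x, slope * x) for x in xs]

    return draw(n_support), draw(n_query)


def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad(w, data):
    """Derivative of mse with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def inner_adapt(w_meta, support, steps=5, lr=0.1):
    """Inner loop: a few gradient steps starting from the shared parameter."""
    w = w_meta
    for _ in range(steps):
        w -= lr * grad(w, support)
    return w


def meta_train(episodes=200, tasks_per_batch=4, meta_lr=0.05, seed=0):
    """Outer loop: update the shared parameter from query-set gradients."""
    rng = random.Random(seed)
    w_meta = 0.0
    for _ in range(episodes):
        meta_grad = 0.0
        for _ in range(tasks_per_batch):
            support, query = make_task(rng)
            w_task = inner_adapt(w_meta, support)
            # First-order approximation (an assumption of this sketch):
            # the query gradient at the adapted parameter is applied
            # directly to the shared parameter.
            meta_grad += grad(w_task, query)
        w_meta -= meta_lr * meta_grad / tasks_per_batch
    return w_meta


if __name__ == "__main__":
    print(f"meta-learned initialisation: {meta_train():.3f}")
```

In the setting summarized above, only the meta-mapper would take the place of `w_meta`, while CLIP and GPT-2 stay frozen and receive no updates.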

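Experimental results

Research questions

RQ1: Can a data-driven meta-learner bridge frozen vision and language backbones and enable rapid multimodal adaptation from a few labeled examples?
RQ2: Does episodic, task-level meta-training improve cross-domain and in-domain multimodal few-shot performance compared with hand-engineered task induction?
RQ3: How do meta-knowledge accumulation, the meta-mapper architecture, and inner-loop optimization affect final accuracy?
RQ4: Is the approach more computationally efficient than fully trained large multimodal models?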

Key findings

Methods	Real-Name 2-way 1-shot	Real-Name 2-way 5-shot	Open-Ended 2-way 1-shot	Open-Ended 2-way 5-shot	Real-Name 5-way 1-shot	Real-Name 5-way 5-shot	Open-Ended 5-way 1-shot	Open-Ended 5-way 5-shot
Frozen w/o task induction	1.7	-	29.0	-	0.9	-	18.0	-
Frozen w/ task induction	33.7	66.0	53.4	58.9	14.5	33.8	20.2	21.3
Ours (no cross-domain)	35.6	65.7	50.2	57.5	15.2	39.6	18.9	22.0
Cross-domain (✗ ✓)	37.3	66.0	52.5	59.0	19.2	40.3	20.9	25.0
Ours w/ same-domain (✓ ✓)	45.3	69.8	53.6	63.4	24.7	41.8	24.8	28.5
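
To make the table easier to read, the short sketch below computes the per-setting gain of the same-domain variant (last row) over Frozen with task induction. The numbers are copied from the table; the variable names are illustrative, and the differences are simply in the table's own score units.

```python
# Scores copied from the table above, in the table's column order.
settings = [
    "Real-Name 2-way 1-shot", "Real-Name 2-way 5-shot",
    "Open-Ended 2-way 1-shot", "Open-Ended 2-way 5-shot",
    "Real-Name 5-way 1-shot", "Real-Name 5-way 5-shot",
    "Open-Ended 5-way 1-shot", "Open-Ended 5-way 5-shot",
]
frozen_with_induction = [33.7, 66.0, 53.4, 58.9, 14.5, 33.8, 20.2, 21.3]
ours_same_domain = [45.3, 69.8, 53.6, 63.4, 24.7, 41.8, 24.8, 28.5]

# Absolute gain per setting, rounded to the table's one-decimal precision.
gains = {
    name: round(ours - base, 1)
    for name, base, ours in zip(settings, frozen_with_induction, ours_same_domain)
}
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
print(f"mean gain: {mean_gain:.2f}")
```

On these numbers the gain is largest in the two Real-Name 1-shot settings and smallest for Open-Ended 2-way 1-shot.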

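The proposed episodic multimodal meta-learner outperforms the Frozen baseline across multiple benchmarks and settings.
Accumulating meta-knowledge in the meta-mapper matters: erasing the meta-knowledge lowers performance across datasets.
The self-attention-based meta-mapper substantially outperforms an MLP variant, which points to the importance of selective feature aggregation.
Open-ended multimodal generation with the learned visual prefix gives strong results without hand-engineered task induction.
Increasing the number of distinct shots helps more than repeating the same shots, which shows the value of diverse support examples.
Meta-training in both cross-domain and in-domain settings yields robustness and fast adaptation with only a few gradient updates.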
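
The difference between adding distinct shots and repeating shots can be made concrete with a small episode sampler. This is a sketch under assumptions: the pool, the class names, and the `sample_episode` helper are hypothetical and only illustrate how an N-way, k-shot support set might be built either way.

```python
import random


def sample_episode(pool, n_way, k_shot, rng, repeat=False):
    """Build the support set of one N-way, k-shot episode.

    pool maps a class label to its list of examples. With repeat=True,
    a single example per class is duplicated k_shot times instead of
    drawing k_shot distinct examples.
    """
    classes = rng.sample(sorted(pool), n_way)
    support = []
    for label in classes:
        if repeat:
            shots = [rng.choice(pool[label])] * k_shot
        else:
            shots = rng.sample(pool[label], k_shot)
        support.extend((example, label) for example in shots)
    return support


# Hypothetical toy pool: 5 classes with 10 image ids each.
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(5)}
rng = random.Random(0)
distinct = sample_episode(pool, n_way=2, k_shot=5, rng=rng)
repeated = sample_episode(pool, n_way=2, k_shot=5, rng=rng, repeat=True)
```

Both support sets have the same length, but `repeated` contains only one unique example per class, so it gives the learner no additional visual variety.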

This review was generated by AI and checked by a human editor.