QUICK REVIEW

[Paper Review] Language Models Can See: Plugging Visual Controls in Text Generation

Yixuan Su, Lü Tian|arXiv (Cornell University)|May 5, 2022

Multimodal Machine Learning Applications | Cited by 38

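One-sentence summary

MAGIC is a training-free decoding scheme that grounds GPT-2 text generation with CLIP-based visual controls, enabling zero-shot image captioning and visually grounded story generation while outperforming the prior state-of-the-art zero-shot captioning method with a nearly 27x decoding speedup.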

ABSTRACT

Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be semantically related to a given image while being coherent to the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation, therefore being computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.

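Motivation and objectives

Investigate how language model generation can be guided by modalities beyond text, images in particular.
Propose MAGIC, a training-free decoding framework that grounds text generation in visual content.
Demonstrate zero-shot performance on image captioning and visually grounded storytelling.
Show that MAGIC outperforms the baselines and decodes markedly faster than gradient-based methods.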

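Proposed method

Introduce MAGIC Search, a decoding scheme that adds a CLIP-induced magic score to guide token selection.
Define the magic score as a CLIP-based image-text similarity distribution over the top-k candidate tokens (Eq. 5).
Combine the magic score with model confidence and a degeneration penalty in the token-selection objective (Eq. 4).
To calibrate its representations, fine-tune GPT-2 on a task-specific text corpus with a contrastive objective added to the MLE objective (L_MLE + L_CL).
Require no gradient updates during decoding, which keeps zero-shot multimodal generation efficient.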
Remain compatible, in principle, with any text generation task that can incorporate visual grounding.
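
The token-selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights `alpha` and `beta`, their default values, the helper names, and all toy inputs are hypothetical, and plain lists stand in for the GPT-2 probabilities, GPT-2 hidden states, and CLIP similarities that a real system would compute. Each top-k candidate is scored as model confidence, minus a degeneration penalty (its maximum cosine similarity to earlier token representations), plus a magic score (a softmax of image-text similarities over the candidates).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def magic_search_step(lm_probs, cand_hidden, prev_hidden, clip_sims,
                      alpha=0.1, beta=2.0):
    """Pick one token among the top-k candidates (hypothetical sketch).

    lm_probs:    LM probability of each candidate (model confidence)
    cand_hidden: hidden state the LM would assign to each candidate
    prev_hidden: hidden states of the tokens generated so far
    clip_sims:   image-text similarity of [prefix + candidate] per candidate
    alpha, beta: assumed weights for the penalty and the magic score
    """
    # Degeneration penalty: how close a candidate is to any earlier token.
    penalty = [max(cosine(h, p) for p in prev_hidden) for h in cand_hidden]
    # Magic score: similarities normalized over the candidate set.
    magic = softmax(clip_sims)
    scores = [(1 - alpha) * p - alpha * d + beta * m
              for p, d, m in zip(lm_probs, penalty, magic)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy inputs (all values made up for illustration).
lm_probs = [0.5, 0.3, 0.2]
cand_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_hidden = [[1.0, 0.0]]
clip_sims = [0.1, 2.0, 0.1]

best_with_image, _ = magic_search_step(lm_probs, cand_hidden, prev_hidden, clip_sims)
best_text_only, _ = magic_search_step(lm_probs, cand_hidden, prev_hidden, clip_sims,
                                      beta=0.0)
print(best_with_image, best_text_only)
```

With these toy numbers the image term moves the choice from candidate 0 (the most probable token under the language model alone) to candidate 1 (the one that best matches the image). The step uses only forward scores and no gradient update, which is the property the review credits for the decoding speedup.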

Experimental results

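Research questions

RQ1: Can a training-free decoding strategy effectively inject visual grounding into a pre-trained language model?
RQ2: How does CLIP-grounded decoding affect the quality and speed of zero-shot image captioning compared with gradient-based methods?
RQ3: Can MAGIC handle multimodal generation tasks beyond captioning, such as visually grounded storytelling?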

Key findings

Model	MS-COCO B@1	MS-COCO B@4	MS-COCO M	MS-COCO R-L	MS-COCO CIDEr	MS-COCO SPICE	Flickr30k B@1	Flickr30k B@4	Flickr30k M	Flickr30k R-L	Flickr30k CIDEr	Flickr30k SPICE	Speed
Supervised Approach	77.2	36.2	27.0	56.4	113.5	20.3	-	27.3	21.7	-	56.6	16.0	-
GVD	-	-	-	-	-	-	66.9	27.3	22.5	-	62.3	16.5	-
UniVLP	-	36.5	28.4	-	116.9	21.2	-	30.1	23.0	-	67.4	17.0	-
ClipCap	-	33.5	27.5	-	113.1	21.1	-	-	-	-	-	-	-
Oscar	-	36.5	30.3	-	123.7	23.1	-	-	-	-	-	-	-
LEMON	-	40.3	30.2	-	133.3	23.3	-	-	-	-	-	-	-
Weakly Supervised Approach - UIC	41.0	5.6	12.4	28.7	28.6	8.1	-	-	-	-	-	-	-
IC-SME	-	6.5	12.9	35.1	22.7	-	-	7.9	13.0	32.8	9.9	-	-
S2S-SS	49.5	6.3	14.0	34.5	31.9	8.6	-	-	-	-	-	-	-
S2S-GCC	50.4	7.6	13.5	37.3	31.8	8.4	-	-	-	-	-	-	-
Unsupervised - Top-k	33.6	2.4	8.3	25.6	3.8	1.7	34.0	2.9	9.0	24.4	3.3	2.7	69.9x
Unsupervised - Nucleus	32.6	2.3	7.8	24.8	3.1	1.4	32.6	2.4	8.1	23.4	2.5	2.4	72.5x
Unsupervised - Contrastive	39.5	3.0	10.8	30.8	7.7	2.9	37.6	4.3	9.8	25.7	8.9	4.6	1.0x
CLIPRe	39.5	4.9	11.4	29.0	13.6	5.3	38.5	5.2	11.6	27.6	10.0	5.7	-
ZeroCap	49.8	7.0	15.4	31.8	34.5	9.2	44.7	5.4	11.8	27.3	16.8	6.2	1.0x
MAGIC	56.8	12.9	17.4	39.9	49.3	11.3	44.5	6.4	13.1	31.6	20.4	7.1	26.6x

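On the MS-COCO and Flickr30k zero-shot image captioning benchmarks, MAGIC achieves state-of-the-art results among zero-shot methods on nearly every metric; the one exception in the table is Flickr30k B@1, where ZeroCap is marginally higher (44.7 vs. 44.5).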
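MAGIC decodes about 27 times faster (26.6x) than the gradient-based ZeroCap approach.
MAGIC is robust in cross-domain evaluation and outperforms the baselines.
MAGIC also applies to visually grounded story generation, scoring higher than the baselines in both automatic and human evaluation.
Decoding requires no training, and the task-specific fine-tuning is short and lightweight.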
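
As a quick arithmetic check on the captioning findings, the snippet below subtracts ZeroCap's scores from MAGIC's for each metric. The numbers are copied from the ZeroCap and MAGIC rows of the table above; the variable and function names are only illustrative.

```python
# Scores copied from the ZeroCap and MAGIC rows of the results table.
METRICS = ["B@1", "B@4", "M", "R-L", "CIDEr", "SPICE"]
SCORES = {
    "MS-COCO": {
        "ZeroCap": [49.8, 7.0, 15.4, 31.8, 34.5, 9.2],
        "MAGIC": [56.8, 12.9, 17.4, 39.9, 49.3, 11.3],
    },
    "Flickr30k": {
        "ZeroCap": [44.7, 5.4, 11.8, 27.3, 16.8, 6.2],
        "MAGIC": [44.5, 6.4, 13.1, 31.6, 20.4, 7.1],
    },
}


def gains(scores):
    """Return MAGIC minus ZeroCap for every (dataset, metric) pair."""
    out = {}
    for dataset, rows in scores.items():
        for name, magic, zerocap in zip(METRICS, rows["MAGIC"], rows["ZeroCap"]):
            out[(dataset, name)] = round(magic - zerocap, 1)
    return out


deltas = gains(SCORES)
# Metrics on which MAGIC does not beat ZeroCap.
not_ahead = [key for key, delta in deltas.items() if delta <= 0]

for (dataset, name), delta in deltas.items():
    print(f"{dataset:10s} {name:6s} {delta:+.1f}")
print("not ahead:", not_ahead)
```

The largest absolute gain is on MS-COCO CIDEr (+14.8), and Flickr30k B@1 is the only metric where MAGIC trails ZeroCap (-0.2).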

This review was generated by AI and checked by a human editor.