QUICK REVIEW

[Paper Review] A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

Jindong Gu, Zhen Han | arXiv (Cornell University) | Jul 24, 2023

Multimodal Machine Learning Applications | Cited by 63

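One-Sentence Summary

This paper provides a comprehensive survey of prompt engineering for vision-language foundation models, groups the methods into three categories (multimodal-to-text generation, image-text matching, and text-to-image generation), and discusses applications and responsible-AI considerations.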

ABSTRACT

Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering enables the ability to perform predictions based solely on prompts without updating model parameters, and the easier application of large pre-trained models in real-world tasks. In past years, Prompt engineering has been well-studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, there is currently a lack of a systematic overview of prompt engineering on pre-trained vision-language models. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g. Flamingo), image-text matching models (e.g. CLIP), and text-to-image generation models (e.g. Stable Diffusion). For each type of model, a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues are summarized and discussed. Furthermore, the commonalities and differences between prompting on vision-language models, language models, and vision models are also discussed. The challenges, future directions, and research opportunities are summarized to foster future research on this topic.

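Motivation and Objectives

Provide a systematic overview of prompting research on vision-language foundation models (VLMs).
Classify prompting methods into hard prompts and soft prompts across three model types: multimodal-to-text generation, image-text matching, and text-to-image generation.
Summarize prompting methods, applications, and responsible-AI considerations for each model type.
Discuss the commonalities and differences between prompting vision-language models and prompting language-only or vision-only models, and outline future research directions.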

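Proposed Approach

Classifies prompting methods into hard prompts (task instruction, in-context learning, retrieval-based prompting, chain-of-thought) and soft prompts (prompt tuning, prefix tuning).
Presents encoder-decoder and decoder-only views of the VL fusion module, and maps prompting methods onto these architectures.
Surveys prompting techniques and applications across the three VL model types, using illustrative examples and representative models (e.g., Flamingo, CLIP, Stable Diffusion).
Summarizes progress in prompting for encoder-decoder and decoder-only fusion modules, and discusses robustness, dataset effects, and training strategies.
Discusses responsible-AI considerations such as bias, robustness, and the integrity of prompted VL models.
Publishes a project page that collects related papers to support literature search.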
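
To make the hard-prompt categories more concrete, here is a minimal sketch of how a task instruction, in-context examples, and an optional chain-of-thought cue might be assembled into a single text prompt with interleaved image placeholders. The `<image>` token, the `Example` class, the `build_hard_prompt` helper, and the "Let's think step by step." cue are assumptions made for illustration. They are not taken from the survey or from any specific model's API, and real models define their own placeholder tokens and prompt formats.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder marking where an image is interleaved in the text.
IMAGE_TOKEN = "<image>"


@dataclass
class Example:
    """One in-context demonstration: an image reference, a question, an answer."""
    image_id: str
    question: str
    answer: str


def build_hard_prompt(instruction: str,
                      examples: List[Example],
                      query_question: str,
                      chain_of_thought: bool = False) -> str:
    """Assemble a hard (discrete, human-readable) prompt.

    Line 1 is the task instruction, followed by one line per in-context
    example, followed by the unanswered query. No model weights are involved.
    """
    parts = [instruction.strip()]
    for ex in examples:
        parts.append(f"{IMAGE_TOKEN} Question: {ex.question} Answer: {ex.answer}")
    # Optional chain-of-thought cue appended to the query (an assumed phrasing).
    suffix = " Let's think step by step." if chain_of_thought else ""
    parts.append(f"{IMAGE_TOKEN} Question: {query_question} Answer:{suffix}")
    return "\n".join(parts)


def image_order(examples: List[Example], query_image_id: str) -> List[str]:
    """Return image ids in the order their placeholders appear in the prompt."""
    return [ex.image_id for ex in examples] + [query_image_id]


if __name__ == "__main__":
    demos = [Example("img_1", "What animal is shown?", "a cat")]
    print(build_hard_prompt("Answer the question about each image.",
                            demos, "What color is the car?"))
    print(image_order(demos, "img_query"))
```

With an empty `examples` list the same helper produces a zero-shot prompt; adding demonstrations turns it into a few-shot (in-context learning) prompt without touching any model parameters.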

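Results

Research Questions

RQ1: What prompting methods are used with vision-language foundation models, and how do they differ across model types?
RQ2: How do hard and soft prompts affect the performance, robustness, and safety of multimodal-to-text generation, image-text matching, and text-to-image generation?
RQ3: What do prompting vision-language models and prompting unimodal language or vision models have in common, and where do they differ?
RQ4: What are the challenges and future directions for prompt design in vision-language models?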

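Key Findings

Prompting methods fall into hard prompts (task instruction, in-context learning, retrieval-based prompting, chain-of-thought) and soft prompts (prompt tuning, prefix token tuning).
Vision-language models with encoder-decoder fusion often use task-specific prompts and special tokens, whereas decoder-only fusion models guide generation with visual prefixes and prompting strategies.
Few-shot and zero-shot prompting perform strongly on VLM tasks (VQA, image captioning, grounded QA) across models such as Flamingo, Kosmos, and BLIP-2.
Freezing the language model is a common strategy for integrating visual information while preserving language ability.
Prompt tuning can offer robustness advantages over full fine-tuning, and longer prompts improve performance up to a point.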
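
The soft-prompt idea, combined with a frozen model, can be sketched on a toy problem. The snippet below is only an illustration under strong assumptions: the "model" is a hypothetical mean-pool plus fixed linear readout rather than a real VLM, the gradient is derived by hand for that toy, and names such as `frozen_model` and `tune_soft_prompt` are invented here. What it shares with prompt tuning is the core mechanism: learnable vectors are prepended to the input embeddings, and only those vectors are updated.

```python
import random
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a sequence of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def frozen_model(sequence: List[Vector], weights: Vector) -> float:
    """Toy stand-in for a frozen model: mean-pool, then a fixed linear readout."""
    pooled = mean_pool(sequence)
    return sum(w * x for w, x in zip(weights, pooled))


def tune_soft_prompt(input_embeddings: List[Vector],
                     weights: Vector,
                     target: float,
                     prompt_len: int = 4,
                     steps: int = 200,
                     lr: float = 0.5,
                     seed: int = 0) -> Tuple[List[Vector], List[float]]:
    """Learn `prompt_len` soft-prompt vectors by gradient descent.

    The model weights and the input embeddings are never modified; only the
    prepended prompt vectors change. Returns the prompt and the loss history.
    """
    rng = random.Random(seed)
    dim = len(weights)
    prompt = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
              for _ in range(prompt_len)]
    n = prompt_len + len(input_embeddings)  # total sequence length
    losses = []
    for _ in range(steps):
        pred = frozen_model(prompt + input_embeddings, weights)
        err = pred - target
        losses.append(err * err)  # squared-error loss
        # For this toy model: d(loss)/d(prompt[j][i]) = 2 * err * weights[i] / n
        for vec in prompt:
            for i in range(dim):
                vec[i] -= lr * 2.0 * err * weights[i] / n
    return prompt, losses


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]                    # frozen parameters
    inputs = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]    # fixed input embeddings
    prompt, losses = tune_soft_prompt(inputs, weights, target=1.0)
    print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```

In an actual vision-language model the gradient would come from backpropagation through the frozen network, and `prompt_len` is the knob behind the observation that longer prompts help only up to a point.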


This review was generated by AI and checked by a human editor.