QUICK REVIEW

[Paper Review] PromptCap: Prompt-Guided Task-Aware Image Captioning

Yushi Hu, Hang Hua|arXiv (Cornell University)|Nov 15, 2022

Multimodal Machine Learning Applications|Citations: 29

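One-Line Summary

PromptCap generates question-aware captions, guided by a natural-language prompt, that connect images to a black-box LM. It achieves state-of-the-art results on knowledge-based VQA without end-to-end fine-tuning of the LM.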

ABSTRACT

Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable LM to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, which visual entities to describe are often underspecified. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Different from generic captions, PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption. The prompt contains a question that the caption should aid in answering. To avoid extra annotation, PromptCap is trained by examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.

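Motivation and Objectives

Motivate the work with knowledge-based VQA, where answering a question requires world knowledge beyond the image.
Address the gap that generic captions omit visual details an LM needs to answer VQA questions.
Propose a prompt-conditioned captioning model that describes task-relevant visual content.
Develop a GPT-3-based data synthesis and filtering pipeline to train the captioner without additional annotation.
Evaluate PromptCap combined with GPT-3 on VQA and assess its generalization to new domains.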

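Proposed Method

Introduce PromptCap, which generates captions conditioned on a natural-language prompt that contains the target question.
Synthesize training data by using GPT-3 in-context learning to convert VQA question-answer pairs into prompt-guided captioning examples.
Filter the GPT-3-generated captions with a mechanism based on soft VQA accuracy to select high-quality training samples.
Fine-tune an OFA-based captioning model to generate prompt-guided captions from an image and a prompt.
Feed PromptCap captions into a PICa-style pipeline in which GPT-3 performs VQA via in-context learning (no LM fine-tuning).
Improve GPT-3 in-context learning by using CLIP-based similarity to select in-context examples that resemble each test instance.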
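
The soft-VQA-accuracy filtering step can be illustrated with a minimal Python sketch. It uses the commonly cited simplified form of the VQA metric, min(number of annotators who gave the answer / 3, 1). The function names, the `answer_fn` callback (a stand-in for asking an LM to answer the question from a caption alone), and the default threshold are all hypothetical choices for illustration. The paper's exact filtering rule is not given in this review.

```python
from collections import Counter
from typing import Callable, List, Sequence


def normalize(answer: str) -> str:
    """Lowercase and trim; real VQA evaluation applies more normalization."""
    return answer.strip().lower()


def soft_vqa_accuracy(predicted: str, human_answers: Sequence[str]) -> float:
    """Simplified soft VQA score: min(#annotators giving this answer / 3, 1)."""
    counts = Counter(normalize(a) for a in human_answers)
    return min(counts[normalize(predicted)] / 3.0, 1.0)


def filter_captions(
    candidates: Sequence[str],
    question: str,
    human_answers: Sequence[str],
    answer_fn: Callable[[str, str], str],  # hypothetical: (caption, question) -> answer
    threshold: float = 1.0,                # assumed cutoff, not from the paper
) -> List[str]:
    """Keep captions from which the answerer recovers a human-agreed answer."""
    kept = []
    for caption in candidates:
        predicted = answer_fn(caption, question)
        if soft_vqa_accuracy(predicted, human_answers) >= threshold:
            kept.append(caption)
    return kept
```

In this sketch a caption survives only if the answer derived from it matches what enough annotators said, so captions that omit the question-relevant detail are dropped before captioner training.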

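Experimental Results

Research Questions

RQ1: Do question-aware, prompt-guided captions improve a black-box LM's ability to perform knowledge-based VQA?
RQ2: Are prompt-guided captions better than generic captions at enabling GPT-3 to answer VQA questions?
RQ3: How well does PromptCap generalize to new domains (e.g., WebQA) without task-specific fine-tuning?
RQ4: How much do GPT-3, prompt design, and in-context example selection each contribute to overall VQA performance?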

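Key Findings

Combined with GPT-3, PromptCap captions achieve state-of-the-art results: 60.4% accuracy on OK-VQA, and on A-OKVQA 59.6% on direct-answer and 73.1% on multiple-choice questions.
PromptCap improves over generic OFA captions (OFA-Cap) by 3.8% on OK-VQA, 5.3% on A-OKVQA, and 9.2% on VQAv2.
GPT-3 provides larger gains than alternative LMs (e.g., Flan-T5-XXL) on knowledge-based VQA, while its gains on standard VQAv2 are smaller.
PromptCap generalizes to WebQA with 8-shot in-context learning, outperforming baselines that use oracle sources for image queries.
Retrieving similar in-context examples with CLIP further improves GPT-3's VQA performance.
Qualitative analysis shows that PromptCap surfaces question-relevant details (e.g., brand, color) that lead GPT-3 to the correct answer, whereas generic captions often fail to do so.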
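
The similarity-based retrieval of in-context examples mentioned above can be sketched as a top-k cosine-similarity search. This is a minimal illustration that assumes embeddings have already been computed (for example, by CLIP) and are passed in as plain lists of floats. The function names and the `(example_id, embedding)` pool layout are hypothetical, and how the paper builds the embeddings from the question and image is not specified in this review.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_context_examples(
    query_embedding: Sequence[float],
    pool: Sequence[Tuple[str, Sequence[float]]],  # hypothetical (example_id, embedding) pairs
    k: int = 8,
) -> List[str]:
    """Return the ids of the k pool examples most similar to the query."""
    ranked = sorted(
        pool,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [example_id for example_id, _ in ranked[:k]]
```

The selected ids would then determine which solved examples are placed in the LM prompt ahead of the test question.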


This review was generated by AI and checked by a human editor.