QUICK REVIEW

[Paper Review] An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Zhengyuan Yang, Zhe Gan|arXiv (Cornell University)|Sep 10, 2021

Multimodal Machine Learning Applications | Cited by 46

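One-line summary

This paper proposes PICa, a prompting-based method that combines GPT-3 with image captions/tags and few-shot in-context learning to perform knowledge-based VQA without fine-tuning, and it achieves state-of-the-art few-shot performance on OK-VQA.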

ABSTRACT

Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.

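Motivation and objectives

Motivate a simple approach to knowledge-based VQA that needs no fine-tuning and draws on GPT-3's implicit knowledge and reasoning ability.
Avoid the mismatch risk of explicit knowledge retrieval by using GPT-3 as an implicit KB accessed through textual image representations.
Systematically study how textual image representations, example selection, and multi-query ensembling affect few-shot VQA performance.
Show the potential and limits of GPT-3 on multimodal tasks through careful ablations and qualitative analysis.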

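Proposed method

Convert the image into a textual description (captions or tags) and feed it to GPT-3.
Prompt GPT-3 in a few-shot setting with a prompt head and a small number of in-context VQA examples.
Improve performance by carefully choosing the image representation (for example, captions vs. tags) and by using a multi-query ensemble.
Select in-context examples by similarity (CLIP/RoBERTa), picking the questions/images most relevant to the test input, and, where needed, run a multi-query ensemble that combines multiple answers.
Avoid fine-tuning by relying on GPT-3's few-shot ability and open-ended text generation to produce answers.
Provide ablations to understand the effect of the text representation, in-context example selection, and the multi-query ensemble.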
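
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt head wording, the `Context:` / `Q:` / `A:` layout, the `===` separator, the two-dimensional toy embeddings (standing in for CLIP/RoBERTa features), and the `toy_lm` function (standing in for a GPT-3 call that returns an answer with a log-probability) are all hypothetical. Picking the highest-scoring answer is one assumed way to combine the answers from multiple queries.

```python
import math

# Assumed prompt head wording; the review only says that a "prompt head" is used.
PROMPT_HEAD = "Please answer the question according to the above context."

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_examples(query_vec, pool, n):
    """Pick the n pool examples whose embeddings are most similar to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:n]

def build_prompt(examples, context, question):
    """Prompt head, then solved examples, then the unanswered test question."""
    blocks = [PROMPT_HEAD]
    for ex in examples:
        blocks.append(f"Context: {ex['context']}\nQ: {ex['question']}\nA: {ex['answer']}")
    blocks.append(f"Context: {context}\nQ: {question}\nA:")
    return "\n===\n".join(blocks)

def multi_query_answer(lm, selected, n, context, question):
    """Query once per group of n examples and keep the highest-scoring answer."""
    groups = [selected[i:i + n] for i in range(0, len(selected), n)]
    scored = [lm(build_prompt(g, context, question)) for g in groups]
    return max(scored, key=lambda pair: pair[1])[0]

# Toy example pool; "vec" is a made-up embedding, "context" a made-up caption.
pool = [
    {"vec": [1.0, 0.0], "context": "A dog runs on grass.",
     "question": "What animal is this?", "answer": "dog"},
    {"vec": [0.0, 1.0], "context": "A red bus on a street.",
     "question": "What color is the bus?", "answer": "red"},
    {"vec": [0.9, 0.1], "context": "A cat sleeps on a sofa.",
     "question": "What animal is this?", "answer": "cat"},
]

def toy_lm(prompt):
    """Hypothetical stand-in for a language model: returns (answer, log-probability)."""
    return ("horse", -0.2) if "A dog" in prompt else ("bus", -1.5)

test_context = "A horse stands in a field."
test_question = "What animal is this?"
selected = select_examples([1.0, 0.0], pool, 2)
prompt = build_prompt(selected[:1], test_context, test_question)
answer = multi_query_answer(toy_lm, selected, 1, test_context, test_question)
print(prompt)
print(answer)
```

Here two one-shot prompts are built from the two most similar examples, and the answer with the higher log-probability is kept. In a real setup the embeddings would come from a text or image encoder and `toy_lm` would be replaced by an actual language model call.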

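Experimental results

Research questions

RQ1: Given a textual description of the image, can GPT-3 serve as an implicit, unstructured knowledge base for visually grounded reasoning?
RQ2: How does the textual image representation (captions, tags, or both) affect few-shot knowledge-based VQA performance?
RQ3: Do in-context example selection and multi-query ensembling meaningfully improve GPT-3-based VQA in the few-shot regime?
RQ4: What are the limitations of GPT-3-based few-shot VQA on standard benchmarks such as OK-VQA and VQAv2?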

Key findings

Method	Image representation	n=0	n=1	n=4	n=8	n=16	Example engineering
Frozen (Tsimpoukelli et al. 2021)	Feature Emb.	5.9	9.7	12.6	-	-	✗
PICa-Base	Caption	17.5	32.4	37.6	39.6	42.0	✗
PICa-Base	Caption+Tags	16.4	34.0	39.7	41.8	43.3	✗
PICa-Full	Caption	17.7	40.3	44.8	46.1	46.9	✓
PICa-Full	Caption+Tags	17.1	40.8	45.4	46.8	48.0	✓

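With 16 in-context examples and caption+tags input, PICa surpasses the supervised state of the art on OK-VQA (48.0%).
With captions only, PICa-Full reaches 46.9% on OK-VQA at 16 shots, still above prior supervised methods.
Richer textual image representations (captions, tags, or both) substantially outperform the question-only blind baseline.
In-context example selection and the multi-query ensemble consistently improve OK-VQA performance, rising to close to 49.1% in the oracle case where ideal examples are selected.
On VQAv2, PICa-Full with caption+tags reaches 56.1% in the few-shot setting, far ahead of Frozen and earlier baselines but still short of the 73.8% of supervised Oscar.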
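
As a quick check on how much the example engineering contributes, the gap between PICa-Full and PICa-Base can be computed directly from the Caption+Tags rows of the table above. The snippet only restates the table's numbers; the variable names are illustrative.

```python
# Values copied from the Caption+Tags rows of the table above.
shots = [0, 1, 4, 8, 16]
pica_base = [16.4, 34.0, 39.7, 41.8, 43.3]
pica_full = [17.1, 40.8, 45.4, 46.8, 48.0]

# Absolute gain (in points) of PICa-Full over PICa-Base per shot count.
gains = {n: round(f - b, 1) for n, b, f in zip(shots, pica_base, pica_full)}

for n, gain in gains.items():
    print(f"n={n}: +{gain} points")
```

By these numbers the gain is small with no examples (+0.7) and largest at one shot (+6.8), then narrows to +4.7 at 16 shots.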


This review was generated by AI and checked by a human editor.