QUICK REVIEW

[Paper Review] REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Yuanze Lin, Yujia Xie | arXiv (Cornell University) | Jun 2, 2022

Multimodal Machine Learning Applications | Cited by 44

One-line summary

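REVIVE strengthens knowledge-based VQA by using region-based visual representations in both knowledge retrieval and answer generation, and it achieves state-of-the-art performance on OK-VQA. It combines object-centric regions, explicit and implicit knowledge, and a FiD-based (Fusion-in-Decoder) encoder-decoder to fuse the modalities.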

ABSTRACT

This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that using regional information in a better way can significantly improve the performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA even though these two tasks share the common spirit, i.e., rely on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual features are extracted either from the whole image or in a sliding window manner for retrieving knowledge, and the important relationship within/among object regions is neglected; 2) visual features are not well utilized in the final answering model, which is counter-intuitive to some extent. Based on these observations, we propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions not only in the knowledge retrieval stage but also in the answering model. The key motivation is that object regions and inherent relationship are important for knowledge-based VQA. We perform extensive experiments on the standard OK-VQA dataset and achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin (+3.6%). We also conduct detailed analysis and show the necessity of regional information in different framework components for knowledge-based VQA. Code is publicly available at https://github.com/yzleroy/REVIVE.

Motivation and objectives

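Motivate better visual representations for knowledge-based VQA by emphasizing object-centric regional information.
Systematically study how region-based features affect knowledge retrieval and final answer generation.
Propose REVIVE, which integrates regional features, explicit and implicit knowledge, and a transformer-based answering model.
Demonstrate state-of-the-art performance on the OK-VQA dataset and analyze the contribution of each component.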

Proposed method

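Detect object regions with GLIP and extract region-based visual features.
Describe the regions with the top-ranked regional tags from CLIP, and generate a context caption with a captioning model (VinVL).
Retrieve explicit knowledge from Wikidata using the region-based text descriptions and CLIP-based matching.
Query GPT-3 with region-aware prompts to obtain implicit knowledge and explanations.
Encode the explicit and implicit knowledge, the regional visual features, and the context-aware question with FiD, then decode the answer, so that the regional features and retrieved knowledge are fused in a single encoder-decoder.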
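
The retrieval and fusion steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for CLIP embeddings, the max-over-regions scoring is only one plausible way to aggregate region-to-entry similarity, and the names `retrieve_top_k` and `build_fid_inputs` as well as the `question: / context: / knowledge:` string layout are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_top_k(region_embs, entry_embs, entries, k=2):
    """Rank knowledge entries by their best cosine similarity to any region.

    Assumption: an entry's score is its maximum similarity over all regions.
    """
    sims = l2_normalize(region_embs) @ l2_normalize(entry_embs).T  # (regions, entries)
    best_per_entry = sims.max(axis=0)
    order = np.argsort(-best_per_entry)[:k]
    return [entries[i] for i in order]


def build_fid_inputs(question, context, knowledge_entries):
    """Pair the question and context with each knowledge entry.

    A FiD-style model encodes every string independently and lets the decoder
    attend over the concatenated encoder outputs. The string layout here is
    a hypothetical format, not the one used in the paper.
    """
    return [
        f"question: {question} context: {context} knowledge: {entry}"
        for entry in knowledge_entries
    ]


# Toy stand-ins for CLIP text embeddings of three knowledge entries.
entries = [
    "cat: small domesticated mammal",
    "frisbee: gliding toy",
    "bus: road vehicle",
]
entry_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Toy stand-ins for CLIP image embeddings of two detected regions.
region_embs = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]])

top_entries = retrieve_top_k(region_embs, entry_embs, entries, k=2)
fid_inputs = build_fid_inputs(
    "What is the animal chasing?", "a cat jumping for a frisbee", top_entries
)
for text in fid_inputs:
    print(text)
```

In a real system the embeddings would come from the CLIP image and text encoders, and the regional visual features would be passed to the encoder alongside these text inputs rather than only as strings.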

Experimental results

Research questions

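RQ1: Do region-based visual representations improve knowledge-based VQA performance compared with whole-image or sliding-window approaches?
RQ2: How does explicit and implicit knowledge retrieved with regional information contribute to answer accuracy?
RQ3: How do the regional tags, the number of regions, and the position coordinates affect model performance?
RQ4: Can a FiD-based architecture effectively integrate region-level visual features with external knowledge for answer generation?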

Key findings

Method	Knowledge Resources	Accuracy (%)
Q only	-	14.9
MLP	-	20.7
BAN	-	25.1
BAN+AN	Wikipedia	25.6
MUTAN	-	26.4
BAN+KG-AUG	Wikipedia+ConceptNet	26.7
MUTAN+AN	Wikipedia	27.8
ConceptBERT	ConceptNet	33.7
KRISP	Wikipedia+ConceptNet	38.4
Visual Retriever-Reader	Google Search	39.2
MAVEx	Wikipedia+ConceptNet+Google Images	39.4
PICa-Base	Frozen GPT-3 (175B)	43.3
PICa-Full	Frozen GPT-3 (175B)	48.0
KAT (Single)	Wikidata+Frozen GPT-3 (175B)	53.1
KAT (Ensemble)	Wikidata+Frozen GPT-3 (175B)	54.4
REVIVE (Single)	Wikidata+Frozen GPT-3 (175B)	56.6
REVIVE (Ensemble)	Wikidata+Frozen GPT-3 (175B)	58.0

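The REVIVE ensemble reaches 58.0% accuracy on OK-VQA, surpassing the previous state of the art (the KAT ensemble at 54.4%).
The single REVIVE model reaches 56.6% accuracy, surpassing earlier single-model baselines (e.g., single-model KAT at 53.1%).
Region-based knowledge retrieval outperforms image-based and sliding-window retrieval by several tenths of a percentage point.
In the ablation studies, performance peaks at 30 regional tags and 36 region proposals.
Incorporating position coordinates and region-centric descriptions improves accuracy consistently across components.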
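
As a quick check on the reported margins, the gains over KAT can be recomputed from the accuracies in the results table. This is only an arithmetic sketch; the helper name `margin` is introduced here for illustration.

```python
# Accuracies (%) copied from the results table in this review.
accuracy = {
    "KAT (Single)": 53.1,
    "KAT (Ensemble)": 54.4,
    "REVIVE (Single)": 56.6,
    "REVIVE (Ensemble)": 58.0,
}


def margin(new, old):
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round(accuracy[new] - accuracy[old], 1)


single_gain = margin("REVIVE (Single)", "KAT (Single)")
ensemble_gain = margin("REVIVE (Ensemble)", "KAT (Ensemble)")
print(f"single: +{single_gain}, ensemble: +{ensemble_gain}")
```

The ensemble gain matches the +3.6% margin quoted in the abstract, and the single-model gain is nearly the same size.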

This review was generated by AI and checked by a human editor.