QUICK REVIEW

[Paper Review] REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Yuanze Lin, Yujia Xie|arXiv (Cornell University)|Jun 2, 2022

Multimodal Machine Learning Applications | Cited by 44

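One-Sentence Summary

REVIVE strengthens knowledge-based VQA by using region-based visual representations in both knowledge retrieval and answer generation, achieving state-of-the-art performance on OK-VQA. It combines object-centric regions, explicit and implicit knowledge, and a FiD-based encoder-decoder that fuses the modalities.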

ABSTRACT

This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that using regional information in a better way can significantly improve the performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA even though these two tasks share the common spirit, i.e., rely on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual features are extracted either from the whole image or in a sliding window manner for retrieving knowledge, and the important relationship within/among object regions is neglected; 2) visual features are not well utilized in the final answering model, which is counter-intuitive to some extent. Based on these observations, we propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions not only in the knowledge retrieval stage but also in the answering model. The key motivation is that object regions and inherent relationship are important for knowledge-based VQA. We perform extensive experiments on the standard OK-VQA dataset and achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin (+3.6%). We also conduct detailed analysis and show the necessity of regional information in different framework components for knowledge-based VQA. Code is publicly available at https://github.com/yzleroy/REVIVE.

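Motivation and Objectives

Improve visual representation for knowledge-based VQA by emphasizing object-centric regional information.
Systematically study how regional features affect knowledge retrieval and final answer generation.
Propose REVIVE, which integrates regional features, explicit and implicit knowledge, and a transformer-based answering model.
Demonstrate state-of-the-art performance on the OK-VQA dataset and analyze the contribution of each component.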

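Proposed Method

Detect object regions with GLIP and extract region-level visual features.
Use CLIP to select the top-ranked tags describing the regions, and use a captioning model (VinVL) to generate context.
Retrieve explicit knowledge from Wikidata using the regional text descriptions and CLIP-based matching.
Query GPT-3 with region-aware prompts to obtain implicit knowledge and explanations.
Encode the explicit and implicit knowledge, regional visual features, and context-aware question with FiD, then decode the answer.
Within the FiD-based encoder-decoder, fuse the regional features with the retrieved knowledge to generate the answer.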
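The retrieval and fusion steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-dimensional vectors are toy stand-ins for CLIP embeddings, the knowledge strings are invented examples rather than Wikidata entries, and the function names (`retrieve_for_regions`, `build_fid_inputs`) and the input string format are hypothetical. The sketch shows only the general idea of scoring knowledge entries per region by cosine similarity, then pairing each retrieved entry with the question so that a FiD-style encoder could process each pair independently.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_for_regions(
    region_embs: List[Vector], knowledge: Dict[str, Vector], top_k: int = 1
) -> List[str]:
    """Keep the top_k most similar entries for each region, then merge them.

    Entries are deduplicated and ordered by their best score over all regions.
    """
    best: Dict[str, float] = {}
    for region in region_embs:
        scored = sorted(
            ((cosine(region, emb), text) for text, emb in knowledge.items()),
            reverse=True,
        )
        for score, text in scored[:top_k]:
            best[text] = max(score, best.get(text, -1.0))
    return sorted(best, key=lambda t: best[t], reverse=True)


def build_fid_inputs(question: str, context: str, entries: List[str]) -> List[str]:
    """Build one encoder input per knowledge entry (hypothetical string format)."""
    return [
        f"question: {question} context: {context} knowledge: {entry}"
        for entry in entries
    ]


# Toy embeddings: two detected regions and three candidate knowledge entries.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
knowledge = {
    "fire hydrant: connection point for firefighters": [0.9, 0.1, 0.0],
    "dog: domesticated animal": [0.2, 0.9, 0.0],
    "boat: watercraft": [0.0, 0.0, 1.0],
}

entries = retrieve_for_regions(regions, knowledge, top_k=1)
inputs = build_fid_inputs(
    "What is the red object used for?", "a dog next to a hydrant", entries
)
for line in inputs:
    print(line)
```

Because each region retrieves its own entries, knowledge tied to a small object is less likely to be crowded out by the dominant content of the whole image, which is the motivation the review gives for region-based retrieval.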

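Experimental Results

Research Questions

RQ1: Do region-based visual representations outperform whole-image or sliding-window approaches on knowledge-based VQA?
RQ2: How do explicit and implicit knowledge retrieved with regional information improve answer accuracy?
RQ3: How do region tags, the number of regions, and positional coordinates affect model performance?
RQ4: Can a FiD-based architecture effectively integrate region-level visual features with external knowledge for answer generation?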

Key Findings

Method	Knowledge Resources	Accuracy (%)
Q-only	-	14.9
MLP	-	20.7
BAN	-	25.1
BAN+AN	Wikipedia	25.6
MUTAN	-	26.4
BAN+KG-AUG	Wikipedia+ConceptNet	26.7
MUTAN+AN	Wikipedia	27.8
ConceptBERT	ConceptNet	33.7
KRISP	Wikipedia + ConceptNet	38.4
Visual Retriever-Reader	Google Search	39.2
MAVEx	Wikipedia+ConceptNet+Google Images	39.4
PICa-Base	Frozen GPT-3 (175B)	43.3
PICa-Full	Frozen GPT-3 (175B)	48.0
KAT (Single)	Wikidata+Frozen GPT-3 (175B)	53.1
KAT (Ensemble)	Wikidata+Frozen GPT-3 (175B)	54.4
REVIVE (Single)	Wikidata+Frozen GPT-3 (175B)	56.6
REVIVE (Ensemble)	Wikidata+Frozen GPT-3 (175B)	58.0

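With ensembling, REVIVE reaches 58.0% accuracy on OK-VQA, surpassing the previous state of the art (KAT ensemble, 54.4%).
The single-model REVIVE reaches 56.6% accuracy, exceeding previous single-model baselines (e.g., KAT single, 53.1%).
Region-based knowledge retrieval outperforms image-based and sliding-window retrieval by margins ranging from a few tenths of a percentage point to more than one point.
In the ablations, performance peaks at 30 region tags and 36 region proposals.
Embedding positional coordinates and region-centric descriptions consistently improves accuracy across components.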
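As a quick check on the headline margins, the short sketch below recomputes the gains over KAT from the accuracy values in the results table. The dictionary and the helper name `margin` are illustrative only; the numbers are copied from the table.

```python
# Accuracy values (%) on OK-VQA, copied from the results table.
accuracy = {
    "KAT (Single)": 53.1,
    "KAT (Ensemble)": 54.4,
    "REVIVE (Single)": 56.6,
    "REVIVE (Ensemble)": 58.0,
}


def margin(new: str, old: str) -> float:
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round(accuracy[new] - accuracy[old], 1)


ensemble_gain = margin("REVIVE (Ensemble)", "KAT (Ensemble)")
single_gain = margin("REVIVE (Single)", "KAT (Single)")

print(f"Ensemble gain over KAT: +{ensemble_gain} points")
print(f"Single-model gain over KAT: +{single_gain} points")
```

The ensemble gain matches the +3.6% margin reported in the abstract, and the single-model gain is nearly the same size at 3.5 points.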


This review was generated by AI and reviewed by human editors.