QUICK REVIEW

[Paper Review] Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources

Qi Wu, Peng Wang|arXiv (Cornell University)|Nov 22, 2015

Multimodal Machine Learning Applications | References: 32 | Cited by: 47

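One-Sentence Summary

This paper proposes a visual question answering (VQA) model that combines deep learning with an external knowledge base to answer complex, free-form questions about images, even when the answer requires knowledge not present in the image. By fusing image captions, detected attributes, and knowledge-base query results using Doc2Vec and an LSTM, the model reaches 69.73% accuracy on Toronto COCO-QA and 59.44% on the VQA evaluation server, the best results reported at the time.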

ABSTRACT

We propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. This allows more complex questions to be answered using the predominant neural network-based approach than has previously been possible. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain the whole answer. The method constructs a textual representation of the semantic content of an image, and merges it with textual information sourced from a knowledge base, to develop a deeper understanding of the scene viewed. Priming a recurrent neural network with this combined information, and the submitted question, leads to a very flexible visual question answering approach. We are specifically able to answer questions posed in natural language, that refer to information not contained in the image. We demonstrate the effectiveness of our model on two publicly available datasets, Toronto COCO-QA and MS COCO-VQA and show that it produces the best reported results in both cases.

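Motivation and Objectives

Enable VQA systems to answer complex, open-ended questions that require knowledge beyond the content of the image itself.
Integrate external knowledge from a general-purpose knowledge base (such as DBpedia) into a neural-network VQA framework.
Improve performance on questions that require common sense or world knowledge, such as "why" and "where" questions.
Develop a generalizable, end-to-end trainable architecture that fuses visual, textual, and knowledge-based representations.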

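Proposed Method

Use a CNN to extract high-level image attributes (such as objects, scenes, and actions) from the input image.
Use a state-of-the-art image captioning model to generate multiple descriptive captions from the detected attributes.
For the top 5 attributes, generate SPARQL queries to retrieve related text from an RDF-based knowledge base (such as DBpedia).
Use Doc2Vec to encode the text retrieved from the knowledge base as a fixed-length vector.
Concatenate the image attributes, the generated captions, and the Doc2Vec-encoded knowledge-base content, and feed the result into an LSTM that generates the final answer.
Train the whole model end to end to maximize the likelihood of the ground-truth answers in the training set.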
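
The knowledge-retrieval and fusion steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names (top_k_attributes, build_sparql_query, fuse_features), the attribute scores, the vector sizes, and the exact shape of the SPARQL query are all assumptions made for the example. The sketch only builds query strings and does not contact DBpedia; the CNN, captioning model, Doc2Vec encoder, and LSTM are replaced by placeholder vectors.

```python
# Hypothetical sketch of attribute selection, SPARQL query construction,
# and feature concatenation. All names and values here are illustrative.

def top_k_attributes(scores, k=5):
    """Return the k attribute names with the highest predicted scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def build_sparql_query(attribute):
    """Build a query string asking for the English comment text of an entry.

    The query shape is an assumption; the paper's exact query may differ.
    """
    label = attribute.strip().capitalize()
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?comment WHERE {\n"
        f'  ?entry rdfs:label "{label}"@en .\n'
        "  ?entry rdfs:comment ?comment .\n"
        "}"
    )


def fuse_features(attribute_vec, caption_vec, knowledge_vec):
    """Concatenate the three representations into one input vector."""
    return list(attribute_vec) + list(caption_vec) + list(knowledge_vec)


# Made-up attribute scores standing in for the CNN's predictions.
attribute_scores = {
    "dog": 0.95, "grass": 0.80, "frisbee": 0.70, "park": 0.60,
    "tree": 0.50, "sky": 0.20, "car": 0.05,
}

top5 = top_k_attributes(attribute_scores)
queries = [build_sparql_query(name) for name in top5]

# Placeholder vectors; the lengths are chosen only for illustration.
attribute_vec = list(attribute_scores.values())  # attribute scores
caption_vec = [0.0] * 512                        # stand-in caption encoding
knowledge_vec = [0.0] * 500                      # stand-in Doc2Vec encoding

fused = fuse_features(attribute_vec, caption_vec, knowledge_vec)
print(top5)
print(queries[0])
print(len(fused))
```

In a real system the fused vector would initialize or feed the answer-generating LSTM together with the question; here it only shows that the three sources end up in one fixed-length input.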

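Experimental Results

Research Questions

RQ1: Can an external knowledge base improve a VQA system's performance on questions that require world knowledge beyond the visual content?
RQ2: How effectively can a neural network integrate visual attributes, image captions, and external knowledge-base information to answer open-ended questions?
RQ3: Does combining multiple knowledge sources (attributes, captions, knowledge base) significantly outperform models that use only visual or only textual features?
RQ4: Can a general-purpose knowledge base such as DBpedia support the VQA task effectively without building a dataset-specific knowledge base?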

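Key Findings

The proposed model reaches a state-of-the-art accuracy of 69.73% on the Toronto COCO-QA dataset, well above the previous best of 55.92%.
On the VQA evaluation server (test-standard), the model reaches an overall accuracy of 59.44%, exceeding all previously reported results.
For "why" questions, which require external commonsense knowledge, adding the knowledge base improves accuracy by more than 50% in relative terms (from 7.77% to 13.53%), with the full A+C+K-LSTM model performing better still.
The model performs well across all question categories, with the largest gains on "why" and "where" questions that depend on external knowledge.
The A+C+K-LSTM model (attributes, captions, and knowledge base) consistently outperforms models that use only image and question features or only image and caption features.
On the VQA test-dev set, the model reaches an overall accuracy of 59.17%, with 81.01% on yes/no questions and 45.23% on the "other" category, which the review reads as evidence of strong generalization.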
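
To make the "more than 50%" claim for "why" questions concrete, the short calculation below separates the absolute gain (in percentage points) from the relative gain, using only the two accuracies quoted above. The function name relative_gain is an illustrative choice.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


why_without_kb = 7.77   # accuracy (%) on "why" questions without the knowledge base
why_with_kb = 13.53     # accuracy (%) on "why" questions with the knowledge base

absolute_points = why_with_kb - why_without_kb
relative_percent = relative_gain(why_without_kb, why_with_kb)

print(f"absolute gain: {absolute_points:.2f} points")  # absolute gain: 5.76 points
print(f"relative gain: {relative_percent:.1f}%")       # relative gain: 74.1%
```

The gain is about 5.76 percentage points in absolute terms, which corresponds to roughly a 74% relative improvement, consistent with the "more than 50%" figure.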


This review was generated by AI and reviewed by human editors.