QUICK REVIEW

[Paper Review] Compositional Memory for Visual Question Answering

Aiwen Jiang, Fang Wang|arXiv (Cornell University)|Nov 18, 2015

Multimodal Machine Learning Applications | References: 14 | Cited by: 37

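One-Sentence Summary

This paper proposes a compositional memory mechanism that dynamically fuses local visual features with sequential language features inside a Long Short-Term Memory (LSTM) framework for visual question answering (VQA). An attention mechanism models the temporal interaction between question words and image patches, producing 'episodes' that represent how the visual-language interaction evolves. The method improves on the previous state of the art by 6% on the DAQUAR dataset and performs strongly on MSCOCO-VQA.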

ABSTRACT

Visual Question Answering (VQA) emerges as one of the most fascinating topics in computer vision recently. Many state of the art methods naively use holistic visual features with language features into a Long Short-Term Memory (LSTM) module, neglecting the sophisticated interaction between them. This coarse modeling also blocks the possibilities of exploring finer-grained local features that contribute to the question answering dynamically over time. This paper addresses this fundamental problem by directly modeling the temporal dynamics between language and all possible local image patches. When traversing the question words sequentially, our end-to-end approach explicitly fuses the features associated to the words and the ones available at multiple local patches in an attention mechanism, and further combines the fused information to generate dynamic messages, which we call episode. We then feed the episodes to a standard question answering module together with the contextual visual information and linguistic information. Motivated by recent practices in deep learning, we use auxiliary loss functions during training to improve the performance. Our experiments on two latest public datasets suggest that our method has a superior performance. Notably, on the DAQUAR dataset we advanced the state of the art by 6%, and we also evaluated our approach on the most recent MSCOCO-VQA dataset.

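Motivation and Objectives

Address the limitation of holistic visual features in VQA, which cannot capture the fine-grained, region-specific information that accurate answers depend on.
Model the dynamic, sequential interaction between language and local visual features while the question is being processed.
Explicitly represent the evolving visual-language evidence through a learnable memory mechanism to improve reasoning in VQA.
Show that fusing local features can outperform models based on holistic features or on language alone.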

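Proposed Method

The model processes the question words in order with an LSTM, maintaining a hidden state that evolves over time.
At each word, an attention mechanism reweights the local image patches (taken from CNN features) by their relevance to the current word.
The attention-weighted visual features are fused with the current word embedding to produce a dynamic 'episode': a memory state that encodes the language-vision interaction at that time step.
The episodes are aggregated and combined with contextual visual and linguistic features to produce the final answer prediction.
The model is trained end to end with auxiliary loss functions to sharpen attention and reasoning.
Local image patches are taken from the last convolutional layer of the CNN, without relying on object proposals, to ensure dense spatial coverage.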
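
The per-word attention and fusion steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dot-product relevance score, the element-wise `tanh` fusion, and the function names `attend`, `make_episode`, and `episodes_for_question` are all assumptions chosen for illustration, and the LSTM, the learned parameters, and the auxiliary losses are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(word_vec, patch_feats):
    """Weight each local patch by its relevance to the current word.

    Assumption: relevance is a plain dot product; the paper's learned
    scoring function is not reproduced here.
    """
    weights = softmax([dot(word_vec, p) for p in patch_feats])
    dim = len(patch_feats[0])
    attended = [
        sum(w * p[d] for w, p in zip(weights, patch_feats)) for d in range(dim)
    ]
    return weights, attended


def make_episode(word_vec, attended):
    """Fuse the word vector with the attended visual feature.

    Assumption: element-wise product followed by tanh stands in for
    the learned fusion.
    """
    return [math.tanh(w * v) for w, v in zip(word_vec, attended)]


def episodes_for_question(word_vecs, patch_feats):
    """Produce one episode per question word, in reading order."""
    episodes = []
    for word_vec in word_vecs:
        _, attended = attend(word_vec, patch_feats)
        episodes.append(make_episode(word_vec, attended))
    return episodes


# Toy data: three 2-d patch features and a two-word question.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [[2.0, 0.0], [0.0, 2.0]]
episodes = episodes_for_question(question, patches)
```

In this toy setup the first word aligns most strongly with the first patch, so that patch receives the largest attention weight. A full model would pass the resulting episodes, together with the contextual visual and linguistic features, to the answer module.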

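Experimental Results

Research Questions

RQ1: Does modeling the dynamic, sequential interaction between language and local image regions improve VQA performance compared with holistic features?
RQ2: How does fusing visual and language features through an attention mechanism affect reasoning in VQA?
RQ3: How much do local visual features contribute when answering complex questions versus simple ones?
RQ4: Does the proposed compositional memory mechanism outperform models that use only language or only visual features?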

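Key Findings

On the DAQUAR dataset, the proposed method improves on the previous state of the art by 6% (absolute), setting a new state of the art.
The full model reaches 29.77 on WUPS@0.9, clearly ahead of the 'language only' (25.77) and 'episode only' (27.43) variants.
Fusing language and episode features raises WUPS@0.9 from 28.73% to 29.77%, indicating that the two are complementary.
On the MSCOCO-VQA test-dev split, the model reaches 52.62% accuracy, comparable to the state of the art despite using a larger answer vocabulary.
Accuracy drops markedly on complex question types (such as 'what' and 'how'), leaving room for improvement in reasoning over complex queries.
Ablation experiments confirm that every component (language, episodes, and their fusion) is necessary, with each contributing an incremental gain.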
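
For readers unfamiliar with the WUPS@0.9 figures above, the following is a hedged sketch of how a thresholded WUPS score is commonly computed for DAQUAR-style answer sets. The real metric uses Wu-Palmer similarity over WordNet; here a small hand-written table (`TOY_SIM`, with made-up similarity values) stands in for it so the example stays self-contained. The function names are illustrative, not taken from the paper.

```python
# Hypothetical word similarities standing in for WordNet Wu-Palmer scores.
TOY_SIM = {
    frozenset(("chair", "armchair")): 0.95,
    frozenset(("chair", "table")): 0.6,
}


def toy_sim(a, b):
    """Symmetric toy similarity: 1.0 for identical words, else table lookup."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def thresholded_sim(sim, threshold=0.9, penalty=0.1):
    """Scale similarities below the threshold down by the penalty factor."""
    return sim if sim >= threshold else penalty * sim


def wups_single(predicted, truth, sim_fn, threshold=0.9):
    """Score one question: the smaller of the two directed set matches."""
    def directed(src, dst):
        score = 1.0
        for a in src:
            score *= max(thresholded_sim(sim_fn(a, t), threshold) for t in dst)
        return score

    return min(directed(predicted, truth), directed(truth, predicted))


def wups(predictions, truths, sim_fn, threshold=0.9):
    """Average per-question score, reported as a percentage."""
    scores = [
        wups_single(p, t, sim_fn, threshold)
        for p, t in zip(predictions, truths)
    ]
    return 100.0 * sum(scores) / len(scores)


predictions = [["chair"], ["armchair"], ["table"]]
truths = [["chair"], ["chair"], ["chair"]]
score = wups(predictions, truths, toy_sim)
```

With these toy values the exact match scores 1.0, the near-synonym keeps its 0.95 because it clears the 0.9 threshold, and the poor match is scaled down to 0.06, which shows why WUPS@0.9 behaves almost like exact-match accuracy while still giving credit for close synonyms.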


This review was generated by AI and checked by a human editor.