QUICK REVIEW

[Paper Review] Multimodal Residual Learning for Visual QA

Jin-Hwa Kim, Sangwoo Lee | arXiv (Cornell University) | Jun 5, 2016

Multimodal Machine Learning Applications | References: 27 | Cited by: 209

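One-Sentence Summary

MRN extends deep residual learning to multimodal visual question answering (VQA). Using a joint residual mapping built from the element-wise product of question and visual features, it achieves what the paper reports as state-of-the-art results on the VQA Open-Ended and Multiple-Choice tasks, and it enables visualization of implicit visual attention.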

ABSTRACT

Deep neural networks continue to advance the state-of-the-art of image recognition tasks with various methods. However, applications of these methods to multimodality remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of the deep residual learning. Unlike the deep residual learning, MRN effectively learns the joint representation from vision and language information. The main idea is to use element-wise multiplication for the joint residual mappings exploiting the residual learning of the attentional models in recent studies. Various alternative models introduced by multimodality are explored based on our study. We achieve the state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the attention effect of the joint representations for each learning block using back-propagation algorithm, even though the visual features are collapsed without spatial information.

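Motivation and Objectives

Extend deep residual learning to multimodal visual question answering (VQA).
Learn a joint vision-language representation without explicit attention parameters.
Explore alternative multimodal shortcut configurations to identify an effective architecture.
Demonstrate state-of-the-art performance on the Open-Ended and Multiple-Choice tasks of the VQA dataset.
Introduce a back-propagation-based method for visualizing the attention effect of the joint residual mappings.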

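Proposed Method

Stack multiple learning blocks with a residual-style structure that accepts multimodal inputs.
Define the joint residual function as F^{(k)}(q, v) = tanh(W_q^{(k)} q) ⊙ tanh(W_2^{(k)} tanh(W_1^{(k)} v)), which fuses the question q with the visual features v.
Use an identity shortcut on the visual path, and learn a linear projection on the question path to match dimensions.
Train end-to-end with RMSProp, using precomputed visual features (VGG-19 or ResNet-152) and GRU-based question embeddings.
Evaluate on the VQA dataset (Open-Ended and Multiple-Choice) with answer vocabularies of varying size (1k/2k/3k), and analyze the effect of block depth (L) and feature choice.
Provide a visualization of the attention effect by back-propagating the difference between V and F to the input.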
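
The joint residual function and the stacking of learning blocks can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the dimensions, the random initialization, the helper names (`joint_residual`, `mrn_forward`, `init_blocks`), and the shortcut matrix `Ws` are all hypothetical, and the update `h = Ws h + F(h, v)` is one plausible reading of "linear projection on the question path, visual features fed unchanged to every block".

```python
import numpy as np

def joint_residual(q, v, Wq, W1, W2):
    """F(q, v) = tanh(Wq q) * tanh(W2 tanh(W1 v)), with * element-wise."""
    return np.tanh(Wq @ q) * np.tanh(W2 @ np.tanh(W1 @ v))

def mrn_forward(q, v, blocks):
    """Run q through stacked learning blocks; v is passed unchanged to each one."""
    h = q
    for p in blocks:
        # Linear shortcut on the question path plus the joint residual term.
        h = p["Ws"] @ h + joint_residual(h, v, p["Wq"], p["W1"], p["W2"])
    return h

def init_blocks(rng, q_dim, v_dim, hidden, num_blocks):
    """Hypothetical random initialization; sizes are illustrative only."""
    blocks, in_dim = [], q_dim
    for _ in range(num_blocks):
        blocks.append({
            "Ws": rng.normal(0, 0.1, (hidden, in_dim)),  # shortcut projection
            "Wq": rng.normal(0, 0.1, (hidden, in_dim)),  # question embedding
            "W1": rng.normal(0, 0.1, (hidden, v_dim)),   # first visual layer
            "W2": rng.normal(0, 0.1, (hidden, hidden)),  # second visual layer
        })
        in_dim = hidden  # later blocks consume the previous block's output
    return blocks

rng = np.random.default_rng(0)
q = rng.normal(size=8)    # stand-in for a GRU question embedding
v = rng.normal(size=16)   # stand-in for a precomputed CNN feature vector
blocks = init_blocks(rng, q_dim=8, v_dim=16, hidden=12, num_blocks=3)
h = mrn_forward(q, v, blocks)
print(h.shape)  # (12,)
```

In a full model, the final `h` would feed a classifier over the answer vocabulary; that step is omitted here.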

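Experimental Results

Research Questions

RQ1: Can multimodal residual learning fuse vision and language effectively without an explicit attention mechanism?
RQ2: How do the choice of shortcut and the form of the joint residual function affect VQA performance?
RQ3: How do the type of visual features (VGG-19 vs. ResNet-152) and the number of target answers affect accuracy?
RQ4: Does a deeper MRN (more learning blocks) improve VQA performance, and is there a point of diminishing returns?
RQ5: Can spatial attention effects be visualized through back-propagation even though the visual features are collapsed?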

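Key Findings

MRN achieves state-of-the-art results on both the Open-Ended and Multiple-Choice tasks of the VQA dataset (the results tables show MRN outperforming several baselines).
On Open-Ended, with ResNet-152 features and 2k answers, MRN reaches 61.84 (All), 82.39 (Y/N), 38.23 (Num), and 49.41 (Other).
On Multiple-Choice, MRN reaches 66.33 (All), 82.41 (Y/N), 39.57 (Num), and 58.40 (Other).
Adding learning blocks raises Open-Ended accuracy to 60.53 at L=3 (three blocks); accuracy drops slightly at L=4.
ResNet-152 visual features substantially improve performance on both the Open-Ended and Multiple-Choice tasks, especially in the Other category.
MRN acts as an implicit attention model with no explicit attention parameters, and it provides a visualization method that shows attention effects through back-propagated gradients.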
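
The back-propagation idea behind the visualization can be illustrated on a single learning block. The sketch below is an assumption-laden toy, not the paper's procedure: it takes V = tanh(W2 tanh(W1 v)) as the visual-only term and F = tanh(Wq q) ⊙ V as the joint term, uses the hypothetical scalar objective 0.5·||F − V||² to summarize their difference, and back-propagates only as far as the feature vector v. Producing an image-space map would require continuing the backward pass through the CNN to the pixels, which is not shown. All names and sizes are illustrative.

```python
import numpy as np

def attention_effect_grad(q, v, Wq, W1, W2):
    """Gradient of 0.5 * ||F - V||^2 with respect to the visual input v."""
    # Forward pass of one block.
    h1 = np.tanh(W1 @ v)
    V = np.tanh(W2 @ h1)         # visual-only embedding
    g = np.tanh(Wq @ q)          # question-dependent gate
    F = g * V                    # joint residual mapping
    d = F - V                    # difference to be back-propagated
    loss = 0.5 * np.sum(d ** 2)  # hypothetical scalar summary of the difference

    # Manual backward pass (chain rule through the two tanh layers).
    dV = d * (g - 1.0)             # d = (g - 1) * V
    dpre2 = dV * (1.0 - V ** 2)    # through the outer tanh
    dh1 = W2.T @ dpre2
    dpre1 = dh1 * (1.0 - h1 ** 2)  # through the inner tanh
    dv = W1.T @ dpre1
    return loss, dv

rng = np.random.default_rng(0)
q = rng.normal(size=8)
v = rng.normal(size=16)
Wq = rng.normal(0, 0.5, (12, 8))
W1 = rng.normal(0, 0.5, (12, 16))
W2 = rng.normal(0, 0.5, (12, 12))

loss, dv = attention_effect_grad(q, v, Wq, W1, W2)
saliency = np.abs(dv)  # larger values mark input dimensions with a stronger effect
print(saliency.shape)  # (16,)
```

The magnitude of the gradient indicates which input dimensions most influence the gap between the visual-only and joint representations for a given question.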


This review was generated by AI and reviewed by human editors.