QUICK REVIEW

[Paper Review] Visual Reference Resolution using Attention Memory for Visual Dialog

Paul Hongsuck Seo, Andreas Lehrmann|arXiv (Cornell University)|Sep 23, 2017

Multimodal Machine Learning Applications | References: 38 | Cited by: 90

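One-sentence summary

The paper introduces an attention memory mechanism for visual dialog that resolves visual references by retrieving past attention maps and dynamically fusing them with a tentative attention. It achieves state-of-the-art results on VisDial with significantly fewer parameters, and large gains on the synthetic MNIST Dialog dataset.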

ABSTRACT

Visual dialog is a task of answering a series of inter-dependent questions given an input image, and often requires to resolve visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual attentions in the past to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention, taking into account recency, which is most relevant for the current question, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by ~16 % points) in situations, where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~ 2 % points improvement) in the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.

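Motivation and objectives

Frame visual reference resolution as a core challenge of visual dialog that goes beyond VQA.
Propose an associative attention memory that stores past attentions to help resolve the current reference.
Develop a dynamic attention fusion mechanism that combines the tentative attention with the retrieved attention, conditioned on the question.
Demonstrate effectiveness on the synthetic MNIST Dialog dataset and on the real-world VisDial benchmark.
Analyze the memory addressing, sequential preference, and parameter efficiency of the proposed method.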

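Proposed method

Introduce an associative attention memory that stores (attention, key) pairs from previous dialog steps.
Compute a tentative attention from the current question and history, and retrieve the most relevant past attention through memory addressing.
Use a dynamic parameter layer to fuse the tentative and retrieved attentions, conditioned on the current question.
Learn memory keys from the context and answer embeddings and append them at each step, so the memory is populated online.
Train the answer prediction end to end with a cross-entropy loss on the MNIST Dialog and VisDial datasets.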
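
The retrieval and fusion steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: attention maps are flattened to short lists, keys and queries are small vectors, and the dynamic parameter layer is reduced to two mixing coefficients predicted linearly from the question vector. The names `retrieve_attention`, `fuse`, and `weight_rows` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve_attention(memory, query):
    """Soft-address the memory and return a weighted sum of stored attentions.

    memory: list of (attention_map, key) pairs, oldest first.
    query:  vector derived from the current question (assumed given).
    """
    weights = softmax([dot(key, query) for _, key in memory])
    size = len(memory[0][0])
    return [
        sum(w * att[i] for w, (att, _) in zip(weights, memory))
        for i in range(size)
    ]


def fuse(tentative, retrieved, question, weight_rows):
    """Combine two attention maps with question-dependent coefficients.

    weight_rows is a hypothetical 2 x d matrix: row 0 predicts the weight of
    the tentative attention, row 1 the weight of the retrieved attention.
    This scalar mixing is a simplification of a dynamic parameter layer.
    """
    mix = [dot(row, question) for row in weight_rows]
    logits = [mix[0] * t + mix[1] * r for t, r in zip(tentative, retrieved)]
    return softmax(logits)


if __name__ == "__main__":
    memory = [
        ([0.7, 0.1, 0.1, 0.1], [1.0, 0.0]),  # attention from an earlier turn
        ([0.1, 0.1, 0.1, 0.7], [0.0, 1.0]),  # attention from the latest turn
    ]
    retrieved = retrieve_attention(memory, query=[0.0, 4.0])
    final = fuse([0.25] * 4, retrieved, [1.0, 0.5], [[0.0, 0.0], [4.0, 0.0]])
    print(retrieved, final)
```

In this toy setting the query matches the second key, so the retrieved map concentrates on the region attended in the latest turn, and the fusion step passes that preference on to the final attention because the question-dependent coefficients favor the retrieved map.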

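Experimental results

Research questions

RQ1: Can past visual attention be retrieved effectively to resolve ambiguous referring expressions in visual dialog?
RQ2: In dialogs with inter-dependent questions, does dynamically fusing the tentative and retrieved attentions improve grounding and answer accuracy?
RQ3: How does the proposed attention memory affect performance and parameter efficiency on synthetic and real visual dialog benchmarks?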

Key findings

Model	+H	ATT	# of params	MRR	R@1	R@5	R@10	MR
Answer prior [24]	–	–	n/a	0.3735	23.55	48.52	53.23	26.50
LF-Q [24]	–	–	8.3 M (3.6x)	0.5508	41.24	70.45	79.83	7.08
LF-QH [24]	✓	–	12.4 M (5.4x)	0.5578	41.75	71.45	80.94	6.74
LF-QI [24]	–	–	10.4 M (4.6x)	0.5759	43.33	74.27	83.68	5.87
LF-QIH [24]	✓	–	14.5 M (6.3x)	0.5807	43.82	74.68	84.07	5.78
HRE-QH [24]	✓	–	15.0 M (6.5x)	0.5695	42.70	73.25	82.97	6.11
HRE-QIH [24]	✓	–	16.8 M (7.3x)	0.5846	44.67	74.50	84.22	5.72
MN-QH [24]	✓	–	12.4 M (5.4x)	0.5849	44.03	75.26	84.49	5.68
MN-QIH [24]	✓	–	14.7 M (6.4x)	0.5965	45.55	76.22	85.37	5.46
SAN-QI [9]	–	✓	n/a	0.5764	43.44	74.26	83.72	5.88
HieCoAtt-QI [14]	–	✓	n/a	0.5788	43.51	74.49	83.96	5.84
AMEM-QI	–	✓	1.7 M (0.7x)	0.6196	48.24	78.33	87.11	4.92
AMEM-QIH	✓	✓	2.3 M (1.0x)	0.6192	48.05	78.39	87.12	4.88
AMEM+SEQ-QI	–	✓	1.7 M (0.7x)	0.6227	48.53	78.66	87.43	4.86
AMEM+SEQ-QIH	✓	✓	2.3 M (1.0x)	0.6210	48.40	78.39	87.12	4.92
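
The columns MRR, R@1, R@5, R@10, and MR in the table above are ranking metrics computed from the rank that a model assigns to the ground-truth answer among the candidate answers of each question. As a rough sketch, assuming ranks are 1-based integers collected in a list (the function name `retrieval_metrics` is hypothetical), they could be computed like this:

```python
def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute MRR, mean rank, and recall@k from 1-based ground-truth ranks."""
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank, higher is better
        "MR": sum(ranks) / n,                    # mean rank, lower is better
    }
    for k in ks:
        # Percentage of questions whose ground-truth answer is ranked in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in ranks) / n
    return metrics


if __name__ == "__main__":
    print(retrieval_metrics([1, 2, 4, 20]))
```

Higher MRR and R@k and lower MR indicate better ranking, which is why the AMEM rows pair the largest MRR values with the smallest MR values.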

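On MNIST Dialog, the proposed AMEM model outperforms strong baselines (by about 16 percentage points according to the abstract), with accuracy improving markedly when memory addressing and sequential preference are used.
On VisDial, AMEM achieves state-of-the-art results (MRR 0.6227 for AMEM+SEQ-QI versus 0.5965 for the best baseline, MN-QIH) with far fewer parameters (1.7 M to 2.3 M versus 8.3 M to 16.8 M for the baselines with reported sizes).
Question-conditioned dynamic attention fusion produces better final attention maps than fixed or memory-free baselines.
Adding a sequential preference to memory addressing emphasizes recent attentions, which matches the structure of dialogs.
Qualitative analysis shows interpretable retrieval of past attentions and consistent handling of the retrieved references.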
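
To make the sequential preference concrete, one simple way to model it is to add a recency term to each memory slot's addressing score before the softmax. The sketch below assumes a single scalar `theta` multiplied by the slot's distance from the current turn; both the exact form of the bias and the name `recency_biased_weights` are illustrative assumptions rather than the paper's definition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recency_biased_weights(key_scores, theta):
    """Addressing weights with a recency bias.

    key_scores[i] is the key-query match for memory slot i, oldest first.
    The newest slot has distance 1 and the oldest has distance len(key_scores).
    A negative theta penalizes older slots; theta = 0 disables the bias.
    """
    steps = len(key_scores)
    biased = [s + theta * (steps - i) for i, s in enumerate(key_scores)]
    return softmax(biased)


if __name__ == "__main__":
    print(recency_biased_weights([1.0, 1.0, 1.0], theta=-1.0))
    print(recency_biased_weights([1.0, 1.0, 1.0], theta=0.0))
```

When all slots match the query equally well, a negative `theta` shifts weight toward the most recent attention, while `theta = 0` falls back to uniform weights.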

This review was generated by AI and checked by human editors.