QUICK REVIEW

[Paper Review] Visual Reference Resolution using Attention Memory for Visual Dialog

Paul Hongsuck Seo, Andreas Lehrmann|arXiv (Cornell University)|Sep 23, 2017

Multimodal Machine Learning Applications | References: 38 | Citations: 90

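One-line summary

This paper introduces an attention memory mechanism for resolving visual references in visual dialog: it retrieves past attention maps and dynamically fuses the retrieved attention with a tentative attention for the current question. It achieves state-of-the-art results on VisDial with far fewer parameters than the baselines, and shows large gains on the synthetic MNIST Dialog dataset.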

ABSTRACT

Visual dialog is a task of answering a series of inter-dependent questions given an input image, and often requires to resolve visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual attentions in the past to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention, taking into account recency, which is most relevant for the current question, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by ~16 % points) in situations, where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~ 2 % points improvement) in the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.

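Motivation and objectives

Motivate visual reference resolution as a core challenge of visual dialog that goes beyond VQA.
Propose an associative attention memory that stores past attentions to help resolve the current reference.
Develop a mechanism that dynamically fuses the tentative and retrieved attentions, conditioned on the question.
Demonstrate effectiveness on the synthetic MNIST Dialog dataset and the real-world VisDial benchmark.
Analyze the memory addressing, sequential bias, and parameter efficiency of the proposed method.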

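Proposed method

Introduce an associative attention memory that stores (attention, key) pairs from previous dialog steps.
Compute a tentative attention from the current question and history, and retrieve the relevant past attention through memory addressing.
Use a dynamic parameter layer, conditioned on the current question, to fuse the tentative and retrieved attentions.
Populate the memory online: after each step, append the attention together with a learned key computed from the context and answer embeddings.
Train end to end with a cross-entropy loss on the answers, on both the MNIST Dialog and VisDial datasets.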
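
The retrieval and fusion steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: the function names, the scalar recency weight `theta`, and the matrix `W_dyn` are all hypothetical, and the fusion here is a question-predicted convex mix of the two maps, which stands in for the paper's dynamic parameter layer. The memory is assumed to be ordered oldest first, and a negative `theta` is assumed to favor recent entries.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def retrieve_attention(query, keys, past_attentions, theta):
    """Soft-retrieve one attention map from the memory.

    query:           (d,)   embedding of the current question/context
    keys:            (T, d) one key per stored dialog step, oldest first
    past_attentions: (T, N) one attention map (over N regions) per step
    theta:           scalar recency weight (assumed; negative favors recent steps)
    """
    T = keys.shape[0]
    # steps_back[i] = how many turns ago memory entry i was written.
    steps_back = np.arange(T, 0, -1)
    scores = keys @ query + theta * steps_back
    beta = softmax(scores)               # addressing weights over memory slots
    return beta @ past_attentions, beta  # weighted average of stored maps

def fuse_attentions(tentative, retrieved, question_vec, W_dyn):
    """Mix the two maps with weights predicted from the question.

    W_dyn: (2, d) hypothetical projection producing two mixing logits.
    """
    gate = softmax(W_dyn @ question_vec)
    fused = gate[0] * tentative + gate[1] * retrieved
    return fused / fused.sum()

# Tiny usage example with made-up numbers.
keys = np.ones((3, 4))
past = np.eye(3)  # three stored maps, each focused on a different region
retrieved, beta = retrieve_attention(np.ones(4), keys, past, theta=-1.0)
final = fuse_attentions(np.array([1.0, 0.0, 0.0]), retrieved,
                        np.zeros(4), np.zeros((2, 4)))
print(beta, final)
```

With identical keys, the recency term alone decides the addressing weights, so the most recent slot receives the largest weight. A learned model would trade this bias off against key similarity.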

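Experimental results

Research questions

RQ1: Can past visual attentions be retrieved effectively to resolve ambiguous referring expressions in visual dialog?
RQ2: Does dynamically fusing the tentative and retrieved attentions improve grounding and answer accuracy in dialogs with inter-dependent questions?
RQ3: How does the proposed attention memory affect performance and parameter efficiency on synthetic and real-world visual dialog benchmarks?

Key findings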

Model	+H	ATT	# Params	MRR	R@1	R@5	R@10	MR
Answer prior [24]	–	–	n/a	0.3735	23.55	48.52	53.23	26.50
LF-Q [24]	–	–	8.3 M (3.6x)	0.5508	41.24	70.45	79.83	7.08
LF-QH [24]	✓	–	12.4 M (5.4x)	0.5578	41.75	71.45	80.94	6.74
LF-QI [24]	–	–	10.4 M (4.6x)	0.5759	43.33	74.27	83.68	5.87
LF-QIH [24]	✓	–	14.5 M (6.3x)	0.5807	43.82	74.68	84.07	5.78
HRE-QH [24]	✓	–	15.0 M (6.5x)	0.5695	42.70	73.25	82.97	6.11
HRE-QIH [24]	✓	–	16.8 M (7.3x)	0.5846	44.67	74.50	84.22	5.72
MN-QH [24]	✓	–	12.4 M (5.4x)	0.5849	44.03	75.26	84.49	5.68
MN-QIH [24]	✓	–	14.7 M (6.4x)	0.5965	45.55	76.22	85.37	5.46
SAN-QI [9]	–	✓	n/a	0.5764	43.44	74.26	83.72	5.88
HieCoAtt-QI [14]	–	✓	n/a	0.5788	43.51	74.49	83.96	5.84
AMEM-QI	–	✓	1.7 M (0.7x)	0.6196	48.24	78.33	87.11	4.92
AMEM-QIH	✓	✓	2.3 M (1.0x)	0.6192	48.05	78.39	87.12	4.88
AMEM+SEQ-QI	–	✓	1.7 M (0.7x)	0.6227	48.53	78.66	87.43	4.86
AMEM+SEQ-QIH	✓	✓	2.3 M (1.0x)	0.6210	48.40	78.39	87.12	4.92
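
For readers unfamiliar with the column headers, the ranking metrics in the table can be computed from the rank the model assigns to the ground-truth answer among the candidate answers for each question. The sketch below assumes 1-based ranks and reports recall as a percentage, matching the table's scale; the function name `retrieval_metrics` and the sample ranks are made up for illustration.

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute MRR, mean rank (MR), and recall@k from ground-truth ranks.

    ranks: iterable of 1-based ranks of the correct answer, one per question.
    """
    ranks = np.asarray(ranks, dtype=float)
    metrics = {
        "MRR": float(np.mean(1.0 / ranks)),  # mean reciprocal rank, higher is better
        "MR": float(np.mean(ranks)),         # mean rank, lower is better
    }
    for k in ks:
        # Percentage of questions whose correct answer is ranked in the top k.
        metrics[f"R@{k}"] = float(100.0 * np.mean(ranks <= k))
    return metrics

# Made-up ranks for four questions.
print(retrieval_metrics([1, 2, 4, 20]))
```

Higher MRR and R@k and lower MR indicate better ranking, which is how the AMEM rows compare against the baselines above.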

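On MNIST Dialog, the proposed AMEM model with memory addressing and sequential preference outperforms strong baselines by a large margin in accuracy (roughly 16 percentage points, per the abstract).
On VisDial, AMEM outperforms every baseline in the table (best MRR 0.6227 vs. 0.5965 for MN-QIH) while using far fewer parameters (1.7 M to 2.3 M vs. 8.3 M to 16.8 M).
Question-conditioned dynamic attention fusion yields better final attention maps than fixed fusion or memory-free baselines.
Incorporating a sequential preference into memory addressing emphasizes recent attentions, which matches the structure of dialog.
Qualitative analysis shows interpretable retrieval of past attentions and consistent manipulation of the retrieved references.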

This review was generated by AI and checked by a human editor.