QUICK REVIEW

[Paper Review] Visual Reference Resolution using Attention Memory for Visual Dialog

Paul Hongsuck Seo, Andreas Lehrmann | arXiv (Cornell University) | 2017. 09. 23.

Multimodal Machine Learning Applications | References: 38 | Citations: 90

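One-line summary

This paper introduces an attention memory mechanism that resolves visual references in visual dialog by retrieving past attentions and dynamically fusing them with a tentative attention. It achieves state-of-the-art results on VisDial with far fewer parameters and obtains strong gains on a synthetic MNIST dialog dataset.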

ABSTRACT

Visual dialog is a task of answering a series of inter-dependent questions given an input image, and often requires to resolve visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual attentions in the past to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention, taking into account recency, which is most relevant for the current question, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by ~16 % points) in situations, where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~ 2 % points improvement) in the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.

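Motivation and objectives

Presents visual reference resolution as a core challenge of visual dialog, beyond VQA.
Proposes an associative attention memory that stores past attentions to help resolve the current reference.
Develops a dynamic attention fusion mechanism that combines the tentative attention with the retrieved attention, conditioned on the question.
Demonstrates effectiveness on a synthetic MNIST dialog dataset and on the real VisDial benchmark.
Analyzes the memory addressing, sequential bias, and parameter efficiency of the proposed approach.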

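Proposed method

Introduces an associative attention memory that stores (attention, key) pairs from previous dialog steps.
Computes a tentative attention from the current question and history, and retrieves the relevant past attention through memory addressing.
Uses a dynamic parameter layer to fuse the tentative attention and the retrieved attention, conditioned on the current question.
Generates a memory key from the context and answer embeddings and stores it in the memory online, at each dialog step.
Trains end-to-end on the MNIST Dialog and VisDial datasets with a cross-entropy loss over the answers.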
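
The retrieve-then-fuse steps listed above can be illustrated with a small sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: attentions are plain probability lists over image regions, addressing is a dot product followed by a softmax, and the "dynamic parameter" fusion is reduced to two mixing weights predicted from the question embedding. The names `AttentionMemory`, `fuse`, and `w_pred` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class AttentionMemory:
    """Toy associative memory of (attention, key) pairs (hypothetical helper)."""

    def __init__(self):
        self.slots = []  # one (attention, key) pair per past dialog step

    def write(self, attention, key):
        # Called online after each dialog step.
        self.slots.append((list(attention), list(key)))

    def retrieve(self, query):
        """Soft-address the memory; return (retrieved attention, address weights)."""
        if not self.slots:
            raise ValueError("memory is empty; nothing to retrieve at the first step")
        weights = softmax([dot(key, query) for _, key in self.slots])
        num_regions = len(self.slots[0][0])
        retrieved = [
            sum(w * att[i] for w, (att, _) in zip(weights, self.slots))
            for i in range(num_regions)
        ]
        return retrieved, weights


def fuse(tentative, retrieved, question_emb, w_pred):
    """Question-conditioned fusion (assumed simplification).

    w_pred is a 2 x d matrix that maps the question embedding to two mixing
    logits. A convex mix of two distributions is itself a distribution.
    """
    mix = softmax([dot(row, question_emb) for row in w_pred])
    return [mix[0] * t + mix[1] * r for t, r in zip(tentative, retrieved)]


if __name__ == "__main__":
    memory = AttentionMemory()
    memory.write([0.9, 0.1, 0.0, 0.0], key=[1.0, 0.0])  # step 1 looked at region 0
    memory.write([0.0, 0.0, 0.2, 0.8], key=[0.0, 1.0])  # step 2 looked at region 3
    retrieved, weights = memory.retrieve([5.0, 0.0])     # query resembles the step 1 key
    final = fuse([0.25] * 4, retrieved, [1.0, 0.0], [[0.0, 0.0], [3.0, 0.0]])
    print(weights, final)
```

In this toy run the query matches the first key, so the retrieved attention concentrates on region 0, and the question-dependent mixing weights decide how much of it overrides the uninformative tentative attention.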

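Experimental results

Research questions

RQ1: Can past visual attentions be retrieved to effectively resolve ambiguous referring expressions in visual dialog?
RQ2: Does question-dependent dynamic fusion of attentions improve grounding and answer accuracy in dialogs with inter-dependent questions?
RQ3: How does the proposed attention memory affect performance and parameter efficiency on synthetic and real visual dialog benchmarks?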

Key results

Model	+H	ATT	# Params	MRR	R@1	R@5	R@10	MR
Answer prior [24]	–	–	n/a	0.3735	23.55	48.52	53.23	26.50
LF-Q [24]	–	–	8.3 M (3.6x)	0.5508	41.24	70.45	79.83	7.08
LF-QH [24]	✓	–	12.4 M (5.4x)	0.5578	41.75	71.45	80.94	6.74
LF-QI [24]	–	–	10.4 M (4.6x)	0.5759	43.33	74.27	83.68	5.87
LF-QIH [24]	✓	–	14.5 M (6.3x)	0.5807	43.82	74.68	84.07	5.78
HRE-QH [24]	✓	–	15.0 M (6.5x)	0.5695	42.70	73.25	82.97	6.11
HRE-QIH [24]	✓	–	16.8 M (7.3x)	0.5846	44.67	74.50	84.22	5.72
MN-QH [24]	✓	–	12.4 M (5.4x)	0.5849	44.03	75.26	84.49	5.68
MN-QIH [24]	✓	–	14.7 M (6.4x)	0.5965	45.55	76.22	85.37	5.46
SAN-QI [9]	–	✓	n/a	0.5764	43.44	74.26	83.72	5.88
HieCoAtt-QI [14]	–	✓	n/a	0.5788	43.51	74.49	83.96	5.84
AMEM-QI	–	✓	1.7 M (0.7x)	0.6196	48.24	78.33	87.11	4.92
AMEM-QIH	✓	✓	2.3 M (1.0x)	0.6192	48.05	78.39	87.12	4.88
AMEM+SEQ-QI	–	✓	1.7 M (0.7x)	0.6227	48.53	78.66	87.43	4.86
AMEM+SEQ-QIH	✓	✓	2.3 M (1.0x)	0.6210	48.40	78.39	87.12	4.92
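
The MRR, R@k, and MR columns in the table are ranking metrics. As a minimal sketch, assuming each question comes with the 1-based rank that a model assigned to the ground-truth answer among its candidates, they can be computed as follows. The function name `ranking_metrics` and the example ranks are made up for illustration and are not taken from the paper.

```python
def ranking_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute MRR, mean rank (MR), and recall@k (in percent).

    gt_ranks: 1-based rank of the ground-truth answer for each question.
    """
    n = len(gt_ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in gt_ranks) / n,  # higher is better
        "MR": sum(gt_ranks) / n,                     # lower is better
    }
    for k in ks:
        # Share of questions whose ground-truth answer is ranked in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(1 for r in gt_ranks if r <= k) / n
    return metrics


if __name__ == "__main__":
    # Hypothetical ranks for four questions.
    print(ranking_metrics([1, 2, 4, 10]))
```

For the four hypothetical ranks above, the ground-truth answer is ranked first once, so R@1 is 25 percent, while MRR is about 0.4625 and MR is 4.25. MRR and R@k reward high placement, whereas MR is dominated by the worst-ranked questions, which is why the table reports them together.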

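On MNIST Dialog, the proposed AMEM model outperforms strong baselines (by about 16 percentage points according to the abstract), and accuracy improves substantially when memory addressing and the sequential preference are used.
On VisDial, AMEM achieves state-of-the-art performance (MRR of 0.6192 to 0.6227 versus 0.5965 for the best baseline, MN-QIH) with far fewer parameters than the compared models (1.7 M to 2.3 M versus 8.3 M to 16.8 M).
Question-conditioned dynamic attention fusion yields better final attention maps than fixed fusion or baselines without memory.
Introducing a sequential preference in memory addressing puts more weight on recent attentions, which matches the structure of dialog.
Qualitative analysis shows interpretable retrieval of past attentions and consistent manipulation of the retrieved references.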
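
The sequential preference in memory addressing can be pictured as a recency term added to the content-based matching scores. The sketch below assumes a linear penalty on the time distance with a non-negative weight `theta`; this form, the function name `address_with_recency`, and the parameter name are assumptions for illustration, and the paper's exact formulation may differ.

```python
import math


def address_with_recency(content_scores, theta):
    """Softmax addressing with a recency bias (assumed linear form).

    content_scores[i]: match between the current query and memory slot i,
                       with slots ordered from oldest to newest.
    theta:             non-negative weight of the recency preference;
                       theta = 0 gives purely content-based addressing.
    """
    newest = len(content_scores) - 1
    # Penalize each slot in proportion to how many steps ago it was written.
    logits = [s - theta * (newest - i) for i, s in enumerate(content_scores)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [1.0, 1.0, 1.0]  # three past steps that match the query equally well
    print(address_with_recency(scores, theta=0.0))  # uniform weights
    print(address_with_recency(scores, theta=1.0))  # newest slot gets the most weight
```

When the content scores tie, as in this example, the recency term alone breaks the tie in favor of the most recent attention, which mirrors how an ambiguous "it" in dialog usually refers to something mentioned recently.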


This review was generated by AI and reviewed by a human editor.