QUICK REVIEW

[Paper Review] A Focused Dynamic Attention Model for Visual Question Answering

Ilija Ilievski, Shuicheng Yan | arXiv (Cornell University) | 2016-04-06

Multimodal Machine Learning Applications | References: 24 | Citations: 130

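One-Line Summary

FDA applies question-guided, focused dynamic attention to object regions and combines the question with local and global visual features through LSTMs, achieving state-of-the-art performance on the open-ended and multiple-choice VQA benchmarks.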

ABSTRACT

Visual Question and Answering (VQA) problems are attracting increasing interest from multiple research disciplines. Solving VQA problems requires techniques from both computer vision for understanding the visual contents of a presented image or video, as well as the ones from natural language processing for understanding semantics of the question and generating the answers. Regarding visual content modeling, most of existing VQA methods adopt the strategy of extracting global features from the image or video, which inevitably fails in capturing fine-grained information such as spatial configuration of multiple objects. Extracting features from auto-generated regions -- as some region-based image recognition methods do -- cannot essentially address this problem and may introduce some overwhelming irrelevant features with the question. In this work, we propose a novel Focused Dynamic Attention (FDA) model to provide better aligned image content representation with proposed questions. Being aware of the key words in the question, FDA employs off-the-shelf object detector to identify important regions and fuse the information from the regions and global features via an LSTM unit. Such question-driven representations are then combined with question representation and fed into a reasoning unit for generating the answers. Extensive evaluation on a large-scale benchmark dataset, VQA, clearly demonstrate the superior performance of FDA over well-established baselines.

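Motivation and Objectives

Motivate better visual content modeling for VQA that goes beyond global image features.
Develop an attention mechanism that focuses on the local image regions relevant to the question.
Fuse the focused region features with the global image context and the question representation.
Demonstrate improved performance over baselines and prior attention models on a large-scale VQA benchmark.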

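Proposed Method

Extract global and region-based CNN features from the image.
Use an object detector to identify candidate regions relevant to the question.
Feed the image regions and the whole-image context into an LSTM, encoding visual information in the order of the question words.
Encode the question with an LSTM to obtain a question representation.
Apply the focused dynamic attention mechanism, which sequences region features in question-word order and combines them with the global features.
Fuse the question and visual representations through tanh and ReLU activations, then apply element-wise multiplication and a feed-forward network, and predict the answer with a softmax over the 1,000 most common answers.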
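The steps above can be illustrated with a toy sketch in pure Python. It is not the authors' implementation, and several parts are assumptions made for illustration: question words are matched to detector labels by exact string equality, the global feature is placed last in the visual sequence, a simple recurrent update (`toy_recurrent_encode`) stands in for the LSTM, tanh is applied to the question branch and ReLU to the visual branch, and all function names, weights, and dimensions are hypothetical.

```python
import math
import random


def order_regions(question_words, detections, global_feat):
    """Sequence region features in the order their labels appear in the question.

    detections: list of (label, feature_vector) pairs from an object detector.
    Assumption: exact label match, with the global image feature appended last.
    """
    sequence = []
    for word in question_words:
        for label, feat in detections:
            if label == word:
                sequence.append(feat)
    sequence.append(global_feat)
    return sequence


def toy_recurrent_encode(sequence, decay=0.5):
    """Order-sensitive stand-in for the visual LSTM (NOT an actual LSTM)."""
    h = [0.0] * len(sequence[0])
    for x in sequence:
        h = [math.tanh(xi + decay * hi) for xi, hi in zip(x, h)]
    return h


def linear(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fuse_and_predict(q_vec, v_vec, Wq, Wv, Wout):
    """Project both branches, multiply element-wise, and apply a softmax."""
    q = [math.tanh(a) for a in linear(Wq, q_vec)]  # question branch (assumed tanh)
    v = [max(0.0, a) for a in linear(Wv, v_vec)]   # visual branch (assumed ReLU)
    joint = [a * b for a, b in zip(q, v)]          # element-wise multiplication
    logits = linear(Wout, joint)                   # feed-forward output layer
    m = max(logits)                                # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
DIM, HID, N_ANS = 4, 3, 5  # toy sizes; the method above uses 1,000 answers


def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


# Hypothetical detector output and question
detections = [("dog", rand_vec(DIM)), ("frisbee", rand_vec(DIM)), ("tree", rand_vec(DIM))]
global_feat = rand_vec(DIM)
question = "what color is the frisbee near the dog".split()

seq = order_regions(question, detections, global_feat)  # frisbee, dog, global
v_vec = toy_recurrent_encode(seq)
q_vec = rand_vec(DIM)  # placeholder for the question LSTM's final state
probs = fuse_and_predict(q_vec, v_vec, rand_mat(HID, DIM), rand_mat(HID, DIM), rand_mat(N_ANS, HID))
```

In this sketch the "tree" region is dropped because no question word refers to it, and "frisbee" precedes "dog" because that is the order in which the words occur in the question. This mirrors the idea of letting the question decide which regions are encoded and in what order.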

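Experimental Results

Research Questions

RQ1: Does question-driven, focused attention on object-centric image regions improve VQA accuracy compared with global or unfocused attention methods?
RQ2: How does including both localized region features and global context affect the open-ended and multiple-choice VQA tasks?
RQ3: Can LSTM-based fusion of the question with focused visual features achieve state-of-the-art results on the VQA benchmark?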

Key Results

FDA achieves state-of-the-art performance on the VQA dataset for both the open-ended and multiple-choice tasks.
Open-ended test-dev: FDA 59.24 (All), 81.14 (Y/N), 45.77 (Other), 36.16 (Num); test-std: 59.54 (All).
Multiple-choice test-dev: FDA 64.01 (All), 81.50 (Y/N), 54.72 (Other), 39.00 (Num); test-std: 64.18 (All).
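FDA outperforms the SAN baseline by roughly 0.6 percentage points on the open-ended task and by roughly 1.1 percentage points on the multiple-choice task.
Qualitative results show that accuracy on color, counting, and object-identification questions improves when the model focuses on the relevant regions.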

This review was generated by AI and reviewed by a human editor.