QUICK REVIEW

[Paper Review] Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Akira Fukui, Dong Huk Park | arXiv (Cornell University) | June 6, 2016

Multimodal Machine Learning Applications | References: 53 | Citations: 394

One-line summary

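The paper introduces Multimodal Compact Bilinear pooling (MCB) to fuse visual and textual features efficiently, improving visual question answering (VQA) and visual grounding. It reports state-of-the-art results on VQA datasets and higher grounding accuracy.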

ABSTRACT

Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state-of-the-art on the Visual7W dataset and the VQA challenge.

Research motivation and goals

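The work is motivated by the need for multimodal fusion that is more expressive than simple concatenation or element-wise operations.
It proposes MCB to efficiently approximate outer-product interactions between image and text features.
It applies MCB to VQA (including attention) and to visual grounding, and evaluates it on multiple datasets.
It shows that MCB-based models improve over strong baselines and ablations.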
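
The efficiency argument can be made concrete by writing out the identity that compact bilinear pooling relies on. The symbols below ($\Psi$ for a Count Sketch projection, $h_k$ and $s_k$ for the hash and sign functions, $d$ for the output dimension) are notation chosen here for illustration, not necessarily the paper's. The sizes come from the "MCB (2048x2048 -> 16K)" configuration in the results table, taking 16K as roughly 16,000.

```latex
% Count Sketch of an outer product equals the circular convolution (*) of the
% individual sketches, which can be evaluated with FFTs:
\Psi(x \otimes q,\, h,\, s)
  = \Psi(x, h_1, s_1) * \Psi(q, h_2, s_2)
  = \mathrm{FFT}^{-1}\!\big(\mathrm{FFT}(\Psi(x, h_1, s_1)) \odot \mathrm{FFT}(\Psi(q, h_2, s_2))\big),
\qquad
h(i,j) = \big(h_1(i) + h_2(j)\big) \bmod d, \quad s(i,j) = s_1(i)\, s_2(j).

% Size comparison for two 2048-dimensional inputs:
\dim(x \otimes q) = 2048^2 = 4{,}194{,}304
\qquad \text{versus} \qquad
d \approx 16{,}000.
```

The explicit outer product therefore has over 250 times as many entries as the compact representation, which is why it is approximated rather than computed directly.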

Proposed method

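Defines Multimodal Compact Bilinear pooling (MCB), which approximates outer-product interactions using Count Sketch projections and FFT-based convolution.
Applies MCB to fuse image features (CNN) and the question embedding (LSTM) into a 16K-dimensional joint representation.
Incorporates soft attention over spatial features by applying MCB to the language-visual pair at each grid location and predicting an attention map.
Extends the model with multiple attention maps (multiple glimpses) and, in the multiple-choice setting, an additional MCB branch for encoding the candidate answers.
For visual grounding, replaces the concatenation in GroundeR with MCB to combine the phrase with the visual proposals, using L2-normalized embeddings.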
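
The Count Sketch and FFT steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`make_sketch_params`, `count_sketch`, `mcb`) are hypothetical, the inputs are single 1-D vectors rather than batches, and the tiny dimensions in the usage note are chosen only for readability.

```python
import numpy as np


def make_sketch_params(in_dim, out_dim, rng):
    """Draw a fixed random hash h (bucket index) and sign s (+1/-1) per input dimension.

    Hypothetical helper: in practice these would be sampled once and kept fixed.
    """
    h = rng.integers(0, out_dim, size=in_dim)
    s = rng.choice(np.array([-1.0, 1.0]), size=in_dim)
    return h, s


def count_sketch(v, h, s, out_dim):
    """Project v into out_dim buckets: y[h[i]] += s[i] * v[i]."""
    y = np.zeros(out_dim)
    np.add.at(y, h, s * v)  # unbuffered add, so colliding indices accumulate
    return y


def mcb(x, q, params_x, params_q, out_dim):
    """Approximate the outer product of x and q in out_dim dimensions.

    The circular convolution of the two sketches is computed as an
    element-wise product in the frequency domain.
    """
    h_x, s_x = params_x
    h_q, s_q = params_q
    sketch_x = count_sketch(x, h_x, s_x, out_dim)
    sketch_q = count_sketch(q, h_q, s_q, out_dim)
    spectrum = np.fft.rfft(sketch_x) * np.fft.rfft(sketch_q)
    return np.fft.irfft(spectrum, n=out_dim)
```

A usage example with illustrative sizes (the review reports 2048-dimensional inputs and a 16K-dimensional output; small values are used here only to keep the example quick):

```python
rng = np.random.default_rng(0)
d = 16
params_x = make_sketch_params(8, d, rng)   # stand-in for the image feature
params_q = make_sketch_params(6, d, rng)   # stand-in for the question feature
z = mcb(rng.standard_normal(8), rng.standard_normal(6), params_x, params_q, d)
print(z.shape)
```

Because the hash and sign vectors are fixed after sampling, the pooling step itself adds no learned parameters in this sketch; only the layers before and after it would be trained.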

Experimental results

Research questions

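RQ1: Does Multimodal Compact Bilinear pooling provide more expressive fusion than concatenation or element-wise pooling for VQA and grounding?
RQ2: How does MCB perform when integrated with attention mechanisms and the multiple-choice question answering setting?
RQ3: What effect does the dimension d of the MCB features have on VQA and grounding?
RQ4: Can MCB improve on state-of-the-art results across multiple VQA datasets and grounding benchmarks?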

Key results

Method	Accuracy (%)
Element-wise sum	56.50
Concatenation	57.49
Concatenation + FC	58.40
Concatenation + FC + FC	57.10
Element-wise product	58.57
Element-wise product + FC	56.44
Element-wise product + FC + FC	57.88
MCB (2048x2048 -> 16K)	59.83
Full bilinear (128x128 -> 16K)	58.46
MCB (128x128 -> 4K)	58.69
Element-wise product with VGG-19	55.97
MCB (d=16K) with VGG-19	57.05
Concatenation + FC with Attention	58.36
MCB (d=16K) with Attention	62.50

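MCB outperforms non-bilinear pooling baselines on the VQA and grounding tasks.
Soft attention with MCB yields the best results, and attention over MCB features outperforms attention over a concatenation layer.
Using 16K-dimensional MCB features gives the highest accuracy in the open-ended VQA setting.
The best single model (MCB with two attention maps, using Visual Genome data and GloVe) outperforms competing methods on the VQA open-ended and multiple-choice benchmarks.
MCB-based grounding achieves state-of-the-art accuracy on the Flickr30k Entities and ReferItGame datasets.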


This review was generated by AI and reviewed by a human editor.