QUICK REVIEW

[Paper Review] Multi-modality Latent Interaction Network for Visual Question Answering

Peng Gao, Haoxuan You|arXiv (Cornell University)|Aug 10, 2019
Multimodal Machine Learning Applications|57 references|34 citations
TL;DR

MLIN introduces Multi-modality Latent Interaction modules that summarize visual and language information into a small set of latent vectors, model cross-modal relations among these summaries, and update features through attention-based aggregation to improve VQA performance.

ABSTRACT

Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual regions and words, which are not enough to correctly answer the question. From humans' perspective, answering a visual question requires understanding the summarizations of visual and language information. In this paper, we propose the Multi-modality Latent Interaction module (MLI) to tackle this problem. The proposed module learns the cross-modality relationships between latent visual and language summarizations, which summarize visual regions and question into a small number of latent representations to avoid modeling uninformative individual region-word relations. The cross-modality information between the latent summarizations is propagated to fuse valuable information from both modalities and is used to update the visual and word features. Such MLI modules can be stacked for several stages to model complex and latent relations between the two modalities and achieve highly competitive performance on public VQA benchmarks, VQA v2.0 and TDIUC. In addition, we show that the performance of our methods could be significantly improved by combining with the pre-trained language model BERT.

Motivation & Objective

  • Motivate the need to move beyond region-word relations by learning high-level latent summaries of each modality.
  • Propose the MLIN framework that summarizes visual and language information into a small number of latent vectors.
  • Model cross-modal relations between latent visual-language summaries and propagate information between them.
  • Update original visual and word features via attention mechanisms to predict answers.
  • Show that integrating with a pre-trained language model (BERT) improves VQA performance.

Proposed method

  • Encode visual regions with Faster R-CNN and questions with a bidirectional Transformer to obtain R in R^{M x 512} and E in R^{N x 512}.
  • Generate k latent summarization vectors for each modality via learned linear mappings, turning R and E into latent representations \overline{R} and \overline{E}, one set of k vectors per modality.
  • Construct a k x k cross-modal relation tensor A(i,j,:) = W_A [ \overline{R}(i,:) \odot \overline{E}(j,:) ] + b_A, where \odot is elementwise multiplication, to capture pairwise latent interactions.
  • Propagate information across paired latent features via two operations: (i) a cross-modal transformation on A to produce lat_A_c, (ii) a second propagation to exchange higher-order information across all pairs to produce lat_A_p; sum them to obtain lat_A.
  • Aggregate updated latent representations back to the original modalities using key-query attention to produce R_U and E_U.
  • Stack multiple MLI modules to progressively refine features, then pool and fuse via elementwise multiplication for final answer prediction with a linear classifier.
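The summarization, relation-tensor, and aggregation steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the softmax over regions or words in `summarize`, the scaled dot-product form of the key-query attention, the random untrained weights, and the toy sizes M = 36 and N = 14 are all hypothetical choices. The propagation step and the stacking of modules are omitted.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def summarize(X, W, b):
    """Collapse n features (n, d) into k latent summaries (k, d).

    W: (d, k), b: (k,). Normalizing with a softmax over the n items
    is an assumption of this sketch.
    """
    weights = softmax(X @ W + b, axis=0)  # (n, k); each column sums to 1
    return weights.T @ X                  # (k, d)

def relation_tensor(R_bar, E_bar, W_A, b_A):
    """A(i, j, :) = W_A [R_bar(i, :) * E_bar(j, :)] + b_A, shape (k, k, d)."""
    pairs = R_bar[:, None, :] * E_bar[None, :, :]  # (k, k, d) elementwise products
    return pairs @ W_A.T + b_A

def aggregate(X, A, W_q, W_k):
    """Key-query attention from original features (n, d) to the k*k latent pairs."""
    d = X.shape[1]
    latents = A.reshape(-1, d)                           # (k*k, d)
    scores = (X @ W_q) @ (latents @ W_k).T / np.sqrt(d)  # (n, k*k)
    return softmax(scores, axis=1) @ latents             # (n, d)

# Toy run with random, untrained weights (illustrative sizes only).
rng = np.random.default_rng(0)
M, N, d, k = 36, 14, 512, 6
R = rng.standard_normal((M, d))  # stand-in for region features
E = rng.standard_normal((N, d))  # stand-in for word features

def init(*shape):
    """Small random weights; a placeholder for learned parameters."""
    return 0.02 * rng.standard_normal(shape)

R_bar = summarize(R, init(d, k), np.zeros(k))
E_bar = summarize(E, init(d, k), np.zeros(k))
W_A, b_A = init(d, d), np.zeros(d)
A = relation_tensor(R_bar, E_bar, W_A, b_A)
R_U = aggregate(R, A, init(d, d), init(d, d))
E_U = aggregate(E, A, init(d, d), init(d, d))
print(R_bar.shape, A.shape, R_U.shape, E_U.shape)
# (6, 512) (6, 6, 512) (36, 512) (14, 512)
```

The broadcast in `relation_tensor` forms all k x k elementwise products at once, so the relation tensor costs the same regardless of how many regions or words the inputs contain.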

Experimental results

Research questions

  • RQ1: Can learning a small set of latent cross-modal summaries improve VQA by focusing on high-level interactions rather than all region-word pairs?
  • RQ2: How does propagating information among latent summaries affect cross-modal reasoning and final VQA accuracy?
  • RQ3: What is the impact of integrating a pre-trained language model (BERT) into the MLIN framework on VQA performance?

Key findings

  • MLIN achieves competitive performance on VQA v2.0 and TDIUC benchmarks.
  • Using 6 visual and 6 question latent summaries with 3x3 attention heads yields strong results in ablations.
  • Relational reasoning via latent summaries reduces required message passing while maintaining competitive accuracy compared to prior methods like DFAF.
  • Incorporating BERT finetuning (with careful learning-rate scheduling) further improves accuracy over the MLIN baseline.
  • Deeper stacking (MLIN-8) generally improves performance over shallower configurations in ablations.
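To make the message-passing claim concrete, the short calculation below compares the number of cross-modal pairs in a dense region-word scheme with the number of latent pairs. It is only a rough count: the function name is hypothetical, the 100 regions and 14 words are assumed values not taken from the review, and only k = 6 comes from the ablation setting above.

```python
def pair_counts(num_regions, num_words, k):
    """Return (dense region-word pairs, latent summary pairs)."""
    dense = num_regions * num_words  # every region paired with every word
    latent = k * k                   # every visual summary paired with every question summary
    return dense, latent

# Assumed sizes: 100 regions and 14 words; k = 6 latent summaries per modality.
dense, latent = pair_counts(100, 14, 6)
print(dense, latent)  # 1400 36
```

Under these assumed sizes the latent scheme relates 36 pairs instead of 1400, and the latent count stays fixed as the number of regions or words grows.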


This review was created by AI and reviewed by human editors.