[Paper Review] Hadamard Product for Low-rank Bilinear Pooling

Jin-Hwa Kim, Kyoung Woon On|arXiv (Cornell University)|Oct 14, 2016
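Multimodal Machine Learning Applications | Cited by 179
One-Sentence Summary

This paper introduces low-rank bilinear pooling using the Hadamard product (MLB) as an efficient alternative to compact bilinear pooling for visual question answering, achieving state-of-the-art results on the VQA dataset with better parameter efficiency.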

ABSTRACT

Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.

Research Motivation and Objectives

  • Address the high dimensionality of full bilinear pooling, which limits its use in computationally demanding multimodal learning tasks.
  • Propose a low-rank bilinear pooling mechanism using Hadamard product to reduce parameters while preserving expressiveness.
  • Apply the method to an attention-based multimodal network for VQA and analyze architecture choices.
  • Demonstrate state-of-the-art performance on VQA with a parsimonious model and provide ablations.
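
To make the dimensionality argument concrete, the sketch below counts weights for a full bilinear model (one N x M matrix per output) and for the low-rank form f = P^T (U^T x ∘ V^T y). The function names and layer sizes are illustrative assumptions, not the configuration reported in the paper, and biases are left out.

```python
def full_bilinear_params(n, m, c):
    """Weights in a full bilinear model: one n x m matrix W_i per output unit."""
    return n * m * c


def low_rank_params(n, m, d, c):
    """Weights in f = P^T (U^T x ∘ V^T y): U is n x d, V is m x d, P is d x c."""
    return n * d + m * d + d * c


# Hypothetical sizes chosen only for illustration (not taken from the paper).
n, m, d, c = 2048, 2048, 1024, 2000

full = full_bilinear_params(n, m, c)
low = low_rank_params(n, m, d, c)

print(f"full bilinear: {full:,} weights")
print(f"low-rank:      {low:,} weights")
print(f"reduction:     about {full / low:.0f}x")
```

With these assumed sizes the full model needs billions of weights while the low-rank form needs a few million, which is the gap the objectives above refer to.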

Proposed Method

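  • Factor each bilinear weight matrix W_i as W_i = U_i V_i^T to enforce low rank, so that f_i = x^T W_i y = 1^T (U_i^T x ∘ V_i^T y), where ∘ is the Hadamard (element-wise) product; a projection matrix P then yields the vector output f = P^T (U^T x ∘ V^T y).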
  • Optionally extend with biases (full model) and non-linear activations after the inputs or after the Hadamard product, plus residual shortcut connections as in residual networks.
  • Extend low-rank pooling to Multimodal Low-rank Bilinear Attention Networks (MLB) for VQA by using low-rank pooling in attention over image features and in computing the answer distribution.
  • Define attention α with a low-rank bilinear form over question q and visual features F, optionally across multiple glimpses G, and predict answers via another low-rank bilinear interaction.
  • Explore design choices: number of learning blocks, number of glimpses, non-linearity placement, answer sampling, shortcut connections, and data augmentation.
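
The factorization and attention steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the tiny random dimensions, the single glimpse, and the choice of tanh as the non-linearity are all assumptions made for the example, and biases are omitted.

```python
import math
import random


def matvec_t(M, x):
    """Return M^T x, where M is a list of len(x) rows with d columns each."""
    return [sum(M[i][k] * x[i] for i in range(len(x))) for k in range(len(M[0]))]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    return [p * q for p, q in zip(a, b)]


def low_rank_bilinear(x, y, U, V, P):
    """f = P^T (U^T x ∘ V^T y), without biases or non-linearities."""
    return matvec_t(P, hadamard(matvec_t(U, x), matvec_t(V, y)))


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def attend(q, F, Uq, Vf, p_alpha):
    """Single-glimpse attention: score each region with a low-rank bilinear form."""
    q_proj = [math.tanh(v) for v in matvec_t(Uq, q)]
    scores = []
    for region in F:
        r_proj = [math.tanh(v) for v in matvec_t(Vf, region)]
        scores.append(sum(w * h for w, h in zip(p_alpha, hadamard(q_proj, r_proj))))
    alpha = softmax(scores)
    # Attended feature: attention-weighted sum of the region vectors.
    attended = [sum(a * r[j] for a, r in zip(alpha, F)) for j in range(len(F[0]))]
    return alpha, attended


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, d, c, regions = 5, 4, 3, 2, 6  # hypothetical toy sizes
x = [rng.uniform(-1, 1) for _ in range(n)]
y = [rng.uniform(-1, 1) for _ in range(m)]
U, V, P = rand_matrix(n, d, rng), rand_matrix(m, d, rng), rand_matrix(d, c, rng)

# Identity behind the method: x^T (U V^T) y equals 1^T (U^T x ∘ V^T y).
W = [[sum(U[i][k] * V[j][k] for k in range(d)) for j in range(m)] for i in range(n)]
lhs = sum(x[i] * W[i][j] * y[j] for i in range(n) for j in range(m))
rhs = sum(hadamard(matvec_t(U, x), matvec_t(V, y)))
assert math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12)

f = low_rank_bilinear(x, y, U, V, P)  # c-dimensional pooled output

F = rand_matrix(regions, m, rng)  # one m-dimensional feature per image region
p_alpha = [rng.uniform(-1, 1) for _ in range(d)]
alpha, attended = attend(x, F, U, V, p_alpha)  # x stands in for the question vector
```

The assertion checks that the rank-d bilinear form never needs the full n x m matrix W: two projections and an element-wise product give the same value. In the attention sketch the same trick scores every image region against the question vector before the softmax.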

Experimental Results

Research Questions

  • RQ1: Can low-rank bilinear pooling via Hadamard product approximate full bilinear pooling effectively for multimodal tasks?
  • RQ2: Does MLB provide competitive or superior performance to compact bilinear pooling on visual question answering?
  • RQ3: What architectural choices (depth, glimpses, non-linearity placement, residual connections) optimize performance and parameter efficiency in MLB-based models?
  • RQ4: What is the impact of data augmentation (e.g., Visual Genome) on VQA performance with MLB?
  • RQ5: How does MLB compare to state-of-the-art single-model and ensemble models on VQA benchmarks?

Key Findings

  • MLB achieves state-of-the-art results on VQA, outperforming compact bilinear pooling baselines while offering better parameter parsimony.
  • Two-block models with one or two glimpses provide strong performance; increasing depth beyond two blocks shows diminishing returns in this setup.
  • Non-linear activations improve performance; placement of the activation (before vs after Hadamard product) shows similar benefits in experiments.
  • Data augmentation with Visual Genome significantly improves accuracy, especially on ETC-type answers.
  • Compared with contemporary methods, MLB achieves higher Open-Ended accuracy (about 65% or more) and competitive Multiple-Choice (MC) accuracy, outperforming several single-model baselines and approaching ensemble performance.

This review was generated by AI and reviewed by human editors.