QUICK REVIEW

[Paper Review] BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection

Hedi Ben-younes, Rémi Cadène|arXiv (Cornell University)|Jan 31, 2019

Multimodal Machine Learning Applications | References: 44 | Citations: 213

ONE-LINE SUMMARY

The paper introduces BLOCK, a multimodal fusion based on the block-term tensor decomposition, which achieves results competitive with the state of the art on VQA and VRD while using fewer parameters than many competing fusion models.

ABSTRACT

Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combination of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both concepts of rank and mode ranks for tensors, already used for multimodal fusion. It allows to define new ways for optimizing the tradeoff between the expressiveness and complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with respect to state-of-the-art multimodal fusion models for both VQA and VRD tasks. Our code is available at https://github.com/Cadene/block.bootstrap.pytorch.

MOTIVATION AND OBJECTIVES

Address the parameter explosion of bilinear multimodal fusion in VQA and VRD.
Propose BLOCK, a block-term tensor decomposition-based fusion to balance expressiveness and parameter efficiency.
Demonstrate BLOCK’s efficacy on VQA 2.0, TDIUC, and VRD datasets.
Provide extensive empirical comparisons against state-of-the-art fusion methods.
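
The parameter explosion named in the first objective can be made concrete with a little arithmetic. The sketch below compares the size of a full third-order interaction tensor with the size of a generic block-term decomposition of it. The dimension names (I, J, K for the two inputs and the output, R for the number of blocks, L, M, N for the block sizes) and the toy values are assumptions chosen for illustration, not figures from the paper, and the count ignores biases and the additional rank constraint on the core slices.

```python
def full_bilinear_params(I, J, K):
    """Parameters of an unconstrained interaction tensor of shape (I, J, K)."""
    return I * J * K


def block_term_params(I, J, K, R, L, M, N):
    """Parameters of a block-term decomposition with R blocks.

    Each block has factor matrices of shape (I, L), (J, M), (K, N)
    and a core tensor of shape (L, M, N).
    """
    return R * (I * L + J * M + K * N + L * M * N)


# Hypothetical toy sizes, used only to show the scaling behaviour.
small = dict(I=100, J=100, K=100)
large = dict(I=200, J=200, K=200)
blocks = dict(R=4, L=10, M=10, N=10)

for dims in (small, large):
    full = full_bilinear_params(**dims)
    block = block_term_params(**dims, **blocks)
    print(dims, "full:", full, "block-term:", block)
```

With the block sizes held fixed, the full tensor grows with the product of the input and output dimensions, while the block-term count grows only linearly in each of them.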

PROPOSED METHOD

Define a bilinear fusion model using a block-term decomposition of a third-order interaction tensor.
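Decompose the tensor into R blocks, each with factor matrices A_r, B_r, C_r and a core tensor D_r; together the D_r form a block-superdiagonal core.
Project each input through its own mono-modal projection to obtain a lower-dimensional representation of each modality, then fuse the two representations block by block to produce the output y.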
Constrain the rank of the third-mode slices of each block to control complexity.
Embed BLOCK into end-to-end architectures for VQA and VRD and optimize with standard stochastic methods.
Compare BLOCK against CP, Tucker, MFB, MUTAN, MFH and others on standard benchmarks.
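
The fusion steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration of a generic block-term bilinear fusion, not the authors' implementation (their PyTorch code is at the repository linked in the abstract). The function names, the array layout with a leading block axis, and the toy sizes are assumptions, and the sketch leaves out the rank constraint on the core slices, biases, and any nonlinearity or normalization.

```python
import numpy as np


def block_fusion(x1, x2, A, B, C, D):
    """Fuse two vectors with a block-term bilinear model.

    Shapes (R blocks): x1 (I,), x2 (J,), A (R, I, L), B (R, J, M),
    C (R, K, N), D (R, L, M, N). Returns y of shape (K,).
    """
    # Mono-modal projections: one small vector per block and per modality.
    x1_proj = np.einsum("i,ril->rl", x1, A)
    x2_proj = np.einsum("j,rjm->rm", x2, B)
    # Bilinear interaction inside each block through its core tensor D_r.
    z = np.einsum("rlmn,rl,rm->rn", D, x1_proj, x2_proj)
    # Map every block to the output space and sum the contributions.
    return np.einsum("rkn,rn->k", C, z)


def dense_interaction_tensor(A, B, C, D):
    """Rebuild the full (I, J, K) tensor that the blocks represent."""
    return np.einsum("rlmn,ril,rjm,rkn->ijk", D, A, B, C)


# Hypothetical toy sizes, chosen only to keep the example small.
I, J, K = 6, 5, 4
R, L, M, N = 3, 2, 2, 2

rng = np.random.default_rng(0)
A = rng.standard_normal((R, I, L))
B = rng.standard_normal((R, J, M))
C = rng.standard_normal((R, K, N))
D = rng.standard_normal((R, L, M, N))
x1 = rng.standard_normal(I)
x2 = rng.standard_normal(J)

y = block_fusion(x1, x2, A, B, C, D)

# Reference: contract the explicit tensor with both inputs.
T = dense_interaction_tensor(A, B, C, D)
y_dense = np.einsum("ijk,i,j->k", T, x1, x2)

# The two results should agree up to floating-point error.
print(np.allclose(y, y_dense))
```

The factored form never materializes the full tensor; the dense reconstruction is included only to check that both routes compute the same bilinear map.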

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can BLOCK's block-term decomposition provide a better trade-off between expressiveness and parameter count than existing bilinear fusion approaches?
RQ2: How does BLOCK perform on VQA and VRD tasks relative to state-of-the-art fusion methods under various parameter regimes?
RQ3: How do the number of blocks R and the block sizes affect performance and model size?
RQ4: Does BLOCK maintain strong mono-modal representations while enabling rich cross-modal interactions?

KEY FINDINGS

BLOCK achieves the best results among the fusion schemes compared on VQA 2.0 test-dev.
BLOCK uses about 18M parameters for fusion, outperforming many higher-parameter methods in key metrics.
On TDIUC, BLOCK outperforms prior methods, with notable gains on the normalized mean-per-type metrics (A-NMPT, H-NMPT), which are designed to be robust to answer-distribution bias.
On VRD, BLOCK outperforms previous methods in many settings on predicate, phrase, and relationship Recall@K, without using external data.
BLOCK provides a favorable trade-off between modeling capacity and parameter count, often surpassing higher-parameter fusion models with significantly fewer parameters.


This review was generated by AI and checked by a human editor.