[Paper Review] BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection
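BLOCK introduces a multimodal fusion based on the block-term tensor decomposition to model interactions between modalities. It achieves results competitive with the state of the art on VQA and VRD while using fewer parameters than many competing methods.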
Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combination of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both concepts of rank and mode ranks for tensors, already used for multimodal fusion. It allows to define new ways for optimizing the tradeoff between the expressiveness and complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with respect to state-of-the-art multimodal fusion models for both VQA and VRD tasks. Our code is available at https://github.com/Cadene/block.bootstrap.pytorch.
Research Motivation and Objectives
- Address the parameter explosion of bilinear multimodal fusion, whose parameter count grows quadratically with the input dimensions, for VQA and VRD.
- Propose BLOCK, a block-term tensor decomposition-based fusion to balance expressiveness and parameter efficiency.
- Demonstrate BLOCK’s efficacy on VQA 2.0, TDIUC, and VRD datasets.
- Provide extensive empirical comparisons against state-of-the-art fusion methods.
Proposed Method
- Define a bilinear fusion model using a block-term decomposition of a third-order interaction tensor.
- Decompose the tensor into R blocks with factors A_r, B_r, C_r and a block-superdiagonal core D_r.
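- Project the two inputs x^1 and x^2 through mono-modal projections (the stacked factors A_r and B_r), then fuse the projected vectors block by block through the cores D_r and map the result through C_r to produce the output y.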
- Constrain the rank of the third-mode slices of each block to control complexity.
- Embed BLOCK into end-to-end architectures for VQA and VRD and optimize with standard stochastic methods.
- Compare BLOCK against CP, Tucker, MFB, MUTAN, MFH and others on standard benchmarks.
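The fusion steps listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the stacked array layout, and the toy dimensions are all hypothetical, and the sketch omits the rank constraint on the core slices as well as any nonlinearity, dropout, or normalization a real model might use. It shows that fusing block by block gives the same output as contracting the inputs with the full third-order interaction tensor.

```python
import numpy as np


def block_fusion(x1, x2, A, B, C, D):
    """Fuse x1 (I,) and x2 (J,) into y (K,) with R blocks.

    Assumed (hypothetical) layout, with one entry per block r:
      A: (R, I, L), B: (R, J, M), C: (R, K, N), D: (R, L, M, N)
    """
    # Mono-modal projections: one L-dim / M-dim chunk per block.
    u = np.einsum("i,ril->rl", x1, A)
    v = np.einsum("j,rjm->rm", x2, B)
    # Bilinear interaction inside each block through its core D_r.
    z = np.einsum("rl,rm,rlmn->rn", u, v, D)
    # Map every block to the output space and sum over blocks.
    return np.einsum("rkn,rn->k", C, z)


def full_interaction_tensor(A, B, C, D):
    """Rebuild the full (I, J, K) tensor as the sum of the R block terms."""
    return np.einsum("rlmn,ril,rjm,rkn->ijk", D, A, B, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R, I, J, K, L, M, N = 3, 5, 6, 4, 2, 3, 2  # toy sizes, chosen arbitrarily
    A = rng.normal(size=(R, I, L))
    B = rng.normal(size=(R, J, M))
    C = rng.normal(size=(R, K, N))
    D = rng.normal(size=(R, L, M, N))
    x1, x2 = rng.normal(size=I), rng.normal(size=J)

    y_blocks = block_fusion(x1, x2, A, B, C, D)
    y_full = np.einsum("ijk,i,j->k", full_interaction_tensor(A, B, C, D), x1, x2)
    print("block-wise and full-tensor outputs agree:", np.allclose(y_blocks, y_full))
```

The block-wise form never materializes the I x J x K tensor, which is the point of the decomposition: only the factors and the small cores are stored and learned.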
Experimental Results
Research Questions
- RQ1: Can BLOCK's block-term decomposition provide a better trade-off between expressiveness and parameter count compared to existing bilinear fusion approaches?
- RQ2: How does BLOCK perform on VQA and VRD tasks relative to state-of-the-art fusion methods under various parameter regimes?
- RQ3: What are the effects of the number of blocks R and block sizes on performance and model size?
- RQ4: Does BLOCK maintain strong mono-modal representations while enabling rich cross-modal interactions?
Key Results
- BLOCK achieves best results among several fusion schemes on VQA2 test-dev in the reported comparisons.
- BLOCK uses about 18M parameters for fusion, outperforming many higher-parameter methods in key metrics.
- On TDIUC, BLOCK outperforms prior methods with notable gains in bias-robust and harmonic metrics (A-NMPT, H-NMPT).
- On VRD, BLOCK outperforms previous methods across predicate, phrase, and relationship Recall@K without external data in many settings.
- BLOCK provides a favorable trade-off between modeling capacity and parameter count, often surpassing higher-parameter fusion models with significantly fewer parameters.
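To make the capacity versus parameter-count trade-off concrete, the following sketch counts the parameters of a full bilinear tensor and of a block-term decomposition with R blocks. The dimensions are hypothetical round numbers chosen only for illustration; they are not the settings used in the paper. The count covers only the factors A_r, B_r, C_r and the cores D_r, ignoring biases and the rank constraint on the core slices.

```python
def full_bilinear_params(I, J, K):
    """Parameters of an unconstrained third-order interaction tensor."""
    return I * J * K


def block_term_params(I, J, K, R, L, M, N):
    """Parameters of R blocks, each with factors (I x L), (J x M), (K x N)
    and a core of size L x M x N."""
    return R * (I * L + J * M + K * N + L * M * N)


if __name__ == "__main__":
    # Hypothetical sizes for illustration only.
    I = J = K = 1000
    print("full tensor:      ", full_bilinear_params(I, J, K))
    print("block-term (R=10):", block_term_params(I, J, K, R=10, L=20, M=20, N=20))
    # Special cases of the same formula:
    print("CP, rank 10:      ", block_term_params(I, J, K, R=10, L=1, M=1, N=1))
    print("Tucker, 20^3 core:", block_term_params(I, J, K, R=1, L=20, M=20, N=20))
```

Setting L = M = N = 1 reduces the formula to a rank-R CP decomposition, and setting R = 1 reduces it to a Tucker decomposition, which is how the block-term ranks generalize both. Varying R and the block sizes moves the model between those two extremes.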
This review was generated by AI and reviewed by a human editor.