QUICK REVIEW

[Paper Review] Supervised Multimodal Bitransformers for Classifying Images and Text

Douwe Kiela, Suvrat Bhooshan | arXiv (Cornell University) | Sep 6, 2019
Multimodal Machine Learning Applications | References: 50 | Cited by: 163
One-Sentence Summary

Introduces a supervised multimodal bitransformer (MMBT) that projects image embeddings into BERT's token embedding space to fuse the text and image modalities, achieving results competitive with ViLBERT on text-heavy multimodal classification tasks without any multimodal pretraining.

ABSTRACT

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.

Motivation and Objectives

  • Motivate the need for effective multimodal fusion where text is the dominant modality.
  • Propose a simple baseline that finetunes unimodally pretrained text and image encoders for multimodal tasks.
  • Show that self-attention over both modalities yields strong performance on text-heavy multimodal classification tasks.
  • Demonstrate that the approach is competitive with multimodally pretrained models like ViLBERT while being simpler and extensible.

Proposed Method

  • Use a ResNet-152 image encoder and pool its output over a K × M grid to produce N = KM image embeddings.
  • Project each image embedding into D-dimensional BERT input space via learned matrices W_n.
  • Combine text and image embeddings as contextual embeddings fed into a BERT-like bidirectional transformer initialized from pretrained BERT weights.
  • Fine-tune the architecture end-to-end with task-appropriate loss (multiclass cross-entropy or binary cross-entropy for multilabel).
  • Handle a varying set of modalities through segment embeddings, which mark the modality of each input, and an input layer flexible enough to accept additional modalities.
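
The input construction and the two losses listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `add`, `rand_matrix`, `build_inputs`), the toy sizes, and the random stand-ins for learned parameters are all hypothetical, position indices are assumed to restart at 0 for each segment, and each W_n is stored as D rows of length P so that it maps a P-dimensional image vector to D dimensions. The transformer itself is omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def add(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def rand_matrix(rows, cols, rng):
    """Toy stand-in for learned or pretrained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_inputs(image_feats, text_embs, W, pos_emb, seg_emb):
    """Assemble the transformer input: projected image tokens, then text tokens.

    image_feats: N pooled image vectors of size P
    text_embs:   T token embeddings of size D
    W:           N projection matrices (one W_n per image embedding)
    pos_emb:     position embeddings, restarting at 0 per segment (assumption)
    seg_emb:     two segment embeddings; index 0 = image, index 1 = text
    """
    image_tokens = [
        add(matvec(W[n], feat), pos_emb[n], seg_emb[0])
        for n, feat in enumerate(image_feats)
    ]
    text_tokens = [
        add(tok, pos_emb[t], seg_emb[1]) for t, tok in enumerate(text_embs)
    ]
    return image_tokens + text_tokens


def cross_entropy(logits, target):
    """Multiclass loss: negative log-softmax of the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]


def binary_cross_entropy(logits, targets):
    """Multilabel loss: mean per-label sigmoid cross-entropy (stable form)."""
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


rng = random.Random(0)
N, P, D, T = 3, 8, 4, 5  # toy sizes, far smaller than ResNet-152 or BERT
image_feats = rand_matrix(N, P, rng)
text_embs = rand_matrix(T, D, rng)
W = [rand_matrix(D, P, rng) for _ in range(N)]
pos_emb = rand_matrix(max(N, T), D, rng)
seg_emb = rand_matrix(2, D, rng)

seq = build_inputs(image_feats, text_embs, W, pos_emb, seg_emb)
print(len(seq), len(seq[0]))  # prints "8 4": N + T tokens, each of size D
```

The resulting sequence of N + T vectors is what a BERT-style encoder would consume. The choice between `cross_entropy` and `binary_cross_entropy` depends only on whether the task assigns one label or several labels per example.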

Experimental Results

Research Questions

  • RQ1: Can unimodally pretrained text and image encoders fused through self-attention outperform traditional multimodal fusion baselines on text-heavy multimodal tasks?
  • RQ2: How close can a supervised multimodal fusion model approach the performance of self-supervised multimodal pretraining schemes like ViLBERT?
  • RQ3: Does freezing/unfreezing of components during fine-tuning affect multimodal fusion quality?
  • RQ4: Are the proposed image-to-BERT space mappings robust to missing modalities during inference?
  • RQ5: What are the comparative benefits of MMBT versus concatenation-based or gate-based fusion methods on hard multimodal cases?

Key Findings

  • MMBT outperforms several strong fusion baselines on text-heavy multimodal tasks (MM-IMDB, FOOD101, V-SNLI).
  • MMBT is competitive with ViLBERT, sometimes matching or exceeding its performance without multimodal pretraining.
  • Hard-subset evaluations show MMBT maintaining strong multimodal performance when unimodal signals conflict.
  • Freezing/unfreezing experiments indicate that unfreezing the image encoder earlier in fine-tuning yields better multimodal integration.
  • Constrained-parameter comparisons suggest MMBT can outperform deeper ConcatBert configurations, indicating effective fusion beyond parameter count alone.
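
The freezing/unfreezing finding above can be pictured as an epoch-based schedule. The sketch below is purely illustrative: the component names, the function `trainable_components`, and the epoch thresholds are hypothetical, and the review does not state which values the paper used. It also assumes that "earlier" means earlier than the text transformer.

```python
def trainable_components(epoch, unfreeze_image_at=1, unfreeze_text_at=3):
    """Return the components that receive gradient updates at `epoch`.

    The thresholds are placeholders; the only property taken from the
    finding is that the image encoder is unfrozen before the text side.
    """
    # Newly initialized parts (projections, classifier head) always train.
    active = {"image_projection", "classifier"}
    if epoch >= unfreeze_image_at:
        active.add("image_encoder")
    if epoch >= unfreeze_text_at:
        active.add("text_transformer")
    return active


for epoch in range(5):
    print(epoch, sorted(trainable_components(epoch)))
```

In a real training loop, the returned set would decide which parameter groups have gradients enabled for that epoch.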

This review was generated by AI and reviewed by human editors.