QUICK REVIEW

[Paper Review] Supervised Multimodal Bitransformers for Classifying Images and Text

Douwe Kiela, Suvrat Bhooshan | arXiv (Cornell University) | Sep 6, 2019

Multimodal Machine Learning Applications | References: 50 | Cited by: 163

One-Sentence Summary

Introduces a supervised multimodal bitransformer (MMBT) that projects image embeddings into BERT's token space to fuse the text and image modalities, achieving results competitive with ViLBERT on text-heavy multimodal classification tasks without any multimodal pretraining.

ABSTRACT

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.

Research Motivation and Objectives

Motivate the need for effective multimodal fusion where text is the dominant modality.
Propose a simple baseline that finetunes unimodally pretrained text and image encoders for multimodal tasks.
Show that self-attention over both modalities yields strong performance on text-heavy multimodal classification tasks.
Demonstrate that the approach is competitive with multimodally pretrained models like ViLBERT while being simpler and extensible.

Proposed Method

Use a ResNet-152 image encoder to produce N image embeddings by pooling over a K × M grid of cells (N = KM).
Project each image embedding into the D-dimensional BERT input space via a learned matrix W_n, one for each of the N embeddings.
Combine text and image embeddings as contextual embeddings fed into a BERT-like bidirectional transformer initialized from pretrained BERT weights.
Fine-tune the architecture end-to-end with task-appropriate loss (multiclass cross-entropy or binary cross-entropy for multilabel).
Handle variable modality presence through segment embeddings and a flexible input layer that is compatible with multiple modalities.
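
The steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy sizes, and the choice to place image tokens before text tokens are all assumptions made for illustration, and position embeddings, the pretrained encoders, and the transformer itself are omitted. It only shows how N image features of size P could be projected by per-position matrices W_n into D-dimensional vectors, tagged with segment embeddings, and joined with text embeddings, followed by the two loss forms mentioned for multiclass and multilabel tasks.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def build_fused_input(text_emb, image_feats, projections, seg_text, seg_image):
    """Build one input sequence holding both modalities.

    text_emb:    T vectors of size D (text token embeddings)
    image_feats: N vectors of size P (pooled image-encoder outputs)
    projections: N matrices of shape D x P (one W_n per image embedding)
    seg_text, seg_image: size-D segment embeddings marking each modality
    Ordering (image tokens first) is an assumption of this sketch.
    """
    image_tokens = [add(matvec(w_n, feat), seg_image)
                    for w_n, feat in zip(projections, image_feats)]
    text_tokens = [add(tok, seg_text) for tok in text_emb]
    return image_tokens + text_tokens


def softmax_cross_entropy(logits, target):
    """Multiclass loss: negative log-probability of the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def sigmoid_binary_cross_entropy(logits, targets):
    """Multilabel loss: mean binary cross-entropy over independent labels."""
    total = 0.0
    for z, y in zip(logits, targets):
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))  # log(1 + e^z)
        total += softplus - y * z
    return total / len(logits)


# Toy sizes (hypothetical): T=3 text tokens, N=2 image cells, P=4, D=3.
random.seed(0)
T, N, P, D = 3, 2, 4, 3
rand_vec = lambda size: [random.uniform(-1, 1) for _ in range(size)]
text_emb = [rand_vec(D) for _ in range(T)]
image_feats = [rand_vec(P) for _ in range(N)]
projections = [[rand_vec(P) for _ in range(D)] for _ in range(N)]
fused = build_fused_input(text_emb, image_feats, projections,
                          rand_vec(D), rand_vec(D))
print(len(fused), len(fused[0]))  # N + T vectors, each of size D
```

In a real model the fused sequence would be passed to the pretrained bidirectional transformer, and the classification logits fed to one of the two losses depending on whether the task is multiclass or multilabel.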

Experimental Results

Research Questions

RQ1: Can unimodally pretrained text and image encoders fused through self-attention outperform traditional multimodal fusion baselines on text-heavy multimodal tasks?
RQ2: How close can a supervised multimodal fusion model approach the performance of self-supervised multimodal pretraining schemes like ViLBERT?
RQ3: Does freezing/unfreezing of components during fine-tuning affect multimodal fusion quality?
RQ4: Are the proposed image-to-BERT space mappings robust to missing modalities during inference?
RQ5: What are the comparative benefits of MMBT versus concatenation-based or gate-based fusion methods on hard multimodal cases?

Key Findings

MMBT outperforms several strong fusion baselines on text-heavy multimodal tasks (MM-IMDB, FOOD101, V-SNLI).
MMBT is competitive with ViLBERT, sometimes matching or exceeding its performance without multimodal pretraining.
Hard-subset evaluations show MMBT maintaining strong multimodal performance when unimodal signals conflict.
Freezing/unfreezing experiments indicate that a strategy in which the image encoder is unfrozen earlier yields better multimodal integration.
Constrained-parameter comparisons suggest MMBT can outperform deeper ConcatBert configurations, indicating effective fusion beyond parameter count alone.
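
The freezing/unfreezing finding can be pictured as a simple epoch-based schedule. The sketch below is hypothetical: the component names, the epoch thresholds, and the assumption that the newly added projection and classifier layers train from the start are illustrative choices, not values reported in the summary above.

```python
def trainable_components(epoch, unfreeze_image_at, unfreeze_text_at):
    """Return the names of components that receive gradient updates at `epoch`.

    Assumes (hypothetically) that the image projection and the classifier
    head are always trained, while the pretrained image encoder and text
    transformer stay frozen until their respective epochs.
    """
    active = {"image_projection", "classifier"}
    if epoch >= unfreeze_image_at:
        active.add("image_encoder")
    if epoch >= unfreeze_text_at:
        active.add("text_transformer")
    return active


# Example schedule (made-up thresholds): the image encoder is unfrozen
# earlier than the text transformer.
for epoch in range(5):
    print(epoch, sorted(trainable_components(epoch, unfreeze_image_at=1,
                                             unfreeze_text_at=3)))
```

In a training loop, the returned set would decide which parameter groups have gradients enabled for that epoch.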


This review was generated by AI and reviewed by human editors.