[Paper Summary] Supervised Multimodal Bitransformers for Classifying Images and Text
The paper introduces a supervised multimodal bitransformer (MMBT) that maps image embeddings into BERT's token space to fuse the text and image modalities. It achieves results competitive with ViLBERT on text-heavy multimodal classification tasks without any multimodal pretraining.
Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
Research Motivation and Goals
- Motivate the need for effective multimodal fusion where text is the dominant modality.
- Propose a simple baseline that finetunes unimodally pretrained text and image encoders for multimodal tasks.
- Show that self-attention over both modalities yields strong performance on text-heavy multimodal classification tasks.
- Demonstrate that the approach is competitive with multimodally pretrained models like ViLBERT while being simpler and extensible.
Proposed Method
- Use a ResNet-152 image encoder and pool its final feature map over a K × M grid to produce N = KM image embeddings.
- Project each image embedding into D-dimensional BERT input space via learned matrices W_n.
- Combine text and image embeddings as contextual embeddings fed into a BERT-like bidirectional transformer initialized from pretrained BERT weights.
- Fine-tune the architecture end-to-end with task-appropriate loss (multiclass cross-entropy or binary cross-entropy for multilabel).
- Handle variable modality presence with segment embeddings and a flexible input layer that is compatible with multiple modalities.
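The input layer described in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`project`, `add`, `build_fused_sequence`, `rand_matrix`) are hypothetical, the dimensions are toy values (real ResNet-152 features have P = 2048 and BERT-base uses D = 768), and both the segment assignment (0 = image, 1 = text) and the choice to restart position indices within each segment are assumptions made for illustration. The transformer itself and fine-tuning are omitted.

```python
import random

def project(feature, weight):
    """Map one P-dim image feature into D dims; weight has shape P x D."""
    dim_out = len(weight[0])
    return [
        sum(feature[p] * weight[p][d] for p in range(len(feature)))
        for d in range(dim_out)
    ]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

def build_fused_sequence(image_feats, text_embeds, weights, seg_table, pos_table):
    """Return (sequence, segment_ids, position_ids) for the transformer input.

    image_feats: N pooled image features, each of size P
    text_embeds: T token embeddings, each of size D
    weights:     N projection matrices W_n, each P x D
    seg_table:   2 segment embeddings (0 = image, 1 = text; an assumption)
    pos_table:   position embeddings, indexed from 0 within each segment
                 (also an assumption)
    """
    sequence, segment_ids, position_ids = [], [], []
    # Image part: project each embedding with its own W_n, then add
    # segment and position embeddings, as BERT does for word tokens.
    for n, feat in enumerate(image_feats):
        token = project(feat, weights[n])
        sequence.append(add(token, seg_table[0], pos_table[n]))
        segment_ids.append(0)
        position_ids.append(n)
    # Text part: token embeddings are already D-dimensional.
    for t, token in enumerate(text_embeds):
        sequence.append(add(token, seg_table[1], pos_table[t]))
        segment_ids.append(1)
        position_ids.append(t)
    return sequence, segment_ids, position_ids

def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned or encoded values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
P, D, N, T = 8, 4, 3, 5  # toy sizes for illustration only
image_feats = rand_matrix(N, P, rng)
text_embeds = rand_matrix(T, D, rng)
weights = [rand_matrix(P, D, rng) for _ in range(N)]
seg_table = rand_matrix(2, D, rng)
pos_table = rand_matrix(max(N, T), D, rng)

sequence, segment_ids, position_ids = build_fused_sequence(
    image_feats, text_embeds, weights, seg_table, pos_table
)
print(len(sequence), len(sequence[0]))  # 8 4
print(segment_ids)                      # [0, 0, 0, 1, 1, 1, 1, 1]
print(position_ids)                     # [0, 1, 2, 0, 1, 2, 3, 4]
```

The resulting sequence of N + T vectors, all of size D, is what a BERT-style bidirectional transformer would attend over, so self-attention can mix image and text positions freely in every layer.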
Experimental Results
Research Questions
- RQ1: Can unimodally pretrained text and image encoders fused through self-attention outperform traditional multimodal fusion baselines on text-heavy multimodal tasks?
- RQ2: How closely can a supervised multimodal fusion model approach the performance of self-supervised multimodal pretraining schemes like ViLBERT?
- RQ3: Does freezing/unfreezing of components during fine-tuning affect multimodal fusion quality?
- RQ4: Are the proposed image-to-BERT space mappings robust to missing modalities during inference?
- RQ5: What are the comparative benefits of MMBT versus concatenation-based or gate-based fusion methods on hard multimodal cases?
Key Findings
- MMBT outperforms several strong fusion baselines on text-heavy multimodal tasks (MM-IMDB, FOOD101, V-SNLI).
- MMBT is competitive with ViLBERT, sometimes matching or exceeding its performance without multimodal pretraining.
- Hard-subset evaluations show MMBT maintaining strong multimodal performance when unimodal signals conflict.
- Freezing/unfreezing experiments indicate that unfreezing the image encoder earlier yields better multimodal integration.
- Constrained-parameter comparisons suggest MMBT can outperform deeper ConcatBert configurations, indicating effective fusion beyond parameter count alone.
This summary was generated by AI and reviewed by a human editor.