QUICK REVIEW

[Paper Review] Multimodal Masked Autoencoders Learn Transferable Representations

Xinyang Geng, Hao Liu | arXiv (Cornell University) | May 27, 2022

Multimodal Machine Learning Applications | Cited by 29

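One-Sentence Summary

M3AE learns a unified vision-language representation through masked token reconstruction, without modality-specific encoders or contrastive learning, and yields representations that transfer to downstream tasks such as ImageNet linear classification and OOD detection.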

ABSTRACT

Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%), in contrast to BERT whose standard masking ratio is 15%, due to the joint training of two data modalities. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language. Lastly, we demonstrate the scalability of M3AE with larger model size and training time, and its flexibility to train on both paired image-text data as well as unpaired data.

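Motivation and Objectives

Investigate whether a large multimodal model trained purely via masked token prediction can learn representations that transfer across vision and language.
Develop a simple, scalable architecture that uses one unified encoder for both modalities rather than modality-specific encoders.
Evaluate how multimodal pretraining on large-scale image-text data affects performance on downstream tasks such as image classification and OOD detection.
Evaluate the model's ability to use both paired and unpaired data within a single training framework.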

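Proposed Method

Treats each image-text pair as one long sequence (image patches + text tokens).
Masks a high ratio of image patches and text tokens, then reconstructs the missing parts with a unified Transformer encoder-decoder.
Uses modality-specific embeddings together with a shared CLS token to map both modalities into a common representation space.
Trains with a reconstruction objective: mean squared error (MSE) for masked image patches and cross-entropy for masked text tokens, with the loss computed only on masked elements.
Allows training on a mixture of paired and unpaired data, enabling flexible data use without a contrastive loss.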
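
The masking and loss steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the 0.75 default mask ratios, and the unweighted sum of the two losses are all hypothetical choices, and the predictions are passed in as placeholders rather than produced by a Transformer that sees only the unmasked tokens.

```python
import math
import random


def random_mask(length, mask_ratio, rng):
    """Pick which positions of a sequence to mask (sorted indices)."""
    num_masked = int(round(length * mask_ratio))
    return sorted(rng.sample(range(length), num_masked))


def mse_on_masked(pred_patches, true_patches, masked):
    """Mean squared error computed over masked image patches only."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred_patches[i], true_patches[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def cross_entropy_on_masked(logits, token_ids, masked):
    """Mean cross-entropy computed over masked text tokens only."""
    total = 0.0
    for i in masked:
        row = logits[i]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[token_ids[i]]
    return total / len(masked) if masked else 0.0


def m3ae_style_loss(pred_patches, true_patches, text_logits, token_ids,
                    image_mask_ratio=0.75, text_mask_ratio=0.75, seed=0):
    """Hypothetical combined objective: MSE on masked patches plus
    cross-entropy on masked tokens (equal weighting is an assumption)."""
    rng = random.Random(seed)
    image_masked = random_mask(len(true_patches), image_mask_ratio, rng)
    text_masked = random_mask(len(token_ids), text_mask_ratio, rng)
    image_loss = mse_on_masked(pred_patches, true_patches, image_masked)
    text_loss = cross_entropy_on_masked(text_logits, token_ids, text_masked)
    return image_loss + text_loss, image_masked, text_masked
```

In this sketch an unpaired image is handled by passing empty lists for `text_logits` and `token_ids`, so the text term contributes zero and only the image reconstruction loss remains.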

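Experimental Results

Research Questions

RQ1: Can M3AE learn generalizable representations that transfer to downstream tasks such as ImageNet classification and OOD detection?
RQ2: Do the learned representations contain meaningful information from both the image and language modalities?
RQ3: How do model size, training time, and masking strategy affect performance and transfer?
RQ4: Can M3AE make effective use of both paired image-text data and unpaired data under a single training objective?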

Key Findings

Model	MAE	M3AE	CLIP	Supervised
Accuracy (%)	44.6	61.3	69.0	81.8

M3AE text ratio	10%	20%	30%	100%
Accuracy (%)	53.3	54.0	54.5	58.8

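M3AE substantially outperforms MAE on ImageNet linear classification (for example, 61.3 vs. 44.6 in some settings).
M3AE can train on a mixture of paired and unpaired data and still transfers strongly when part of the data is unpaired.
A higher text mask ratio (roughly 50-75% or more) improves M3AE's performance, unlike the conventional BERT-style setting.
M3AE scales well with larger model size and longer training time, consistently outperforming MAE across variants such as ViT-S/16, ViT-B/16, and ViT-L/16.
Qualitative analysis shows attention aligning with relevant image regions and the corresponding text tokens, indicating joint vision-language understanding.
M3AE shows robust out-of-distribution detection and reconstruction quality on CC12M and ImageNet.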
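
The accuracy figures above come from linear classification, where the pretrained encoder is frozen and only a linear classifier is trained on its output features. The sketch below illustrates that evaluation protocol with a softmax classifier fit by full-batch gradient descent. It is a toy under stated assumptions: the feature vectors stand in for frozen encoder outputs, and every name and hyperparameter here (`train_linear_probe`, `lr`, `epochs`) is hypothetical rather than taken from the paper.

```python
import math


def class_scores(W, b, x):
    """Linear scores W x + b, one per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit only W and b on frozen features; the features are never updated."""
    dim = len(features[0])
    n = len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(class_scores(W, b, x))
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. the score of class c.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for j in range(dim):
                    grad_W[c][j] += err * x[j] / n
        for c in range(num_classes):
            b[c] -= lr * grad_b[c]
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j]
    return W, b


def accuracy(W, b, features, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        scores = class_scores(W, b, x)
        if max(range(len(scores)), key=scores.__getitem__) == y:
            correct += 1
    return correct / len(labels)
```

Because the encoder stays fixed, the resulting accuracy reflects how linearly separable the pretrained features are, which is why it is commonly used to compare representations such as those of MAE and M3AE.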

This review was generated by AI and reviewed by human editors.