QUICK REVIEW

[Paper Review] Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

Wanrong Zhu, Jack Hessel|arXiv (Cornell University)|Apr 14, 2023

Multimodal Machine Learning Applications | Cited by 17

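One-Sentence Summary

MMC4 is a public, billion-scale image-text corpus that interleaves images into the sentences of the text-only c4 dataset to support multimodal in-context learning; it achieves strong intra-document alignment, provides subsets targeting privacy and efficiency, and demonstrates its utility through OpenFlamingo experiments.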

ABSTRACT

In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also, more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. Multimodal C4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (88%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (80%). After filtering NSFW images, ads, etc., the resulting corpus consists of 101.2M documents with 571M images interleaved in 43B English tokens.

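Motivation and Goals

Create a large, publicly available corpus of interleaved images and text to support multimodal in-context learning.
Describe the construction pipeline, which uses CLIP-based linear assignment to align images with sentences inside each document.
Evaluate the quality, relevance, and alignment of images and text across a broad range of topics and document sources.
Provide filtered subsets (mmc4-ff and mmc4-core) to meet privacy and development needs.
Show the benefit of early multimodal model training with OpenFlamingo on the mmc4-core corpus.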

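Proposed Method

Extend the text-only c4 corpus by retrieving images from the source web pages and interleaving them with the text, framed as a bipartite assignment problem.
Compute pairwise image-sentence similarities within each document using CLIP ViT-L/14.
Apply a linear assignment algorithm to match images to sentences, under the constraint that each sentence receives at most one image.
Filter images by size, aspect ratio, duplication, and inappropriate content in a multi-stage pipeline.
Create the subsets mmc4-ff (fewer faces) and mmc4-core (stricter filtering and reduced size).
Release the intra-document similarity matrices and alignments so that alternative assignment methods can be applied.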
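
The similarity and assignment steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the embedding vectors are toy values standing in for CLIP ViT-L/14 features, the function names (`similarity_matrix`, `assign_images`) are hypothetical, and the brute-force search is an assumption that only works for tiny documents. A real pipeline would more plausibly use a dedicated solver such as `scipy.optimize.linear_sum_assignment`.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(image_vecs, sentence_vecs):
    """Rows are images, columns are sentences."""
    return [[cosine(img, sent) for sent in sentence_vecs] for img in image_vecs]


def assign_images(sim):
    """Return result[i] = sentence index for image i.

    Maximizes total similarity with at most one image per sentence.
    Brute force over all placements, so only suitable for toy inputs.
    Simplifying assumption: no more images than sentences.
    """
    n_images, n_sentences = len(sim), len(sim[0])
    if n_images > n_sentences:
        raise ValueError("this sketch assumes no more images than sentences")
    best, best_score = None, float("-inf")
    for placement in permutations(range(n_sentences), n_images):
        score = sum(sim[i][j] for i, j in enumerate(placement))
        if score > best_score:
            best, best_score = placement, score
    return list(best)


# Toy embeddings (hypothetical values, not real CLIP features).
image_vecs = [[1.0, 0.0], [0.8, 0.6]]
sentence_vecs = [[1.0, 0.0], [0.0, 1.0]]

sim = similarity_matrix(image_vecs, sentence_vecs)
assignment = assign_images(sim)
print(assignment)
```

Both toy images are most similar to sentence 0, but the one-image-per-sentence constraint moves the second image to sentence 1, where it costs the least total similarity.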

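Experimental Results

Research Questions

RQ1: Does large-scale interleaved image+text data improve multimodal in-context learning compared with non-interleaved image-caption data?
RQ2: How well are the interleaved images aligned with the sentences of their documents, and how does alignment quality vary across topics?
RQ3: What are the effects and trade-offs of filtering (privacy, NSFW, faces) and of document/image statistics on downstream model training?
RQ4: Do subsets such as mmc4-ff and mmc4-core give developers usable, privacy-conscious alternatives?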

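Key Findings

MMC4 contains 101.2M documents with 571M interleaved images and 43B tokens in total.
Two main subsets are released: mmc4-ff (fewer faces) and mmc4-core (stricter filtering).
Manual inspection of a sample shows that 88% of images are topically relevant to their document and 80% are well aligned with their assigned sentence.
Zero-shot CLIP ViT-L/14 outperforms some fine-tuned baselines on intra-document image-text alignment.
Linear assignment spreads images more evenly across sentences, raising the average fraction of sentences with an image from 22% (max assignment) to 34% (linear assignment).
A random sample of 200 mmc4 documents (836 images) shows that 87.7% of images are topically relevant and 80.4% are well aligned with their assigned sentence; 28.3% contain faces, 1.6% watermarks, 3.9% logos, 3.2% ads, and 0.7% are duplicates.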
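
The difference between max assignment and linear assignment in sentence coverage can be illustrated with a toy similarity matrix. This is a sketch under stated assumptions: the matrix values are made up, the function names are hypothetical, "max assignment" is taken to mean that each image independently picks its most similar sentence, and the brute-force linear assignment is only feasible for very small inputs. The resulting fractions are for illustration and are not the 22% and 34% figures reported above.

```python
from itertools import permutations


def max_assignment(sim):
    """Each image independently picks its most similar sentence."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def linear_assignment(sim):
    """Maximize total similarity with at most one image per sentence.

    Brute force; assumes no more images than sentences.
    """
    n_images, n_sentences = len(sim), len(sim[0])
    best = max(
        permutations(range(n_sentences), n_images),
        key=lambda p: sum(sim[i][j] for i, j in enumerate(p)),
    )
    return list(best)


def sentence_coverage(assignment, n_sentences):
    """Fraction of sentences that received at least one image."""
    return len(set(assignment)) / n_sentences


# Toy matrix: 2 images (rows) x 4 sentences (columns), made-up values.
sim = [
    [0.9, 0.5, 0.1, 0.1],
    [0.7, 0.2, 0.6, 0.1],
]

greedy = max_assignment(sim)
linear = linear_assignment(sim)
greedy_cov = sentence_coverage(greedy, 4)
linear_cov = sentence_coverage(linear, 4)
print(greedy, greedy_cov)
print(linear, linear_cov)
```

Under max assignment both images land on sentence 0, so only one of the four sentences carries an image. Linear assignment places them on different sentences, which doubles the coverage in this toy case.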


This review was generated by AI and checked by a human editor.