QUICK REVIEW

[Paper Review] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Zhicheng Huang, Zhaoyang Zeng | arXiv (Cornell University) | Apr 2, 2020

Multimodal Machine Learning Applications | References: 40 | Cited by: 286

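One-Sentence Summary

Pixel-BERT learns general-purpose vision-language embeddings by aligning image pixels with text in an end-to-end Transformer framework. It is pre-trained on image-text pairs without region features and reports state-of-the-art results on VQA, NLVR2, and image-text retrieval.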

ABSTRACT

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs instead of using region-based image features as the most recent vision and language tasks. Our Pixel-BERT which aligns semantic connection in pixel and text level solves the limitation of task-specific visual representation for vision and language tasks. It also relieves the cost of bounding box annotations and overcomes the unbalance between semantic labels in visual task and language semantic. To provide a better representation for down-stream tasks, we pre-train a universal end-to-end model with image and sentence pairs from Visual Genome dataset and MS-COCO dataset. We propose to use a random pixel sampling mechanism to enhance the robustness of visual representation and to apply the Masked Language Model and Image-Text Matching as pre-training tasks. Extensive experiments on downstream tasks with our pre-trained model show that our approach makes the most state-of-the-arts in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, Natural Language for Visual Reasoning for Real (NLVR). Particularly, we boost the performance of a single model in VQA task by 2.17 points compared with SOTA under fair comparison.

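Motivation and Objectives

Motivation: align visual and language semantics directly at the pixel level rather than through region features.
Propose Pixel-BERT, an end-to-end model that combines a CNN visual encoder with a multi-modal Transformer.
Pre-train on image-sentence pairs from Visual Genome and MS-COCO with Masked Language Modeling (MLM) and Image-Text Matching (ITM), combined with a pixel sampling mechanism to improve robustness.
Demonstrate performance gains over prior region-based methods on VQA, NLVR2, and image-text retrieval.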
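The two pre-training objectives named above, MLM and ITM, both come down to how training examples are built from image-sentence pairs. The following is a minimal sketch of that data preparation, not the authors' implementation. The function names, the `[MASK]` string, the 0.15 masking probability (borrowed from the usual BERT convention), and the one-negative-per-positive ITM scheme are all assumptions made for illustration. BERT's 80/10/10 replacement rule is deliberately left out.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string, following BERT convention


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with MASK_TOKEN with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None elsewhere, so a loss would only be
    computed where a token was hidden.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def make_itm_pairs(pairs, rng=None):
    """Build Image-Text Matching examples from (image_id, sentence) pairs.

    For every pair, emit the matched example (label 1) and one mismatched
    example (label 0) whose sentence comes from a different pair.
    Requires at least two input pairs.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to draw a negative")
    if rng is None:
        rng = random.Random()
    examples = []
    for i, (image_id, sentence) in enumerate(pairs):
        examples.append((image_id, sentence, 1))
        other = rng.choice([j for j in range(len(pairs)) if j != i])
        examples.append((image_id, pairs[other][1], 0))
    return examples
```

In a full model, the masked tokens would be predicted by the Transformer from both the remaining text and the visual features, and the ITM label would be predicted from the joint representation. Both of those steps are outside this sketch.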

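Proposed Method

Encode image pixels into visual embeddings with a fully convolutional CNN backbone.
Embed the language input with BERT-style word-level embeddings plus position and semantic encodings.
Combine the visual and language embeddings in a single Transformer to learn cross-modal interactions.
Pre-train with Masked Language Modeling (MLM), which predicts masked text tokens conditioned on the visual input, and Image-Text Matching (ITM) to learn the alignment.
Apply a random pixel sampling mechanism during pre-training to improve robustness and reduce overfitting.
Fine-tune on downstream tasks by feeding the [CLS] token into a task-specific classifier.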
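The random pixel sampling step in the list above can be illustrated with a small, dependency-free sketch. It treats the CNN output as an H x W grid of feature vectors, flattens it into a sequence of "pixel" features, and keeps a random subset during pre-training only. The function names, the `training` flag, the choice to preserve spatial order, and the sample count are hypothetical. This summary does not state the number of sampled pixels, so it is left as a parameter.

```python
import random


def flatten_feature_map(feature_map):
    """Flatten an H x W grid of feature vectors into a list of H*W vectors."""
    return [vec for row in feature_map for vec in row]


def sample_pixels(pixel_features, num_samples, training=True, rng=None):
    """Randomly keep num_samples pixel features (without replacement).

    Sampling is applied only when training is True. Otherwise, or when the
    feature map has no more than num_samples entries, all features are kept.
    """
    if rng is None:
        rng = random.Random()
    if not training or num_samples >= len(pixel_features):
        return list(pixel_features)
    # Sorting the chosen indices keeps the original spatial order;
    # this is an illustrative choice, not something the source specifies.
    indices = sorted(rng.sample(range(len(pixel_features)), num_samples))
    return [pixel_features[i] for i in indices]


# Toy usage: a 3 x 4 feature map with 2-dimensional features.
toy_map = [[[float(r), float(c)] for c in range(4)] for r in range(3)]
toy_pixels = flatten_feature_map(toy_map)
toy_subset = sample_pixels(toy_pixels, num_samples=5, rng=random.Random(0))
```

Because each pre-training step sees a different subset of pixel features, the Transformer cannot rely on any single location, which is one plausible reading of the robustness benefit described above. Sampling also shortens the visual sequence, which reduces the cost of attention.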

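Experimental Results

Research Questions

RQ1: Does learning pixel-level visual representations jointly with text improve cross-modal understanding beyond what region features provide?
RQ2: Do the MLM and ITM pre-training tasks on pixel-level inputs lead to better vision-language alignment and downstream performance?
RQ3: Compared with region-based methods, how does pixel-level cross-modal attention affect VQA, NLVR2, and image-text retrieval?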

Key Findings

Model	VQA test-dev	VQA test-std
Pixel-BERT (r50)	71.35	71.42
Pixel-BERT (x152)	74.45	74.55

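With a ResNeXt-152 backbone, Pixel-BERT reaches 74.55 on VQA test-std, surpassing several prior methods.
Pixel-BERT (x152) also reaches 74.45 on test-dev. The abstract reports a 2.17-point gain for a single model over the previous state of the art on VQA under fair comparison.
On NLVR2, Pixel-BERT reaches 77.2 on test-P and 76.5 on dev, outperforming several baselines.
On image-text retrieval, Pixel-BERT improves substantially over Unicoder-VL and UNITER, with higher recall on both MS-COCO and Flickr30K.
Ablation studies show that MLM and ITM both substantially improve downstream performance, and that random pixel sampling adds further gains, especially on retrieval.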
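The retrieval findings above are stated in terms of recall. Assuming this refers to the usual Recall@K metric for image-text retrieval (this summary only says "recall metrics"), a minimal sketch of the computation looks like this. The function name and input layout are hypothetical.

```python
def recall_at_k(ranked_candidates, ground_truth, k):
    """Fraction of queries with at least one correct item in the top k.

    ranked_candidates: one list per query, candidate ids ordered best-first.
    ground_truth: one set per query, containing the correct candidate ids.
    """
    if len(ranked_candidates) != len(ground_truth):
        raise ValueError("need one ground-truth set per query")
    if not ranked_candidates:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, correct in zip(ranked_candidates, ground_truth)
        if any(candidate in correct for candidate in ranked[:k])
    )
    return hits / len(ranked_candidates)
```

For text-to-image retrieval each query is a sentence and the candidates are images; for image-to-text retrieval the roles are swapped. The same function covers both directions.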

This review was generated by AI and reviewed by human editors.