QUICK REVIEW

[Paper Review] VD-BERT: A Unified Vision and Dialog Transformer with BERT

Yue Wang, Shafiq Joty|arXiv (Cornell University)|Apr 28, 2020

Multimodal Machine Learning Applications | References: 60 | Cited by: 30

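ONE-SENTENCE SUMMARY

VD-BERT introduces a BERT-based single-stream vision-dialog Transformer that jointly models image content and multi-turn dialog, achieving state-of-the-art NDCG on VisDial without external vision-language pretraining.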

ABSTRACT

Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need of pretraining on external vision-language data, our model yields new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard. Our code and pretrained models are released at https://github.com/salesforce/VD-BERT.

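MOTIVATION AND OBJECTIVES

Frame visual dialog as a multi-turn reasoning task that requires integrating image content with dialog history.
Propose a unified Transformer model that handles both the discriminative (answer ranking) and generative (answer generation) settings of Visual Dialog.
Show that visually grounded training of BERT can reach state-of-the-art results without pretraining on external vision-language data.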

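PROPOSED METHOD

Encode the image as object-level features and fuse them with the caption and the multi-turn dialog in a single Transformer encoder initialized from BERT.
Train with visually grounded objectives (Masked Language Modeling and Next Sentence Prediction) and use two self-attention masks (bidirectional and seq2seq) so that the same model serves both the discriminative and generative settings.
Append each answer candidate to the input so that it is fused early with the other entities in the sequence.
In the discriminative setting, rank candidates by their NSP scores; in the generative setting, generate the answer autoregressively with the same encoder under the seq2seq mask.
Fine-tune on the dense relevance annotations with a ranking loss (ListNet) to improve ranking quality.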
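
The two self-attention masks named above can be illustrated with a minimal Python sketch. It assumes a common seq2seq masking convention in which 1 means "may attend", and it assumes the input is split into a context part (image objects, caption, dialog history, current question) followed by the answer tokens. The function names and this split are illustrative assumptions, not taken from the released VD-BERT code.

```python
def bidirectional_mask(seq_len):
    """Every position may attend to every other position (discriminative setting)."""
    return [[1] * seq_len for _ in range(seq_len)]


def seq2seq_mask(context_len, answer_len):
    """Context attends only to context; answer tokens attend to the full
    context plus answer tokens up to and including themselves."""
    n = context_len + answer_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):          # i: querying position
        for j in range(n):      # j: position being attended to
            if j < context_len:
                mask[i][j] = 1  # the context is visible to every position
            elif i >= context_len and j <= i:
                mask[i][j] = 1  # left-to-right visibility inside the answer
    return mask


if __name__ == "__main__":
    # Toy sizes: 3 context positions followed by 2 answer positions.
    for row in seq2seq_mask(context_len=3, answer_len=2):
        print(row)
```

Under this sketch, switching between the ranking and generation settings only changes which mask is passed to the encoder, which is why no separate decoder is needed.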

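EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a single unified Transformer encoder effectively model the bidirectional interactions among image objects, dialog history, and candidate answers in visual dialog?
RQ2: Can a BERT-based model be trained for both the discriminative (ranking) and generative (generation) VisDial settings without a separate decoder or external vision-language pretraining?
RQ3: How do the visually grounded MLM and NSP objectives affect the fusion of the vision and dialog modalities?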

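KEY FINDINGS

VD-BERT sets a new state of the art on the VisDial v1.0 test set, with an NDCG of 74.54 for a single model and 75.35 for an ensemble.
VD-BERT outperforms prior single-model baselines in the discriminative setting and delivers competitive generative results without external vision-language pretraining.
Fine-tuning on dense annotations substantially raises NDCG (for example, from 59.96 to 74.54) but can lower other metrics such as MRR and R@k, which indicates that the metrics are not aligned.
Initializing from BERT gives a large gain over training from scratch; visually grounded MLM is critical for transfer to the multimodal setting.
A unified Transformer with two self-attention masks supports both the discriminative and generative VisDial settings without an explicit decoder.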
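
The tension between NDCG and MRR after dense-annotation fine-tuning can be reproduced on toy data. The sketch below assumes the standard top-one ListNet loss (cross-entropy between the softmax of the relevance labels and the softmax of the model scores) and an NDCG computed over the top k candidates, where k is the number of candidates with non-zero relevance. Both formulations, the helper names, and the numbers are illustrative assumptions; they are not results or code from the paper.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores, relevance):
    """Top-one ListNet: cross-entropy between relevance and score distributions."""
    target = softmax(relevance)
    pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, pred))


def dcg(rels):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))


def ndcg(scores, relevance):
    """NDCG over the top k candidates, k = number of non-zero relevance labels."""
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevance[i] for i in order[:k]]
    ideal = sorted(relevance, reverse=True)[:k]
    return dcg(ranked) / dcg(ideal)


def reciprocal_rank(scores, gt_index):
    """1 / rank of the single ground-truth answer (averaged over a dataset, this is MRR)."""
    rank = 1 + sum(1 for s in scores if s > scores[gt_index])
    return 1.0 / rank


# Hypothetical example: candidate 0 is the ground-truth answer, but annotators
# judged candidates 1 and 2 to be more relevant.
relevance = [0.4, 1.0, 1.0, 0.0]
gt_index = 0
scores_gt_first = [3.0, 2.0, 1.0, 0.0]   # ranks the ground truth on top
scores_dense = [1.0, 3.0, 2.0, 0.0]      # follows the dense relevance labels

if __name__ == "__main__":
    for name, s in [("gt_first", scores_gt_first), ("dense", scores_dense)]:
        print(name, ndcg(s, relevance), reciprocal_rank(s, gt_index),
              listnet_loss(s, relevance))
```

In this toy case the scores that follow the dense labels get a lower ListNet loss and a higher NDCG, yet a lower reciprocal rank for the ground-truth answer, which mirrors the metric misalignment described above.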


This review was generated by AI and reviewed by a human editor.