QUICK REVIEW

[Paper Review] Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Zhe Gan, Yen-Chun Chen|arXiv (Cornell University)|Jun 11, 2020

Multimodal Machine Learning Applications | References: 89 | Cited by: 287

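One-Sentence Summary

VILLA introduces large-scale adversarial training for vision-and-language models by perturbing the embeddings of the image and text modalities within a two-stage framework (adversarial pre-training and adversarial fine-tuning), achieving state-of-the-art performance on multiple V+L tasks.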

ABSTRACT

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.

Research Motivation and Goals

Motivate robust multimodal pre-training to improve generalization on downstream V+L tasks.
Propose adversarial training in embedding space rather than pixel/token space for scalability and effectiveness.
Demonstrate two-stage adversarial training (pre-training and fine-tuning) across multiple V+L architectures.
Show empirical gains on a wide suite of vision-and-language benchmarks.

Proposed Method

Apply adversarial perturbations in the embedding space of both modalities, adding them to image region features and word embeddings rather than to raw pixels or tokens.
Use a two-stage training regime: task-agnostic adversarial pre-training followed by task-specific adversarial fine-tuning.
Adopt a “free” adversarial training strategy to enable large-scale training: parameter gradients are accumulated over multiple PGD steps and applied in a single update.
Regularize with KL-divergence-based terms so that predictions on clean and perturbed inputs stay close in both label and confidence.
Optimize a composite objective that sums the standard task loss, the adversarial training loss, and the KL-based regularizer.
Apply adversarial training in both MLM and ITM pre-training tasks and in downstream fine-tuning (e.g., VQA, VCR).
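The steps listed above can be sketched on a toy model. The block below is a minimal illustration under stated assumptions, not the authors' implementation: a single linear layer plus softmax stands in for the V+L model, the input is a concatenation of image-region and word embeddings, and the names `train_step`, `delta_grad`, `adv_loss`, as well as all hyperparameter values (`eps`, `step`, `K`, `alpha`, `lr`), are hypothetical. It shows a norm-bounded perturbation restricted to one modality, a symmetric KL term between clean and perturbed predictions, and parameter gradients accumulated over K ascent steps before one update.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, x):
    # Toy stand-in for a V+L model (assumption): one linear layer, then softmax.
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

def sym_kl(p, q):
    # KL(p||q) + KL(q||p), written as a single sum.
    return sum((a - b) * (math.log(a) - math.log(b)) for a, b in zip(p, q))

def ce_grad(p, y):
    # Gradient of cross-entropy w.r.t. the logits that produced p.
    return [a - (1.0 if c == y else 0.0) for c, a in enumerate(p)]

def sym_kl_grad(p, q):
    # Gradient of sym_kl(p, q) w.r.t. the logits that produced p (q held fixed).
    g = [math.log(a) - math.log(b) + 1.0 - b / a for a, b in zip(p, q)]
    mean_g = sum(a * v for a, v in zip(p, g))
    return [a * (v - mean_g) for a, v in zip(p, g)]

def logit_grad(p, q, y, alpha):
    # Gradient of CE(p, y) + alpha * sym_kl(p, q) w.r.t. the logits behind p.
    return [c + alpha * k for c, k in zip(ce_grad(p, y), sym_kl_grad(p, q))]

def adv_loss(W, x, delta, y, alpha):
    # Adversarial cross-entropy plus the KL regularizer for a given perturbation.
    p_clean = forward(W, x)
    p_adv = forward(W, [a + d for a, d in zip(x, delta)])
    return -math.log(p_adv[y]) + alpha * sym_kl(p_adv, p_clean)

def delta_grad(W, x, delta, y, alpha, idx):
    # Gradient of adv_loss w.r.t. delta, zeroed outside the perturbed modality.
    p_clean = forward(W, x)
    p_adv = forward(W, [a + d for a, d in zip(x, delta)])
    gz = logit_grad(p_adv, p_clean, y, alpha)
    return [sum(W[c][d] * gz[c] for c in range(len(W))) if d in idx else 0.0
            for d in range(len(x))]

def train_step(W, x, y, idx, eps=0.5, step=0.2, K=3, alpha=1.0, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Small random start, only on the coordinates of the chosen modality.
    delta = [rng.uniform(-0.01, 0.01) if d in idx else 0.0 for d in range(len(x))]
    acc = [[0.0] * len(x) for _ in W]  # parameter gradient accumulated over K steps
    for _ in range(K):
        x_adv = [a + d for a, d in zip(x, delta)]
        p_clean, p_adv = forward(W, x), forward(W, x_adv)
        g_clean = logit_grad(p_clean, p_adv, y, alpha)
        g_adv = logit_grad(p_adv, p_clean, y, alpha)
        for c in range(len(W)):
            for d in range(len(x)):
                acc[c][d] += (g_clean[c] * x[d] + g_adv[c] * x_adv[d]) / K
        # Normalized gradient ascent on delta, then projection onto the eps-ball.
        g = delta_grad(W, x, delta, y, alpha, idx)
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        delta = [d + step * v / norm for d, v in zip(delta, g)]
        size = math.sqrt(sum(d * d for d in delta))
        if size > eps:
            delta = [d * eps / size for d in delta]
    # Single parameter update using the accumulated gradient.
    W_new = [[w - lr * a for w, a in zip(rw, ra)] for rw, ra in zip(W, acc)]
    return W_new, delta

# Usage: perturb one modality at a time while the other stays clean.
_rng = random.Random(1)
D_IMG, D_TXT, C = 4, 3, 3
W = [[_rng.gauss(0, 0.5) for _ in range(D_IMG + D_TXT)] for _ in range(C)]
x = [_rng.gauss(0, 1) for _ in range(D_IMG + D_TXT)]
W, d_img = train_step(W, x, y=1, idx=range(0, D_IMG))              # image features only
W, d_txt = train_step(W, x, y=1, idx=range(D_IMG, D_IMG + D_TXT))  # word embeddings only
```

In a real setting the analytic gradients would come from automatic differentiation over a transformer, and the perturbation would be applied per region or per token; the toy only mirrors the control flow described in the list.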

Experimental Results

Research Questions

RQ1: Can embedding-space adversarial perturbations improve generalization of vision-and-language models across diverse tasks?
RQ2: Does adversarial pre-training plus adversarial fine-tuning yield additive gains over standard training for V+L models?
RQ3: Is perturbing image features or text embeddings (or both) more beneficial in large-scale V+L training?
RQ4: How does a “free” adversarial training regime impact training efficiency and performance at scale?

Key Findings

VILLA consistently improves on the state of the art across six V+L tasks when applied to UNITER (base and large), and improves LXMERT when applied to its fine-tuning stage.
VILLA-base improves over UNITER-base by +0.76 on VQA and +2.4 on VCR Q→AR; VILLA-large also improves VQA and VCR, with larger gains (e.g., VCR Q→AR +2.9).
Adversarial pre-training and adversarial fine-tuning each boost performance on their own; combining both yields the largest gains.
Perturbing only image features or only text embeddings both provide substantial improvements, with image perturbations offering notable gains in several tasks.
VILLA outperforms FreeLB in ablations and yields stronger probing results for multimodal alignment (e.g., higher attention to visual coreference and visual relations).
Applying VILLA to LXMERT (fine-tuning only) yields average gains of about +0.88 across VQA, GQA, and NLVR2.


This review was generated by AI and reviewed by human editors.