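[Paper Explained] Large-Scale Adversarial Training for Vision-and-Language Representation Learning
VILLA brings large-scale adversarial training to vision-and-language models by applying adversarial perturbations to the image and text embeddings within a two-stage framework (adversarial pre-training followed by adversarial fine-tuning), and achieves state-of-the-art performance on multiple V+L tasks.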
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
Motivation and Objectives
- Motivate robust multimodal pre-training to improve generalization on downstream V+L tasks.
- Propose adversarial training in embedding space rather than pixel/token space for scalability and effectiveness.
- Demonstrate two-stage adversarial training (pre-training and fine-tuning) across multiple V+L architectures.
- Show empirical gains on a wide suite of vision-and-language benchmarks.
Proposed Method
- Perform adversarial perturbations in embedding space for both image and text modalities, adding perturbations to image region features and word embeddings.
- Use a two-stage training regime: task-agnostic adversarial pre-training followed by task-specific adversarial fine-tuning.
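- Adopt the “free” adversarial training strategy to make large-scale training affordable: parameter gradients are accumulated over the multiple PGD steps used to craft the perturbation, so they come at almost no extra cost.
- Add a KL-divergence-based regularizer that keeps the model's predictive distribution on perturbed embeddings close to its distribution on clean embeddings, so that the prediction confidence, and not only the predicted label, stays stable under perturbation.
- Optimize a composite objective that sums the standard task loss on clean inputs, the adversarial training loss on perturbed embeddings, and the KL-based regularization term.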
- Apply adversarial training in both MLM and ITM pre-training tasks and in downstream fine-tuning (e.g., VQA, VCR).
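The composite objective described in the list above can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the two-modality linear softmax classifier, the function names (`predict`, `ascent_step`, `villa_style_loss`), and the parameters `eps` and `alpha` are all assumptions introduced here. It perturbs one modality at a time in embedding space, uses a single normalized gradient-ascent step in place of multi-step PGD, and combines the clean loss, the adversarial loss, and a symmetric KL term.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def predict(W_img, W_txt, x_img, x_txt):
    # Hypothetical stand-in for a V+L model: logits are the sum of two linear maps.
    logits = [a + b for a, b in zip(matvec(W_img, x_img), matvec(W_txt, x_txt))]
    return softmax(logits)

def cross_entropy(p, y):
    return -math.log(p[y])

def sym_kl(p, q):
    # KL(p||q) + KL(q||p), written as sum_i (p_i - q_i) * (log p_i - log q_i).
    return sum((pi - qi) * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def ascent_step(W, p, y, eps):
    # One L2-normalized ascent step on cross-entropy w.r.t. one modality's embedding.
    # For this linear model, d(loss)/dx = W^T (p - onehot(y)).
    err = [pi - (1.0 if k == y else 0.0) for k, pi in enumerate(p)]
    grad = [sum(W[k][j] * err[k] for k in range(len(W))) for j in range(len(W[0]))]
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    return [eps * g / norm for g in grad]

def add(x, d):
    return [a + b for a, b in zip(x, d)]

def villa_style_loss(W_img, W_txt, x_img, x_txt, y, eps=0.1, alpha=1.0):
    p_clean = predict(W_img, W_txt, x_img, x_txt)
    l_std = cross_entropy(p_clean, y)

    # Perturb the image embedding only, then the text embedding only.
    p_img = predict(W_img, W_txt, add(x_img, ascent_step(W_img, p_clean, y, eps)), x_txt)
    p_txt = predict(W_img, W_txt, x_img, add(x_txt, ascent_step(W_txt, p_clean, y, eps)))

    r_at = cross_entropy(p_img, y) + cross_entropy(p_txt, y)  # adversarial loss
    r_kl = sym_kl(p_img, p_clean) + sym_kl(p_txt, p_clean)    # confidence regularizer
    return {"std": l_std, "at": r_at, "kl": r_kl, "total": l_std + r_at + alpha * r_kl}

if __name__ == "__main__":
    W_img = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    W_txt = [[0.2, -0.1, 0.3], [0.0, 0.4, -0.2], [-0.3, 0.1, 0.1]]
    print(villa_style_loss(W_img, W_txt, [0.5, -1.0], [1.0, 0.0, -1.0], y=0))
```

The sketch leaves out the "free" strategy itself: a fuller version would run several ascent steps and accumulate parameter gradients across them. It also crafts the perturbation from the cross-entropy gradient alone, which is a simplification chosen here for brevity.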
Experimental Results
Research Questions
- RQ1: Can embedding-space adversarial perturbations improve generalization of vision-and-language models across diverse tasks?
- RQ2: Does adversarial pre-training plus adversarial fine-tuning yield additive gains over standard training for V+L models?
- RQ3: Is perturbing image features or text embeddings (or both) more beneficial in large-scale V+L training?
- RQ4: How does a “free” adversarial training regime impact training efficiency and performance at scale?
Key Findings
- VILLA consistently advances the state of the art on six V+L tasks when applied to UNITER (base and large), and also improves LXMERT when applied only at its fine-tuning stage.
- VILLA-base improves over UNITER-base by +0.76 on VQA and by +2.4 on VCR Q→AR; VILLA-large also improves both VQA and VCR, with larger gains (e.g., +2.9 on VCR Q→AR).
- Adversarial pre-training and adversarial fine-tuning individually boost performance; combining both yields the largest gains.
- Perturbing only image features or only text embeddings both provide substantial improvements, with image perturbations offering notable gains in several tasks.
- VILLA outperforms FreeLB in ablations and yields stronger probing signals for multimodal alignment (e.g., higher attention to visual coreference and visual relations).
- Applying VILLA to LXMERT (fine-tuning only) yields an average gain of about +0.88 across VQA, GQA, and NLVR2.
This summary was generated by AI and reviewed by a human editor.