QUICK REVIEW

[Paper Review] Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Feng Li, Hao Zhang | arXiv (Cornell University) | Mar 3, 2022

Multimodal Machine Learning Applications | Cited by 30

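One-sentence summary

This paper surveys vision-language intelligence across three eras (task-specific methods, vision-language pre-training, and large-scale models) and outlines the core components and future directions.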

ABSTRACT

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension. We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review key components of the model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero or few shot learning tasks. Finally, we discuss some potential future trends towards modality cooperation, unified representation, and knowledge incorporation. We believe that this review will be of help for researchers and practitioners of AI and ML, especially those interested in computer vision and natural language processing.

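Motivation and objectives

Trace the three historical stages of vision-language (VL) learning: task-specific methods, vision-language pre-training, and larger models trained on large-scale weakly labeled data.
Analyze core VL tasks (such as image captioning, VQA, and image-text matching) and how they developed.
Explain the architectural and training components of vision-language pre-training, namely visual embedding (VE), textual embedding (TE), and modality fusion (MF), along with model trends.
Discuss how large-scale data and weak supervision enable zero-shot and few-shot generalization.
Outline future trends in modality cooperation, unified representation, and knowledge incorporation.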

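Proposed approach

Review task-specific VL problems and summarize their inputs/outputs, datasets, evaluation metrics, and mainstream methods.
Explain the vision-language pre-training (VLP) paradigm and its key components (visual and textual embedding, modality fusion, Transformer-based training).
Discuss single-stream and dual-stream VLP architectures and cross-modal attention mechanisms.
Describe how large-scale image-text data and contrastive learning yield language-aligned visual representations.
Summarize the role of pre-training in enabling transfer to downstream tasks and zero-shot/few-shot capabilities.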
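
As a rough illustration of the single-stream versus dual-stream distinction mentioned above, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the architecture of any model covered by the survey: the attention layer has no learned projections (queries, keys, and values are the token vectors themselves), and the function names `self_attention`, `single_stream`, and `dual_stream` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy scaled dot-product self-attention.

    Simplification (assumption): no learned query/key/value projections,
    so each token vector serves as its own query, key, and value.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens))
             for d in range(dim)]
        )
    return outputs


def single_stream(image_tokens, text_tokens):
    """One shared encoder over the concatenated sequence:
    every token can attend to tokens of both modalities."""
    return self_attention(image_tokens + text_tokens)


def dual_stream(image_tokens, text_tokens):
    """Separate encoders per modality: image tokens never see text tokens
    here (a fusion step could be added afterwards)."""
    return self_attention(image_tokens), self_attention(text_tokens)


image_tokens = [[1.0, 0.0], [0.0, 1.0]]   # e.g. two region features
text_tokens = [[1.0, 1.0]]                # e.g. one word embedding

fused = single_stream(image_tokens, text_tokens)
image_out, text_out = dual_stream(image_tokens, text_tokens)
```

In this sketch, changing the text tokens alters the image-token outputs of `single_stream` but leaves those of `dual_stream` untouched, which is the practical difference between fusing modalities inside one encoder and encoding them separately.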

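Experimental results

Research questions

RQ1: What are the main task-specific VL problems, and how have they evolved?
RQ2: How do vision-language pre-training models learn joint representations, and what architectural patterns do they follow?
RQ3: How does large-scale weakly labeled image-text data affect zero-shot and few-shot generalization?
RQ4: What are the potential trends in modality cooperation, unified representation, and knowledge incorporation?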

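Key findings

VL research has passed through three stages: task-specific methods, VLP-based joint representations, and large models trained on weakly labeled data.
VLP models aim to learn object-level, language-aligned, and semantically rich visual representations through pre-training.
Transformer-based architectures, together with cross-modal masking and training, drove the success of VL pre-training.
Large-scale image-text data and contrastive learning underpin strong zero-shot and few-shot capabilities.
Model architectures typically fall into dual-stream designs (separate VE/TE with optional fusion) and single-stream designs (a unified encoder).
Region-based features (e.g., from Faster R-CNN) and attention mechanisms substantially improve VL tasks such as VQA and image captioning.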
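
To make the link between contrastive learning and zero-shot prediction more concrete, the following plain-Python sketch shows a symmetric contrastive loss over a batch of image-text pairs and a similarity-based zero-shot classifier. It is an illustration under assumptions, not the procedure of any specific model in the survey: embeddings are plain lists of floats, the temperature `0.07` is an arbitrary illustrative value, and all function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by a temperature."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image-to-text and text-to-image directions averaged.
    Pair i (image i, text i) is treated as the only positive for row i."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    logits_transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


def zero_shot_classify(image_emb, prompt_embs, labels):
    """Pick the label whose text-prompt embedding is most similar to the image."""
    scores = similarity_matrix([image_emb], prompt_embs)[0]
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best]


# Hand-made 2-D embeddings standing in for encoder outputs (assumption).
aligned_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
prediction = zero_shot_classify([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], ["cat", "dog"])
```

With these toy vectors the correctly paired batch gets a much lower loss than the shuffled one, and the classifier needs no task-specific training: new classes are added simply by embedding new text prompts.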


This review was generated by AI and reviewed by human editors.