QUICK REVIEW

[Paper Review] Vision-Language Models for Vision Tasks: A Survey

Jingyi Zhang, Jiaxing Huang|arXiv (Cornell University)|Apr 3, 2023

Multimodal Machine Learning Applications | Cited by 34

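One-Sentence Summary

A systematic survey of vision-language models (VLMs) for visual recognition, covering network architectures, pre-training objectives, datasets, transfer learning, and knowledge distillation, together with benchmarks and future research directions.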

ABSTRACT

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learns rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enables zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in the future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.

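Research Motivation and Objectives

Explain how visual recognition paradigms evolved from traditional approaches to vision-language pre-training.
Summarize VLM architectures, pre-training objectives, and the downstream tasks used to evaluate zero-shot capability.
Review the large-scale image-text datasets and evaluation benchmarks used for VLMs.
Categorize VLM pre-training, transfer learning, and knowledge distillation methods.
Highlight the challenges and future directions of VLM research for visual recognition.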

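Approach

Categorizes the image encoders used to learn image features into CNN-based and Transformer-based architectures.
Describes the standard Transformer-based language encoders used to learn text features.
Groups pre-training objectives into contrastive, generative, and alignment categories, and gives formal losses for them (e.g., InfoNCE, L_IT, L_RW, L_MIM, L_MLM, L_MCM).
Explains zero-shot prediction and linear probing, the two evaluation setups, and the downstream tasks used to evaluate VLMs (classification, detection, segmentation, retrieval, action recognition).
Lists the datasets used for pre-training (e.g., the image-text corpora behind CLIP and ALIGN, and LAION) and for evaluation (e.g., ImageNet, COCO, PASCAL VOC).
Surveys transfer learning and knowledge distillation methods that adapt VLMs to downstream vision tasks.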
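
The contrastive objectives mentioned above can be illustrated with a small sketch. The Python below is a minimal, dependency-free illustration of an InfoNCE loss and of a symmetric image-text contrastive loss formed as the sum of the image-to-text and text-to-image terms. The function names, the temperature value of 0.07, and the toy list-of-floats embeddings are assumptions made for illustration; they are not taken from the survey or from any particular VLM codebase.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss, assuming queries[i] and keys[i] form the positive pair.

    All other keys in the batch act as negatives for queries[i].
    The default temperature is an illustrative assumption.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the positive pair.
        total += log_denominator - logits[i]
    return total / len(q)


def image_text_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image-to-text term plus text-to-image term."""
    return (info_nce(image_embs, text_embs, temperature)
            + info_nce(text_embs, image_embs, temperature))


# Toy embeddings (hypothetical values standing in for encoder outputs).
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
toy_texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = image_text_contrastive_loss(toy_images, toy_texts_aligned)
swapped_loss = image_text_contrastive_loss(toy_images, toy_texts_swapped)
```

With correctly paired embeddings the loss is close to zero, while swapping the captions makes it large; this gap is the signal that contrastive pre-training optimizes.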

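Experimental Results

Research Questions

RQ1: How do VLMs learn vision-language correlation from large-scale image-text data so as to enable zero-shot prediction across visual recognition tasks?
RQ2: Which network architectures and pre-training objectives are best suited to learning cross-modal representations in VLMs?
RQ3: Which datasets are used to pre-train and evaluate VLMs, and how do they affect performance in the zero-shot and linear-probing settings?
RQ4: Which transfer learning and knowledge distillation techniques make the best use of VLMs for downstream tasks such as detection and segmentation?
RQ5: What are the main challenges and future directions in research on vision-language models for visual recognition?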

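Key Findings

Vision-language models learn image-text correlation from web-scale data, which enables zero-shot prediction on a wide range of visual recognition tasks.
VLM pre-training pairs an image encoder with a text encoder and uses contrastive, generative, and alignment objectives to learn cross-modal representations.
Large-scale image-text datasets (e.g., the corpora behind CLIP and ALIGN, and LAION), together with auxiliary data, support VLM pre-training and evaluation across many tasks.
Transfer learning and knowledge distillation are important directions for adapting VLMs to downstream tasks beyond zero-shot use.
The survey provides benchmarks across datasets and discusses the challenges and future research directions of VLM-based visual recognition.
Well-known VLMs such as CLIP deliver strong zero-shot results on 36 visual recognition tasks.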
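
Zero-shot prediction, as referenced in the findings above, is commonly implemented by comparing an image embedding against the text embeddings of class prompts and picking the closest one. The following Python sketch assumes the embeddings have already been produced by some pre-trained image and text encoders; the prompt strings, the toy vectors, the temperature, and the function names are all hypothetical stand-ins, not outputs of a real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, class_text_embs, temperature=0.01):
    """Return the best-matching prompt and a softmax distribution over prompts.

    class_text_embs maps each class prompt to its text embedding.
    The temperature default is an illustrative assumption.
    """
    prompts = list(class_text_embs)
    logits = [cosine_similarity(image_emb, class_text_embs[p]) / temperature
              for p in prompts]
    # Numerically stable softmax over the candidate classes.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = {p: e / total for p, e in zip(prompts, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings (hypothetical values, not produced by an actual encoder).
toy_class_text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
toy_image_emb = [0.8, 0.2, 0.1]

predicted_prompt, prompt_probs = zero_shot_predict(toy_image_emb, toy_class_text_embs)
```

No classifier is trained here: adding a new class only requires adding a new prompt and its text embedding, which is what distinguishes zero-shot prediction from linear probing, where a classifier is fitted on frozen features.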


This review was generated by AI and checked by human editors.