QUICK REVIEW

[Paper Digest] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai|arXiv (Cornell University)|Aug 24, 2023

Multimodal Machine Learning Applications | Cited by 132

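One-Sentence Summary

Qwen-VL is a family of multilingual vision-language models built on Qwen-7B. It pairs a visual encoder with a position-aware adapter, achieves state-of-the-art performance among similarly sized generalist models across diverse vision-centric tasks, supports multi-image input and grounding, and includes an instruction-tuned Chat variant.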

ABSTRACT

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

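Motivation and Objectives

Advance open-source LVLMs that can perceive and understand both images and text.
Introduce a compact visual receptor and a three-stage training pipeline to build Qwen-VL from Qwen-7B.
Enable fine-grained visual understanding, including grounding and OCR, through bounding-box annotations.
Deliver Qwen-VL and Qwen-VL-Chat with multilingual support and multi-image input for real-world dialog.
Demonstrate competitive or state-of-the-art performance on a broad range of vision-language benchmarks.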

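Proposed Method

Use the Qwen-7B large language model as the foundation.
Add a Vision Transformer (ViT) visual encoder, initialized from OpenCLIP's ViT-bigG.
Incorporate a position-aware VL adapter that uses cross-attention with trainable query vectors to compress the image features into a fixed-length sequence of 256.
Provide special tokens that delimit image features and bounding-box text for grounding tasks.
Train in three stages: Stage 1 pre-trains on large-scale image-text pairs with the LLM frozen; Stage 2 performs multi-task pre-training with higher resolution and interleaved data; Stage 3 applies instruction fine-tuning to produce Qwen-VL-Chat.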
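
The adapter's compression step can be illustrated with a minimal sketch: a fixed set of query vectors attends over a variable number of image patch features, so the output length always equals the number of queries. This is a toy single-head version in plain Python, not the Qwen-VL implementation. It omits the learned projections and the positional encodings that make the real adapter position-aware, and the feature width `d`, the patch count, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_compress(image_feats, queries):
    """Single-head cross-attention without learned projections.

    image_feats: list of N patch vectors (used as both keys and values)
    queries:     list of Q query vectors (trainable in a real model)
    returns:     list of Q output vectors, regardless of N
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every patch.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in image_feats]
        weights = softmax(scores)
        # Attention-weighted average of the patch features.
        out = [sum(w * v[j] for w, v in zip(weights, image_feats)) for j in range(d)]
        outputs.append(out)
    return outputs


rng = random.Random(0)
d = 8               # toy feature width (assumption)
num_patches = 1024  # assumed number of ViT patch features for one image
num_queries = 256   # fixed output length, as stated in the summary

image_feats = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_patches)]
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_queries)]

compressed = cross_attention_compress(image_feats, queries)
print(len(compressed), len(compressed[0]))
```

Changing `num_patches` leaves the output length at 256, which is the point of the design: the language model sees a fixed-length visual sequence whatever the input resolution.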

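Experimental Results

Research Questions

RQ1: Can open-source LVLMs at a moderate model scale achieve competitive performance on captioning, VQA, grounding, and text-oriented tasks?
RQ2: Does a high-resolution visual encoder combined with a lightweight VL adapter improve fine-grained perception and grounding?
RQ3: How well do multi-task pre-training and instruction tuning transfer to multilingual, multi-image, and grounding capabilities?
RQ4: How much do grounding and text-reading abilities gain on referring expression comprehension and OCR-related tasks?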

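Key Findings

Qwen-VL and Qwen-VL-Chat achieve top accuracy among models of similar scale on a broad range of vision-centric benchmarks.
Qwen-VL reaches 85.8 CIDEr on zero-shot Flickr30K captioning, surpassing larger models.
Qwen-VL performs strongly on VQA benchmarks (VQAv2 79.5, OKVQA 58.6, GQA 59.3) and on text-oriented VQA (OCR-VQA, TextVQA, DocVQA).
Referring expression comprehension results are state of the art on RefCOCO, RefCOCO+, RefCOCOg, and GRIT.
On selected VL tasks, few-shot in-context learning with Qwen-VL approaches the performance of larger models.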
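
To make the grounding results more concrete, the sketch below shows how a box emitted as text might be written, parsed, and scored. It assumes the string format described in the Qwen-VL paper, with coordinates normalized to integers in [0, 1000) and wrapped in `<box>` and `</box>`, and the common referring-expression criterion that a prediction counts as correct when its IoU with the ground truth is at least 0.5. The helper names and the clamping of the upper edge to 999 are assumptions for illustration, not the authors' evaluation code.

```python
import re

# Matches "<box>(x1,y1),(x2,y2)</box>" with integer coordinates.
BOX_PATTERN = re.compile(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")


def format_box(x1, y1, x2, y2, width, height, bins=1000):
    """Normalize pixel coordinates to integers in [0, bins) and wrap them in box tokens."""
    def norm(value, size):
        # Clamp so that a coordinate on the far image edge stays below `bins` (assumption).
        return min(bins - 1, int(value * bins // size))
    return "<box>({},{}),({},{})</box>".format(
        norm(x1, width), norm(y1, height), norm(x2, width), norm(y2, height)
    )


def parse_box(text):
    """Return (x1, y1, x2, y2) from the first box in `text`, or None if absent."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(g) for g in match.groups())


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ground-truth box given in pixels on a 640x480 image.
target_text = format_box(64, 48, 320, 240, width=640, height=480)
target = parse_box(target_text)

# A hypothetical model response containing a predicted box.
prediction = parse_box("the dog <box>(150,150),(500,500)</box>")

score = iou(prediction, target)
print(target_text, score, score >= 0.5)
```

Because the boxes are plain text, the same parsing step works on any generated response, and a response without a well-formed box simply yields `None` and can be counted as a miss.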

This digest was generated by AI and reviewed by human editors.