QUICK REVIEW

[Paper Review] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai|arXiv (Cornell University)|Sep 18, 2024

Categorization, perception, and language|Cited by 71

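One-Sentence Summary

Qwen2-VL introduces Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-RoPE) to process images and videos at varying resolutions, scales LVLMs up to 72B parameters, and achieves strong results on multimodal benchmarks, including video understanding and multilingual OCR.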

ABSTRACT

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL .
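As a rough illustration of how a dynamic-resolution scheme maps image size to a token count, the sketch below assumes a ViT patch size of 14, a 2x2 merge of adjacent patch tokens, and two boundary tokens per image. None of these values is stated in this review, and `visual_token_count` with its round-up rule is a hypothetical helper, not the released preprocessing code.

```python
import math


def visual_token_count(height, width, patch_size=14, merge=2, boundary_tokens=2):
    """Estimate how many visual tokens an image contributes to the LLM input.

    Assumptions of this sketch (not taken from the review text):
    - the ViT splits the image into patch_size x patch_size patches;
    - each merge x merge group of adjacent patch tokens is compressed to one;
    - two boundary tokens mark the start and end of the image;
    - each side is rounded up to a whole number of merged patches.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    unit = patch_size * merge  # pixels covered by one merged token per side
    grid_h = math.ceil(height / unit)
    grid_w = math.ceil(width / unit)
    return grid_h * grid_w + boundary_tokens


if __name__ == "__main__":
    for size in [(224, 224), (448, 448), (1000, 500)]:
        print(size, visual_token_count(*size))
```

Under these assumptions a 224x224 image contributes 66 tokens and a 448x448 image contributes 258, so the token budget grows with the input resolution instead of being fixed in advance.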

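Research Motivation and Goals

Break the fixed-resolution bottleneck in vision-language models so that visual processing more closely matches the scales of human perception.
Develop a unified image-video multimodal framework that handles inputs at varying resolutions.
Study the scaling behavior of LVLMs by varying model size (2B, 8B, 72B) and the amount of training data.
Strengthen positional encoding so that text, image, and video information is fused effectively across modalities.
Demonstrate multilingual, OCR, document understanding, video understanding, and agent capabilities in a single model.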

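Proposed Method

Introduce Naive Dynamic Resolution, which uses 2D-RoPE to convert images of arbitrary resolution into a dynamic number of visual tokens.
Replace absolute 2D position embeddings with 2D Rotary Position Embedding (2D-RoPE) to capture spatial information.
Propose Multimodal RoPE (M-RoPE), which decomposes the rotary embedding into temporal, height, and width components for multimodal fusion.
Use a unified image-video training scheme that combines 3D convolutions with frame sampling to handle long videos while staying within a token limit.
Adopt a three-stage training pipeline (ViT pre-training, full-model unfreezing, LLM instruction fine-tuning) on diverse multimodal datasets with data up to June 2023.
Apply a unified Qwen2-VL architecture with a 675M-parameter Vision Transformer backbone across LLMs at the 2B, 7B, and 72B scales.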
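The temporal/height/width decomposition behind M-RoPE can be pictured with a small position-ID sketch. It assumes that text tokens share one index across all three components, that vision tokens index the frame, row, and column separately, and that each segment starts one past the largest ID used so far. The function name `mrope_position_ids` and the segment format are hypothetical, chosen only for this illustration.

```python
def mrope_position_ids(segments):
    """Assign (temporal, height, width) position IDs to a mixed token sequence.

    segments is a list of tuples (a format invented for this sketch):
      ("text", n)          -> n text tokens
      ("vision", t, h, w)  -> t frames, each an h x w grid of visual tokens
                              (an image is the case t == 1)
    """
    ids = []
    start = 0  # first position ID available to the next segment
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            # Text: all three components carry the same 1D index.
            ids.extend((start + i, start + i, start + i) for i in range(n))
            start += n
        elif seg[0] == "vision":
            _, frames, grid_h, grid_w = seg
            for t in range(frames):
                for h in range(grid_h):
                    for w in range(grid_w):
                        ids.append((start + t, start + h, start + w))
            # Assumption: the next segment continues after the largest ID used.
            start += max(frames, grid_h, grid_w)
        else:
            raise ValueError(f"unknown segment type: {seg[0]!r}")
    return ids


if __name__ == "__main__":
    sequence = [("text", 2), ("vision", 1, 2, 2), ("text", 1)]
    for triple in mrope_position_ids(sequence):
        print(triple)
```

For a single image the temporal component stays constant while height and width vary, and for a video it advances once per frame. A rotary embedding would then rotate separate slices of the hidden dimension by each component, a step this sketch leaves out.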

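Experimental Results

Research Questions

RQ1: How does dynamic resolution affect visual-token efficiency and model perception across resolutions?
RQ2: Can M-RoPE and 2D-RoPE improve cross-modal fusion among text, images, and videos?
RQ3: How does LVLM accuracy on multimodal benchmarks scale as model size and data volume increase?
RQ4: Can a unified image-video framework reach state-of-the-art results on OCR, document understanding, and video understanding tasks?
RQ5: How do the model's multilingual and OCR capabilities compare with existing LVLMs on public and in-house benchmarks?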

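Key Findings

Qwen2-VL-72B achieves results competitive with leading models such as GPT-4o and Claude3.5-Sonnet on multimodal benchmarks.
Qwen2-VL reaches state-of-the-art performance on DocVQA, InfoVQA, TextVQA, and OCRBench.
The model performs strongly on multilingual OCR and video understanding, surpassing many generalist LVLMs on MTVQA and in-house benchmarks.
Document and chart reading tasks show marked gains on OCR-related metrics.
On video understanding benchmarks, the 72B model delivers top results on several tasks.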

This review was generated by AI and checked by human editors.