QUICK REVIEW

[Paper Review] Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen | arXiv.org | Feb 19, 2025

Semiconductor Lasers and Optical Devices | Cited by 45

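One-sentence summary

Qwen2.5-VL is a flagship vision-language model featuring native dynamic resolution, absolute-time temporal encoding, and a windowed-attention ViT encoder, with strong document parsing, grounding, and long-video capabilities; it is available in three sizes.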

ABSTRACT

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

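Research motivation and goals

Improve the fine-grained perception of LVLMs and build a robust vision model with agentic capabilities.
Improve efficiency and scalability through native-resolution processing and window attention in the vision encoder.
Make document parsing, grounding, and long-video understanding more robust, with accurate event localization in video.
Strengthen temporal modeling with MRoPE aligned to absolute time and dynamic FPS sampling.
Scale up the pre-training data and apply rigorous data filtering and post-training alignment to improve generalization.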

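Proposed method

A redesigned Vision Transformer that uses window attention, runs at native resolution, and reduces computation.
Native dynamic resolution and dynamic FPS sampling to accommodate images of varying sizes and long videos.
An extension of Multimodal Rotary Position Embedding (MRoPE) that aligns temporal IDs with absolute time for better temporal learning.
The ViT is pre-trained from scratch and later fine-tuned together with a large LLM, for a total of 4.1T tokens and a sequence length of up to 32,768.
Post-training alignment through supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on multimodal instruction data.
A data curation and filtering pipeline comprising domain-specific QA classification, rule-based and model-based filtering, and rejection sampling to enhance reasoning.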
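
To make the absolute-time alignment concrete, here is a minimal sketch of how temporal position IDs might be tied to timestamps rather than to frame indices. It is an illustration under stated assumptions, not the paper's implementation: the function names and the `ids_per_second` granularity are hypothetical, and the real MRoPE also carries height and width components that are omitted here.

```python
import math


def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Return the timestamps (in seconds) of frames sampled at a given FPS."""
    n_frames = int(duration_s * fps)
    return [i / fps for i in range(n_frames)]


def temporal_ids(timestamps: list[float], ids_per_second: float = 2.0) -> list[int]:
    """Map each frame timestamp to a temporal ID proportional to absolute time.

    ids_per_second is a hypothetical granularity chosen for illustration only.
    """
    return [math.floor(t * ids_per_second + 0.5) for t in timestamps]


def index_ids(timestamps: list[float]) -> list[int]:
    """Baseline for comparison: IDs that only count frames and ignore time."""
    return list(range(len(timestamps)))


# The same 2-second clip sampled at two different frame rates.
fast = sample_timestamps(2.0, fps=2.0)  # frames at 0.0, 0.5, 1.0, 1.5 s
slow = sample_timestamps(2.0, fps=1.0)  # frames at 0.0, 1.0 s

print(temporal_ids(fast))  # [0, 1, 2, 3]
print(temporal_ids(slow))  # [0, 2]
print(index_ids(slow))     # [0, 1]
```

With time-aligned IDs, the frame at 1.0 s receives the same ID (2) at either sampling rate, so the gap between IDs reflects elapsed time. With index-based IDs the same frame would be numbered 2 in the fast clip and 1 in the slow clip, which hides the actual pace of events.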

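Experimental results

Research questions

RQ1: How does Qwen2.5-VL improve fine-grained visual perception and grounding while preserving its language capabilities?
RQ2: Can native dynamic resolution and absolute-time temporal encoding deliver efficient, accurate multimodal understanding of long videos and documents without task-specific fine-tuning?
RQ3: How do window attention and 2D RoPE affect scalability and performance on image and video inputs?
RQ4: How do diverse, carefully curated pre-training data (up to 4T tokens) and robust post-training alignment affect cross-domain generalization?
RQ5: How capable is Qwen2.5-VL at agentic tasks on computers and mobile devices?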

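Key findings

The model delivers robust grounding and document parsing, with accurate bounding boxes and points and JSON-formatted output.
It supports understanding of very long videos, with second-level event localization and native dynamic resolution.
The three model sizes (3B, 7B, 72B) deliver competitive performance; the 72B model matches top-tier models in document and diagram understanding.
The Vision Transformer is trained from scratch with window attention, achieving efficiency without sacrificing native-resolution processing.
Pre-training data grows from 1.2T tokens to about 4T tokens, with dynamic sampling used to balance the computational load.
Post-training alignment combines SFT and DPO to improve instruction following and preference alignment on multimodal tasks.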
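
The efficiency claim for window attention at native resolution can be checked with a back-of-the-envelope count of query-key pairs. The sketch below is only an estimate under assumed values: the 14-pixel patch size and the 8x8-patch window are illustrative parameters that this review does not state, the helper names are hypothetical, and any layers that still use full attention are ignored.

```python
import math


def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int]:
    """Number of patch rows and columns for an image kept at its native size."""
    return math.ceil(height / patch), math.ceil(width / patch)


def full_attention_pairs(rows: int, cols: int) -> int:
    """Query-key pairs when every patch attends to every other patch."""
    n = rows * cols
    return n * n


def window_attention_pairs(rows: int, cols: int, window: int = 8) -> int:
    """Query-key pairs when patches attend only within their own window."""
    total = 0
    for r0 in range(0, rows, window):
        for c0 in range(0, cols, window):
            # Windows on the bottom or right edge may be smaller than window x window.
            tokens = min(window, rows - r0) * min(window, cols - c0)
            total += tokens * tokens
    return total


rows, cols = patch_grid(448, 672)          # (32, 48) patches
print(full_attention_pairs(rows, cols))    # 2359296
print(window_attention_pairs(rows, cols))  # 98304

rows2, cols2 = patch_grid(448, 1344)         # twice as wide: (32, 96) patches
print(full_attention_pairs(rows2, cols2))    # 9437184
print(window_attention_pairs(rows2, cols2))  # 196608
```

Doubling the image width quadruples the full-attention count but only doubles the windowed count, which is why windowing keeps the cost of native-resolution inputs close to linear in the number of patches.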

This review was generated by AI and reviewed by a human editor.