QUICK REVIEW

[Paper Review] Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen | arXiv.org | 2025-02-19

Semiconductor Lasers and Optical Devices | Citations: 45

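One-Line Summary

Qwen2.5-VL is a flagship vision-language model that combines native dynamic resolution, absolute-time temporal encoding, and a windowed-attention ViT encoder with strong document understanding, grounding, and long-video capabilities. It is available in three sizes.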

ABSTRACT

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

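Research Motivation and Goals

Build a robust, agent-capable vision model by improving high-resolution perception in large vision-language models (LVLMs).
Improve efficiency and scalability through native-resolution processing and window attention in the vision encoder.
Support robust document parsing, grounding, and long-video understanding with precise event localization.
Strengthen temporal modeling through absolute-time-aligned MRoPE and dynamic FPS sampling.
Scale up pre-training data and implement robust data curation and post-training alignment to improve generalization.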
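
The idea of aligning temporal IDs with absolute time under dynamic FPS sampling can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each sampled frame gets a temporal ID equal to its timestamp multiplied by a fixed number of IDs per second. The helper name `temporal_position_ids` and the parameter `ids_per_second` (including its default of 2.0) are hypothetical.

```python
def temporal_position_ids(num_frames: int, fps: float, ids_per_second: float = 2.0) -> list[int]:
    """Map each sampled frame to a temporal ID proportional to its timestamp.

    Hypothetical helper: `ids_per_second` is an illustrative constant, not a
    value taken from the paper or this review.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    ids = []
    for frame_index in range(num_frames):
        timestamp_s = frame_index / fps  # absolute time of this frame in seconds
        ids.append(round(timestamp_s * ids_per_second))
    return ids


# Two clips sampled at different rates: frames taken at the same wall-clock
# time receive the same ID, so ID gaps reflect elapsed time, not frame count.
slow = temporal_position_ids(num_frames=4, fps=1.0)  # frames at 0, 1, 2, 3 s
fast = temporal_position_ids(num_frames=8, fps=2.0)  # frames at 0, 0.5, ..., 3.5 s
print(slow)  # [0, 2, 4, 6]
print(fast)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With frame-index IDs, the frame at 1 s would be position 1 in the slow clip and position 2 in the fast clip. Under this time-based scheme both get ID 2, which is the property the absolute-time alignment is meant to provide.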

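Proposed Method

A redesigned Vision Transformer that operates at native resolution and uses window attention to reduce computation.
Native dynamic resolution and dynamic FPS sampling to handle images of varying sizes and long videos.
An extension of Multimodal Rotary Position Embedding (MRoPE) that aligns temporal IDs with absolute time to improve temporal learning.
Pre-training the ViT from scratch and then fine-tuning with a large LLM, reaching up to 4.1T tokens and a sequence length of 32,768 in the later stages.
Post-training alignment through SFT (supervised fine-tuning) and DPO (direct preference optimization) on multimodal instruction data.
A data curation and filtering pipeline that includes domain-specific QA classification, rule-based and model-based filtering, and rejection sampling for enhanced reasoning.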
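
Native dynamic resolution means the number of visual tokens follows the image size rather than a fixed input shape. The sketch below estimates that count under assumptions this review does not state: a 14-pixel patch, a 2 x 2 patch merge before the LLM, and rounding each side to the nearest multiple of 28 pixels. Treat these defaults and the helper name `visual_token_count` as illustrative, not as the model's confirmed preprocessing.

```python
def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Estimate how many visual tokens an image yields at native resolution.

    Assumptions (illustrative defaults, not stated in this review): each side
    is rounded to a multiple of patch * merge, the image is split into
    patch x patch patches, and every merge x merge group becomes one token.
    """
    unit = patch * merge
    # Integer round-half-up to the nearest multiple of `unit`, minimum one unit.
    h = max(unit, (height + unit // 2) // unit * unit)
    w = max(unit, (width + unit // 2) // unit * unit)
    return (h // unit) * (w // unit)


small = visual_token_count(224, 224)    # 8 x 8 merged cells
large = visual_token_count(1092, 1932)  # 39 x 69 merged cells
print(small, large)  # 64 2691
```

Under these assumptions the token count grows with pixel area, which is why the method pairs native resolution with window attention to keep the encoder's cost manageable.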

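Experimental Results

Research Questions

RQ1: How can Qwen2.5-VL improve fine-grained visual perception and grounding while preserving its language capabilities?
RQ2: Do native dynamic resolution and absolute-time temporal encoding enable efficient and accurate multimodal understanding of long videos and documents without task-specific tuning?
RQ3: How do window attention and 2D RoPE affect scalability and performance on image and video inputs?
RQ4: How do diverse, curated pre-training data (up to 4T tokens) and robust post-training alignment affect cross-domain generalization?
RQ5: What is the potential of Qwen2.5-VL for agentic tasks on computers and mobile devices?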
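
The scalability question in RQ3 rests on how window attention limits which patches can attend to each other. The following is a minimal sketch of that restriction, not the model's actual encoder: it builds a boolean mask over a small patch grid in which two patches may attend to each other only if they fall in the same square window. The grid size, window size, and the function name `window_attention_mask` are illustrative assumptions. 2D RoPE is not covered here.

```python
def window_attention_mask(grid_h: int, grid_w: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] = True when patch i may attend to patch j.

    Patches are indexed row-major on a grid_h x grid_w grid; two patches may
    attend to each other only if they lie in the same window x window tile.
    """
    def tile_of(index: int) -> tuple[int, int]:
        row, col = divmod(index, grid_w)
        return row // window, col // window

    n = grid_h * grid_w
    tiles = [tile_of(i) for i in range(n)]
    return [[tiles[i] == tiles[j] for j in range(n)] for i in range(n)]


mask = window_attention_mask(grid_h=4, grid_w=4, window=2)
allowed_pairs = sum(sum(row) for row in mask)  # pairs kept by window attention
full_pairs = len(mask) ** 2                    # pairs under full attention
print(allowed_pairs, full_pairs)  # 64 256
```

For a fixed window size, the number of allowed pairs grows linearly with the number of patches, while full attention grows quadratically. That difference is the efficiency argument behind the windowed ViT.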

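Key Results

The model achieves strong grounding and document parsing, producing accurate bounding boxes, points, and JSON-formatted output.
It supports ultra-long video understanding with second-level event localization (events pinpointed to the second) and native dynamic resolution.
Three model sizes (3B, 7B, and 72B) deliver competitive performance; the 72B model matches top-tier models in document and diagram understanding.
Training a window-attention ViT from scratch achieves efficiency without giving up native-resolution processing.
Pre-training data is expanded from 1.2T to approximately 4T tokens, and dynamic sampling balances the computational load.
Post-training alignment combines SFT and DPO to improve instruction following and preference alignment on multimodal tasks.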
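
This review does not give the DPO objective used in post-training. As a reference point, the sketch below implements the standard DPO loss for a single preference pair, computed from sequence log-probabilities under the policy and a frozen reference model. The function name, argument names, and the `beta` value are illustrative assumptions, and nothing here reflects the authors' training code.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response. `beta` is an
    illustrative value, not one reported in the review.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed as softplus(-margin) in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


tie = dpo_loss(-12.0, -12.0, -12.0, -12.0)     # policy equals reference
better = dpo_loss(-10.0, -14.0, -12.0, -12.0)  # policy favors the chosen answer
worse = dpo_loss(-14.0, -10.0, -12.0, -12.0)   # policy favors the rejected answer
print(better < tie < worse)  # True
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is the preference-alignment effect this result describes.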


This review was generated by AI and reviewed by a human editor.