QUICK REVIEW

[Paper Review] Qwen2.5-Omni Technical Report

Jin Xu, Zihan Guo | arXiv.org | 2025-03-26

Embedded Systems and FPGA Design | Citations: 5

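One-Line Summary

Qwen2.5-Omni is an end-to-end multimodal model that processes text, images, audio, and video, and generates streaming text and speech using a Thinker-Talker architecture, block-wise streaming encoders, and TMRoPE position embeddings.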

ABSTRACT

In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

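Motivation and Objectives

Motivate and develop a unified omni-model that perceives multiple modalities in real time.
Propose an architecture and encoding scheme that fuse modalities through shared attention.
Enable streaming generation of text and natural speech with minimal latency.
Demonstrate end-to-end training and inference for multimodal tasks.
Benchmark performance on text, speech, and multimodal evaluation suites.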

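Proposed Method

Introduce TMRoPE (Time-aligned Multimodal RoPE) to encode temporal alignment between the audio and video modalities.
Adopt a Thinker-Talker architecture in which the Thinker generates text and the Talker autoregressively produces streaming speech from the Thinker's representations.
Implement block-wise streaming processing in the audio and visual encoders to support prefilling and reduce initial latency.
Use a DiT-based sliding-window streaming codec with Flow-Matching to convert tokens into waveforms while restricting the receptive field.
Pre-train in three stages, initializing from existing Qwen components and scaling multimodal data to long sequences.
Stabilize speech generation and improve its naturalness through training on instruction-following data (ChatML) and reinforcement learning.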
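
The time-alignment idea behind TMRoPE and the interleaved audio-video ordering can be illustrated with a small sketch. It is not the model's implementation: the function name `interleave_time_aligned`, the 40 ms tick, the 2-second chunk length, and the video-before-audio ordering inside a chunk are all assumptions chosen for illustration, and only the temporal component of the position is modeled (any spatial components are omitted). The point it shows is that audio and video tokens captured at the same moment receive the same temporal position.

```python
from typing import List, Tuple

TICK_MS = 40      # assumed duration covered by one temporal position
CHUNK_MS = 2000   # assumed length of one interleaving chunk


def interleave_time_aligned(
    video_ms: List[int],
    audio_ms: List[int],
    tick_ms: int = TICK_MS,
    chunk_ms: int = CHUNK_MS,
) -> List[Tuple[str, int]]:
    """Return (modality, temporal_position) pairs in interleaved order.

    Timestamps are integer milliseconds. Tokens are grouped into chunks of
    `chunk_ms`; inside each chunk, video tokens come first and audio tokens
    follow (an assumed ordering). The temporal position is derived from the
    timestamp alone, so it is shared across modalities.
    """
    tokens = [("video", t) for t in video_ms] + [("audio", t) for t in audio_ms]
    rank = {"video": 0, "audio": 1}
    # Sort by chunk index, then modality, then time inside the chunk.
    tokens.sort(key=lambda tok: (tok[1] // chunk_ms, rank[tok[0]], tok[1]))
    return [(modality, t // tick_ms) for modality, t in tokens]


sequence = interleave_time_aligned(
    video_ms=[0, 1000, 2000, 3000],
    audio_ms=[0, 400, 800, 2000, 2400],
)
for modality, position in sequence:
    print(modality, position)
```

In this toy input, the video frame and the audio frame at 2000 ms both map to temporal position 50, even though they sit at different places in the token sequence.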

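Experimental Results

Research Questions

RQ1: Can a single model effectively perceive and fuse text, audio, image, and video information in real time?
RQ2: Can streaming text and speech generation be achieved jointly without mutual interference?
RQ3: Which architectural and training strategies minimize initial latency while maintaining high performance across tasks?
RQ4: How does the model perform on multimodal benchmarks compared with similarly sized single-modality models?
RQ5: What effect do temporal alignment and interleaving have on video-audio understanding?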

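Key Results

Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench.
The model's end-to-end speech instruction following is comparable to its performance with text inputs on benchmarks such as MMLU and GSM8K.
Speech generation with the streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
Qwen2.5-Omni performs competitively with, or better than, similarly sized models on text, audio, image, and video tasks.
The block-wise streaming encoders and the sliding-window DiT-based codec reduce the initial latency of streaming audio output.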
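
The latency result in the last item rests on restricting which positions each token may attend to. The sketch below builds such a mask in plain Python; it is a hypothetical illustration, not the paper's code. The helper `block_window_mask` and its parameters `block_size`, `lookback`, and `lookahead` are assumed names. With both windows set to zero the mask is block-diagonal, as in block-wise encoding; with non-zero windows each block also sees a limited number of neighboring blocks, as in a sliding-window decoder with a bounded receptive field.

```python
from typing import List


def block_window_mask(
    n_tokens: int,
    block_size: int,
    lookback: int = 0,
    lookahead: int = 0,
) -> List[List[bool]]:
    """Build a boolean attention mask; mask[q][k] is True if q may attend to k.

    Tokens are grouped into consecutive blocks of `block_size`. A query in
    block b may attend to every key whose block index lies in
    [b - lookback, b + lookahead].
    """
    mask = []
    for q in range(n_tokens):
        q_block = q // block_size
        row = []
        for k in range(n_tokens):
            k_block = k // block_size
            row.append(q_block - lookback <= k_block <= q_block + lookahead)
        mask.append(row)
    return mask


# Block-wise encoding: attention stays inside each block of 2 tokens.
blockwise = block_window_mask(n_tokens=6, block_size=2)

# Sliding window over blocks: 2 blocks of history and 1 block of lookahead
# (illustrative values).
sliding = block_window_mask(n_tokens=8, block_size=2, lookback=2, lookahead=1)
```

Because no row of the mask depends on blocks beyond the lookahead window, output for a block can be produced as soon as that window has arrived, rather than after the full sequence is available.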

This review was generated by AI and reviewed by a human editor.