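[Paper Explained] Qwen2.5-Omni Technical Report
Qwen2.5-Omni is an end-to-end multimodal model that processes text, images, audio, and video, and generates text and speech in a streaming manner through a Thinker-Talker architecture with block-wise streaming encoders and TMRoPE position embeddings.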
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
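Research Motivation and Goals
- Motivate and develop a unified omni-model that perceives multimodal information in real time.
- Propose an architecture and encoding scheme that fuse modalities through shared attention.
- Achieve low-latency streaming generation of text and natural speech.
- Demonstrate end-to-end training and inference on multimodal tasks.
- Benchmark performance on text, speech, and multimodal evaluation suites.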
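Proposed Method
- Introduce TMRoPE (Time-aligned Multimodal RoPE), which encodes the temporal alignment between the audio and video modalities.
- Adopt the Thinker-Talker architecture, in which the Thinker generates text and the Talker autoregressively produces streaming speech from the Thinker's hidden representations.
- Process inputs block by block in the audio and visual encoders to support prefilling and reduce initial latency.
- Use a DiT-based sliding-window streaming codec with Flow-Matching to convert tokens into waveforms while restricting the receptive field.
- Pre-train in three stages, initializing from existing Qwen components and extending the multimodal data to long sequences.
- Train on instruction-following data in ChatML format, combined with reinforcement learning to stabilize speech generation and improve naturalness.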
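The time alignment and interleaving behind TMRoPE can be illustrated with a minimal sketch. It is not the authors' implementation: the `Token` class, the helper names, the 40 ms step per temporal position, the 2-second chunk length, and the video-before-audio ordering inside a chunk are all assumptions chosen for illustration. The only point it makes is that audio and video tokens with the same timestamp receive the same temporal position ID, and that the two streams are merged chunk by chunk.

```python
from dataclasses import dataclass

MS_PER_POSITION = 40   # assumed: one temporal position ID per 40 ms
CHUNK_MS = 2000        # assumed: audio and video are interleaved in 2 s chunks


@dataclass(frozen=True)
class Token:
    modality: str      # "audio" or "video"
    start_ms: int      # timestamp of the token within the clip


def temporal_id(token: Token) -> int:
    """Map a timestamp to a temporal position ID shared by both modalities."""
    return token.start_ms // MS_PER_POSITION


def interleave(audio: list[Token], video: list[Token]) -> list[Token]:
    """Merge both streams chunk by chunk; inside a chunk, video comes first (assumed)."""
    def sort_key(tok: Token):
        chunk = tok.start_ms // CHUNK_MS
        modality_rank = 0 if tok.modality == "video" else 1
        return (chunk, modality_rank, tok.start_ms)
    return sorted(audio + video, key=sort_key)


# Hypothetical 4-second clip: one audio token per 40 ms, one video frame per 500 ms.
audio = [Token("audio", t) for t in range(0, 4000, 40)]
video = [Token("video", t) for t in range(0, 4000, 500)]
sequence = interleave(audio, video)
```

In this toy layout the sequence order and the temporal IDs are decoupled: tokens are placed chunk by chunk, but a video frame and an audio token that both start at 1000 ms still share the same temporal ID, which is the property the time-aligned embedding is meant to preserve.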
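Experimental Results
Research Questions
- RQ1: How can a single model effectively perceive and fuse text, audio, image, and video information end to end?
- RQ2: Can text and speech be generated in parallel in a streaming manner without cross-modal interference?
- RQ3: Which architectures and training strategies minimize initial latency while maintaining high task performance?
- RQ4: How does the model perform on multimodal benchmarks compared with single-modality models of similar size?
- RQ5: What effect do time alignment and interleaving have on video-audio understanding?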
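Key Findings
- Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench.
- The model's end-to-end speech instruction following is comparable to its capabilities with text inputs on benchmarks such as MMLU and GSM8K.
- Speech generation with the streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
- Compared with similarly sized models, Qwen2.5-Omni is competitive or better on text, audio, image, and video tasks: it is comparable with Qwen2.5-VL and outperforms Qwen2-Audio.
- The block-wise streaming encoders and the sliding-window DiT codec reduce the initial delay of streaming audio output.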
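To make the link between a restricted receptive field and initial delay concrete, here is a small sketch of a block-wise sliding-window attention mask. It is an illustration under stated assumptions, not the model's actual decoder: the function names are hypothetical, and the default window of two blocks of lookback and one block of lookahead is an assumed setting.

```python
def sliding_window_block_mask(num_tokens: int, block_size: int,
                              lookback: int = 2, lookahead: int = 1) -> list[list[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are grouped into consecutive blocks of `block_size`; a token sees
    its own block, `lookback` blocks before it, and `lookahead` blocks after it.
    """
    mask = []
    for i in range(num_tokens):
        block_i = i // block_size
        row = []
        for j in range(num_tokens):
            block_j = j // block_size
            row.append(block_i - lookback <= block_j <= block_i + lookahead)
        mask.append(row)
    return mask


def tokens_needed_for_block(block_index: int, block_size: int, lookahead: int = 1) -> int:
    """Number of tokens that must exist before `block_index` can be decoded."""
    return (block_index + lookahead + 1) * block_size


mask = sliding_window_block_mask(num_tokens=8, block_size=2)
first_block_wait = tokens_needed_for_block(0, block_size=2)
```

With full attention the decoder would need the whole token sequence before emitting any audio. With the windowed mask, the first block only waits for its own tokens plus the lookahead blocks, so the wait before the first audio packet depends on the window size rather than on the length of the utterance.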
This explainer was generated by AI and reviewed by a human editor.