QUICK REVIEW

[论文解读] Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu|arXiv (Cornell University)|Nov 14, 2023

Music and Audio Processing被引用 21

一句话总结

Qwen-Audio 开发了一个统一的多任务音频-语言模型，基于 30+ 任务和多种音频类型进行训练，在没有任务特定微调的情况下，在多样化基准上实现强零样本性能，并引入 Qwen-Audio-Chat 进行交互式对话。

ABSTRACT

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.

研究动机与目标

通过让单一模型处理多种音频类型（语音、自然声音、音乐、歌曲）和任务，来促成通用音频理解。
将预训练扩展到 30+ 任务和 8 种语言，以促进跨任务知识共享。
通过分层标签条件框架解决跨数据集的一对多标签干扰。
展示多任务预训练加 SRWT 能提升对齐、ASR 和 QA 能力。
提供开源基础以及一个指令微调的聊天变体（Qwen-Audio-Chat）用于多轮对话。

提出的方法

使用一个音频编码器（ Whisper-large-v2 初始化，640M 参数）来处理多种音频类型，并将表示输入到一个基于解码器的 LLM（Qwen-7B）以进行文本生成。
采用多任务预训练策略，使解码器在一系列分层标签上进行条件化，以实现知识共享并减少标签干扰。
引入带有逐词时间戳的语音识别（SRWT）任务，以改善跨任务的对齐和定位。
设计一种包含转录、语言、任务、输出语言、时间戳和输出指令标签的多任务训练格式，以统一多样数据集。
执行有监督微调以获得 Qwen-Audio-Chat，使其能够通过 ChatML 风格格式进行音频/文本输入和多轮对话。

Figure 1 : Performance of Qwen-Audio and previous top-tiers from multi-task audio-text learning models such as SpeechT5 (Ao et al., 2021 ) , SpeechNet (Chen et al., 2021 ) , SpeechLLaMA (Wu et al., 2023a ) , SALMONN (Anonymous, 2023 ) and Pengi (Deshmukh et al., 2023 ) . We demonstrate the test set

实验结果

研究问题

RQ1单一的多任务音频-语言模型在不进行任务特定微调的情况下，是否能有效学习 30+ 项任务，覆盖多种音频类型？
RQ2分层标签在多任务预训练中是否能缓解一对多标签干扰？
RQ3将 SRWT 纳入是否能提升对齐和在 ASR 及基于对齐的 QA 任务上的表现？
RQ4指令微调是否能产生一个对混合音频/文本对话具备鲁棒性的交互式聊天模型（Qwen-Audio-Chat）？

主要发现

Qwen-Audio 在多个数据集（如 Aishell1、cochlscene、ClothoAQA、VocalSound）上取得了最先进的结果，且无需任务特定微调。
在 ASR 基准测试中，Qwen-Audio 在 Librispeech test-clean 上达到 2.0% WER，在 test-other 上达到 4.2%，在普通话数据集上为 1.8/4.0，超越先前的多任务模型。
在 S2TT 中，CoVoST2 方向在多对语言对上相对于基线显示出显著的 BLEU 提升。
SRWT 集成在音频任务上提升了 ASR 的对齐和 QA 表现，并提升自然声音和音乐的 AQA 结果。
Qwen-Audio-Chat 使音频与文本输入的多轮对话成为可能，通过交互案例和开源演示进行展示。

Figure 2 : Examples of Qwen-Audio showcasing its proficiency in perceiving and comprehending various types of audio. Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage for speech editing. Demos are available at https://qwen-audio.github

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。