QUICK REVIEW

[論文レビュー] Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu|arXiv (Cornell University)|Nov 14, 2023

Music and Audio Processing被引用数 21

ひとこと要約

Qwen-Audioは、30以上のタスクと音声タイプで訓練された統一的なマルチタスク音声言語モデルを開発し、タスク固有のファインチューニングなしで多様なベンチマークで強力なゼロショット性能を達成し、対話型のQwen-Audio-Chatを導入します。

ABSTRACT

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.

研究の動機と目的

単一のモデルが多様な音声タイプ（スピーチ、自然音、音楽、歌）とタスクを処理できるようにすることで、普遍的な音声理解を促進する。
事前学習を30以上のタスクと8つの言語にスケールさせ、タスク間の知識共有を促進する。
データセット間のワン・ツー・マニーなラベル干渉を階層的タグ条件付けフレームワークで対処する。
マルチタスク事前学習とSRWTを組み合わせると、 grounding（グラウンディング）およびASRとQA能力が向上することを示す。
オープンソースの基盤と、マルチターン対話のための指示調整済みチャット変種（Qwen-Audio-Chat）を提供する。

提案手法

複数の音声タイプを処理する単一の音声エンコーダー（ Whisper-large-v2 初期化、640M パラメータ）を使用し、表現をデコーダベースのLLM（Qwen-7B）へ供給してテキスト生成を行う。
知識共有を可能にし、ラベル干渉を減らすために、デコーダを階層的タグのシーケンスで条件付けるマルチタスク事前訓練戦略を採用する。
タスク間の整合性とグラウンディングを向上させるため、Word-level Timestamps を使った Speech Recognition (SRWT) タスクを導入する。
転写、言語、タスク、出力言語、タイムスタンプ、出力指示タグを含むマルチタスク訓練形式を設計し、多様なデータセットを統合する。
監視付きファインチューニングを行いQwen-Audio-Chatを得て、音声/テキスト入力とChatMLスタイルのフォーマットによるマルチターン対話を可能にする。

Figure 1 : Performance of Qwen-Audio and previous top-tiers from multi-task audio-text learning models such as SpeechT5 (Ao et al., 2021 ) , SpeechNet (Chen et al., 2021 ) , SpeechLLaMA (Wu et al., 2023a ) , SALMONN (Anonymous, 2023 ) and Pengi (Deshmukh et al., 2023 ) . We demonstrate the test set

実験結果

リサーチクエスチョン

RQ1タスク固有のファインチューニングなしで、単一のマルチタスク音声言語モデルが多様な音声タイプを横断して30以上のタスクを効果的に学習できるか？
RQ2階層的タグ付けはマルチタスク事前訓練中のワン・ツー・マニーなラベル干渉を緩和するか？
RQ3SRWTの組み込みはASRおよびグラウンディングベースのQAタスクでのグラウンディングと性能を向上させるか？
RQ4指示調整済みファインチューニングは、混合音声/テキスト対話に対して堅牢な対話型チャットモデル（Qwen-Audio-Chat）を生み出せるか？

主な発見

Qwen-Audioは、タスク固有のファインチューニングなしで、いくつかのデータセット（例：Aishell1、cochlscene、ClothoAQA、VocalSound）で最先端の結果を達成する。
ASRベンチマークでは、Librispeechのtest-cleanで2.0% WER、test-otherで4.2%、Mandarinデータセットで1.8/4.0を達成し、従来のマルチタスクモデルを上回る。
S2TTでは、CoVoST2の方向性が、複数の言語ペアにおいて基準より有意なBLEUの改善を示す。
SRWTの統合は、音声タスクにおけるASRグラウンディングとQA性能を改善し、自然音と音楽に対するAQA結果を強化する。
Qwen-Audio-Chatは、音声とテキスト入力を用いたマルチターン対話を可能にし、対話的ケースとオープンソースデモで実証されている。

Figure 2 : Examples of Qwen-Audio showcasing its proficiency in perceiving and comprehending various types of audio. Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage for speech editing. Demos are available at https://qwen-audio.github

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。