QUICK REVIEW

[Paper Review] Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Jinchuan Tian, Haoran Wang|arXiv (Cornell University)|Feb 5, 2026

Music and Audio Processing | Citations: 0

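One-Line Summary

Bagpiper is an 8B audio foundation model that uses rich captions as a universal semantic interface to handle understanding and generation jointly for open-ended audio tasks. It learns a strong bidirectional mapping between audio and captions and outperforms prior models in generation quality.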

ABSTRACT

Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding generation for general audio. Model, data, and code are available at Bagpiper Home Page.

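Motivation and Objectives

Promote a universal, holistic approach to open-ended audio tasks by grounding both understanding and generation in rich natural-language captions.
Study how rich captions act as a bidirectional bridge between physical audio signals and cognitive concepts.
Evaluate Bagpiper's pre-training and supervised fine-tuning against baselines on both understanding and generation tasks.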

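Proposed Method

Uses an Encoder-Adaptor-LLM architecture, initialized from the Qwen-3 family, to process audio and text.
Pre-trains on 600B tokens, mixing 300B text-to-audio, 150B audio-to-text, and 150B text-only tokens, to learn a bidirectional mapping between audio and rich captions.
Generates a rich caption for each audio clip and solves open-ended tasks through a caption-then-process data flow that includes chain-of-thought (CoT) reasoning.
Fine-tunes on data built with a data collection and Gemini captioning pipeline, which creates and filters 845k understanding samples and 1.47M generation samples.
Applies classifier-free guidance for audio generation and uses an audio codec token vocoder for waveform reconstruction.
Evaluates against strong baselines using bidirectional mapping probes, cycle-consistency tests, and open-ended task benchmarks.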
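
Two of the steps above involve simple arithmetic that a short sketch can make concrete: the pre-training mixture and classifier-free guidance. The Python below is a minimal illustration, not the authors' implementation. It assumes that the 300B : 150B : 150B token budgets translate directly into per-stream sampling probabilities, and that guidance is applied to next-token logits with the common formula `uncond + scale * (cond - uncond)`. The names `PRETRAIN_BUDGET_B`, `mixture_weights`, `sample_stream`, and `cfg_logits` are hypothetical, and the review does not state a guidance scale.

```python
import random

# Token budgets per pre-training stream, in billions, as listed in the review.
PRETRAIN_BUDGET_B = {
    "text_to_audio": 300,
    "audio_to_text": 150,
    "text_only": 150,
}


def mixture_weights(budget):
    """Convert per-stream token budgets into sampling probabilities.

    Assumption: streams are sampled in proportion to their token budget.
    """
    total = sum(budget.values())
    return {name: tokens / total for name, tokens in budget.items()}


def sample_stream(weights, rng):
    """Pick the stream that the next training example is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def cfg_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits with classifier-free guidance.

    scale = 1.0 returns the conditional logits unchanged; larger values push
    the prediction further away from the unconditional one.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

Under the proportional-sampling assumption, `mixture_weights(PRETRAIN_BUDGET_B)` assigns half of the training examples to text-to-audio and a quarter each to audio-to-text and text-only data. In `cfg_logits`, `cond` would come from a forward pass with the caption in the prompt and `uncond` from a pass with the caption dropped.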

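Experimental Results

Research Questions

RQ1: Can rich captions enable a single model that unifies understanding and generation without task-specific priors?
RQ2: How much recognition and generation information does the bidirectional mapping between audio signals and rich captions preserve?
RQ3: Do pre-training and SFT make the model competitive with task-specialized models on audio understanding benchmarks and generation quality?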

Key Findings

Model	Parameters	WER (↓)	MMAU-Mini (↑)	AIR-Bench-chat	AudioBench
Qwen3-Captioner 30B-A3B	-	5.5	71.1	-	-
Bagpiper-Base 8B	8B	5.0	69.0	-	-
Bagpiper 8B (fine-tuned)	8B	2.5	74.5	6.57	70.39

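Bagpiper-Base (8B) is on par with Qwen3-Captioner (30B) on the understanding probes, indicating strong bidirectional translation between audio and rich captions.
When prompted with rich captions, Bagpiper-Base matches or exceeds specialized baselines in audio generation fidelity, including TTS-like and text-to-audio (TTA) scenarios.
After fine-tuning, Bagpiper outperforms the 7B Qwen-2.5-Omni on AIR-Bench and AudioBench, performing strongly on open-ended understanding while remaining competitive on generation tasks.
For audio understanding after fine-tuning, Bagpiper reaches a WER of 2.5 and scores 74.5 on the MMAU-Mini open-ended evaluation, surpassing some baselines in the unified-task setting.
For text-to-speech generation, Bagpiper (8B) achieves a WER of 2.7 on LibriSpeech Test-Clean, outperforming CosyVoice3 in this setting.
Bagpiper supports compositional generation that combines multiple speakers, music, and sound effects, and outperforms baselines on long, instruction-heavy prompts.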
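
Several of the findings above are reported as word error rate (WER). As a reminder of what that number measures, the sketch below computes WER as the word-level edit distance between a reference and a hypothesis, divided by the number of reference words and expressed as a percentage. It is a minimal illustration with a hypothetical function name, `word_error_rate`, and it only lowercases and splits on whitespace. The text normalization and the ASR system used to transcribe generated speech in the paper are not described in this review, so scores from this sketch are not directly comparable to the table.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For a TTS evaluation, `reference` would typically be the input text and `hypothesis` the ASR transcript of the synthesized speech. For an understanding evaluation, `hypothesis` would be the transcription the model produces.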

This review was generated by AI and checked by a human editor.