QUICK REVIEW

[Paper Digest] Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Jinchuan Tian, Haoran Wang|arXiv (Cornell University)|Feb 5, 2026

Music and Audio Processing | Cited by 0

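One-Sentence Summary

Bagpiper is an 8B audio foundation model that uses rich captions as a universal semantic interface to handle understanding and generation jointly for open-ended audio tasks. It shows strong bidirectional audio-caption mapping and generation quality that surpasses prior models.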

ABSTRACT

Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding generation for general audio. Model, data, and code are available at Bagpiper Home Page.

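Motivation and Objectives

Unify understanding and generation in a single general, holistic approach to open-ended audio tasks, grounded in rich natural-language captions.
Understand how rich captions act as a bidirectional bridge between physical audio signals and cognitive concepts.
Evaluate Bagpiper's pre-training and supervised fine-tuning (SFT) against baselines on audio understanding and generation tasks.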

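Proposed Method

Use an Encoder-Adaptor-LLM architecture, initialized from the Qwen-3 family, to process audio and text.
Pre-train on 600B tokens, mixing 300B text-to-audio, 150B audio-to-text, and 150B text-only tokens, to learn the bidirectional mapping between audio and rich captions.
Generate rich captions for audio clips and train with a caption-then-process workflow, including chain-of-thought (CoT) reasoning, to solve open-ended tasks.
Fine-tune on data built through a collection and Gemini-captioning pipeline, which creates and filters 845k understanding samples and 1.47M generation samples.
Apply classifier-free guidance to audio generation, and reconstruct waveforms with a vocoder that operates on audio codec tokens.
Evaluate with bidirectional-mapping probes, cycle-consistency tests, and open-ended task benchmarks against strong baselines.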
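
Two of the method steps above reduce to simple arithmetic and can be sketched numerically: the pre-training token mixture and classifier-free guidance. The snippet below is a minimal illustration, not the authors' implementation. The token budgets come from this digest; the function names, the choice to apply guidance to next-token logits, and the guidance scale of 3.0 are assumptions made only for the example.

```python
from typing import Dict, List

# Token budgets in billions, as reported in this digest.
PRETRAIN_MIX_B: Dict[str, float] = {
    "text_to_audio": 300.0,
    "audio_to_text": 150.0,
    "text_only": 150.0,
}


def mixture_weights(budgets: Dict[str, float]) -> Dict[str, float]:
    """Convert per-source token budgets into sampling proportions."""
    total = sum(budgets.values())
    return {name: tokens / total for name, tokens in budgets.items()}


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Textbook classifier-free guidance: uncond + scale * (cond - uncond).

    Assumption: guidance is applied to next-token logits over audio codec
    tokens. The digest does not say where or with what scale it is applied.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Half of the budget is text-to-audio; the rest is split evenly.
    print(mixture_weights(PRETRAIN_MIX_B))
    # Toy logits for a three-token vocabulary; scale=3.0 is illustrative.
    print(cfg_logits([2.0, 0.5, -1.0], [1.0, 0.5, 0.0], scale=3.0))
```

With a scale of 1.0 the formula reduces to the conditional logits; larger scales move the prediction further from the unconditional one, which typically trades diversity for closer adherence to the caption.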

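Experimental Results

Research Questions

RQ1: Can rich captions enable a unified model to understand and generate across diverse open-ended audio tasks without task-specific priors?
RQ2: How well does the bidirectional mapping between audio signals and rich captions preserve information, as measured by recognition and generation performance?
RQ3: Can pre-training and SFT compete with task-specific models on audio understanding benchmarks and generation quality?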
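
RQ2 concerns how much information survives a round trip between captions and audio. One way to picture such a cycle-consistency probe is sketched below. It is a toy illustration under stated assumptions, not the paper's protocol: `generate` and `describe` are hypothetical stand-ins for the caption-to-audio and audio-to-caption directions, and Jaccard token overlap is a simple substitute for whatever similarity measure the authors actually use.

```python
from typing import Callable, Set

Audio = bytes  # placeholder type for a waveform or codec-token sequence


def token_set(text: str) -> Set[str]:
    """Lowercase, whitespace-split bag of unique tokens."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two captions."""
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def caption_cycle_score(
    caption: str,
    generate: Callable[[str], Audio],
    describe: Callable[[Audio], str],
) -> float:
    """Caption -> audio -> caption, scored against the original caption."""
    audio = generate(caption)
    recaption = describe(audio)
    return jaccard(caption, recaption)


def fake_generate(caption: str) -> Audio:
    """Hypothetical generator: just encodes the caption as bytes."""
    return caption.encode("utf-8")


def fake_describe(audio: Audio) -> str:
    """Hypothetical captioner that loses the word 'loudly'."""
    return audio.decode("utf-8").replace("loudly", "")


if __name__ == "__main__":
    score = caption_cycle_score(
        "a dog barks loudly near a road", fake_generate, fake_describe
    )
    print(f"cycle score: {score:.3f}")
```

A score near 1.0 means the re-captioned audio keeps most of the original caption's content; the toy captioner above drops one of six unique tokens, so its score falls below 1.0.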

Key Findings

Model	Param.	WER (↓)	MMAU-Mini (↑)	AIR-Bench-chat	AudioBench
Qwen3-Captioner 30B-A3B	-	5.5	71.1	-	-
Bagpiper-Base 8B	8B	5.0	69.0	-	-
Bagpiper (fine-tuned) 8B	8B	2.5	74.5	6.57	70.39

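Bagpiper-Base (8B) is comparable to Qwen3-Captioner (30B) on understanding probes (WER 5.0 vs. 5.5; MMAU-Mini 69.0 vs. 71.1), indicating strong bidirectional translation between audio and rich captions.
When prompted with rich captions, Bagpiper-Base generates audio with fidelity comparable to or better than specialized baselines, in both TTS-like and text-to-audio (TTA) settings.
After fine-tuning, Bagpiper outperforms the 7B Qwen-2.5-Omni on open-ended understanding tasks in AIR-Bench and AudioBench while remaining competitive on generation tasks.
On audio understanding after fine-tuning, Bagpiper reaches a WER of 2.5 and an MMAU-Mini open-ended score of 74.5, outperforming some baselines in the unified-task setting.
Bagpiper (8B) text-to-speech generation reaches a WER of 2.7 on LibriSpeech Test-Clean, better than CosyVoice3 in this setting.
Bagpiper supports compositional generation that combines multiple speakers, music, and sound effects, and outperforms baselines on long, instruction-dense prompts.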
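
Several findings above are stated as word error rate (WER). For readers unfamiliar with the metric, the sketch below computes it as word-level edit distance divided by reference length. It assumes the reported values are percentages and uses only lowercasing and whitespace tokenization; the text normalization behind the paper's numbers is not described in this digest.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn the volume down", "turn the volume up"))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.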

This digest was generated by AI and reviewed by human editors.