QUICK REVIEW

[Paper Digest] WhispSynth: Scaling Multilingual Whisper Corpus through Real Data Curation and A Novel Pitch-free Generative Framework

Tianyi Tan, Jiaxin Ye|arXiv (Cornell University)|Mar 16, 2026

Speech Recognition and Synthesis | Cited by 0

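One-Sentence Summary

WhispSynth builds a large multilingual whisper corpus from curated real data (WhispReal) and a pitch-free DDSP-TTS pipeline, enabling high-fidelity text-to-whisper synthesis and a fine-tuned CosyWhisper model whose naturalness approaches that of real whispers.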

ABSTRACT

Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline integrating Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that WhispSynth exhibits significantly higher quality than existing corpora. Moreover, our CosyWhisper, tuned with WhispSynth, achieves speech naturalness on par with ground-truth samples. The official implementation and related resources are available at https://github.com/tan90xx/cosywhisper.

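Motivation and Objectives

Address the data bottleneck caused by whispered speech being low in amplitude and hard to record.
Build WhispReal by curating six public whisper corpora together with a newly collected Mandarin dataset (WhispNJU).
Develop a scalable data engine (WhispSynth) that yields 118 hours of high-fidelity whispered speech from 479 speakers.
Propose a pitch-free generative framework that combines DDSP-based pitch-free processing with TTS models while preserving timbre and linguistic content.
Show that training on WhispSynth improves whisper synthesis quality and brings CosyWhisper close to ground-truth naturalness.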

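Proposed Method

Create WhispReal by pooling six public corpora and adding WhispNJU, with standardized splits and metadata.
Use CosyVoice3 as the TTS backbone for whisper synthesis.
Apply a DDSP-based pitch-free post-processing pipeline that removes residual pitch from synthesized whispers while preserving their noise-like character.
Train the pitch-free DDSP model with adversarial training and a two-stage schedule (normal speech first, then whispered speech).
Fine-tune CosyVoice3 into CosyWhisper, focusing on mapping semantic tokens to whisper acoustic features while keeping the LM and HiFi-GAN frozen.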
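
To make the "pitch-free" idea above more concrete, here is a minimal sketch. It assumes that pitch-free synthesis can be read, in DDSP terms, as a noise-only branch: white noise is shaped by a filter and an amplitude envelope, with no harmonic oscillator that could introduce a fundamental frequency. This is an illustrative assumption, not the authors' implementation. The function name `filtered_noise`, the fixed FIR taps, and the fade-in envelope are all hypothetical; in a trained DDSP model the filter would vary over time and be predicted by a network.

```python
import random


def filtered_noise(n_samples, fir_taps, envelope, seed=0):
    """Noise-only synthesis: white noise -> FIR filter -> amplitude envelope.

    No periodic (harmonic) excitation is used, so the output carries no pitch.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    out = []
    for n in range(n_samples):
        # Direct-form FIR convolution; samples before n = 0 count as zero.
        acc = 0.0
        for k, tap in enumerate(fir_taps):
            if n - k >= 0:
                acc += tap * noise[n - k]
        out.append(acc * envelope(n))
    return out


# Hypothetical settings: a gentle low-pass filter and a 100-sample fade-in.
taps = [0.25, 0.5, 0.25]


def fade_in(n):
    return min(1.0, n / 100.0)


audio = filtered_noise(1600, taps, fade_in)
```

The fixed three-tap filter only keeps the example short. The point is the structure: every output sample is a weighted sum of random noise samples, so nothing in the signal path can produce a periodic waveform.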

Figure 1: Different Dynamic Range. Whispers exhibit a significantly lower sound pressure level compared to normal speech, even when the linguistic content is identical.
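
The level gap described in the caption can be expressed in decibels. The sketch below computes a relative RMS level (dB relative to full scale, not calibrated sound pressure level, which would require a microphone calibration reference). The two signals and the 10x amplitude ratio are hypothetical values chosen only to show the arithmetic; they are not measurements from the paper.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of a signal in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Floor the RMS to avoid log10(0) on an all-zero signal.
    return 20.0 * math.log10(max(rms, 1e-12))


# Hypothetical signals: a 200 Hz tone sampled at 16 kHz stands in for normal
# speech, and the same waveform scaled down 10x stands in for a quieter signal.
normal = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(16000)]
quiet = [0.1 * s for s in normal]

gap_db = rms_dbfs(normal) - rms_dbfs(quiet)  # 20 * log10(10) = 20 dB
```

A 10x drop in amplitude corresponds to a 20 dB drop in level, which illustrates why a low-amplitude signal sits much closer to the recording noise floor.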

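Experimental Results

Research Questions

RQ1: How can a large-scale, scalable multilingual whisper corpus be built from real data while preserving licensing compliance and timbre fidelity?
RQ2: Can combining pitch-free DDSP post-processing with a modern TTS model (CosyVoice3) produce high-quality whispers from real data or text?
RQ3: Compared with real whisper corpora, does training on WhispSynth improve the intelligibility and naturalness of whispers?
RQ4: How does CosyWhisper trained on WhispSynth compare with real whispers in naturalness and intelligibility?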

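Key Findings

WhispReal pools multiple public whisper corpora with the new WhispNJU dataset into a 118-hour collection with standardized splits.
WhispSynth combines DDSP pitch-free processing with CosyVoice3 to produce a high-fidelity whisper corpus (about 118 hours) that preserves timbre and content.
WhispSynth improves text-to-whisper intelligibility (lower CER/WER) while remaining competitive with real whisper datasets in naturalness (DNSMOS/UTMOS).
After fine-tuning on WhispSynth, CosyWhisper reaches whisper naturalness close to that of real whispers and surpasses CosyVoice3 in whisper realism.
Ablation studies show that WhispSynth outperforms the baselines in naturalness and intelligibility, with significant gains in CER/WER and in VTR control.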
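
The intelligibility metrics named above, CER and WER, are conventionally defined as the edit distance between a reference transcript and a recognizer's output, divided by the reference length. The following is a minimal sketch of that standard definition. It is not the paper's evaluation code, and the example sentences are hypothetical; published evaluations also typically normalize text (case, punctuation) before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical transcripts: one substituted word out of three.
example_wer = wer("the cat sat", "the cat sit")
example_cer = cer("the cat sat", "the cat sit")
```

Lower values indicate that an ASR system transcribes the synthesized whisper more accurately, which is the sense in which a drop in CER/WER signals better intelligibility.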

Figure 2: Visualization of WhispSynth’s generation pipeline. We apply CosyVoice3 and a DDSP model to generate pitch-free voice.

This digest was generated by AI and reviewed by a human editor.