QUICK REVIEW

[Paper Review] WhispSynth: Scaling Multilingual Whisper Corpus through Real Data Curation and A Novel Pitch-free Generative Framework

Tianyi Tan, Jiaxin Ye|arXiv (Cornell University)|Mar 16, 2026

Speech Recognition and Synthesis|Citations: 0

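One-line summary

WhispSynth builds a large-scale multilingual whisper corpus by combining real-data curation (WhispReal) with a pitch-free DDSP-TTS pipeline, enabling high-fidelity text-to-whisper synthesis and providing the data used to fine-tune the CosyWhisper model.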

ABSTRACT

Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline integrating Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that WhispSynth exhibits significantly higher quality than existing corpora. Moreover, our CosyWhisper, tuned with WhispSynth, achieves speech naturalness on par with ground-truth samples. The official implementation and related resources are available at https://github.com/tan90xx/cosywhisper.

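Motivation and objectives

Resolve the data bottleneck caused by whispered speech, whose low amplitude makes it hard to record.
Curate multiple public whisper corpora and a new Mandarin dataset (WhispNJU) to build WhispReal.
Develop a scalable data engine (WhispSynth) that produces 118 hours of high-fidelity whispered speech from 479 speakers.
Propose a generative framework that combines pitch-free DDSP-based processing with TTS models while preserving timbre and linguistic content.
Show that training on WhispSynth improves whisper synthesis quality and that CosyWhisper reaches naturalness close to ground truth.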

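Proposed method

Build WhispReal by merging six public corpora and adding WhispNJU; provide standardized splits and metadata.
Use CosyVoice3 as the TTS backbone for whisper synthesis.
Apply a DDSP-based pitch-free post-processing pipeline that removes residual pitch from synthesized whispers while retaining their noise-like quality.
Train the pitch-free DDSP vocoder with adversarial training and a two-stage schedule (normal speech first, then whispered speech).
Fine-tune CosyVoice3 on WhispSynth to obtain CosyWhisper, keeping the LM and HiFi-GAN frozen and focusing on the conversion of semantic tokens into whisper-style acoustic features.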
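
To make "residual pitch" concrete, the following is a minimal diagnostic sketch, not the paper's DDSP pipeline. It scores how periodic a frame is using the peak of its normalized autocorrelation over lags that correspond to a plausible pitch range. The function name `periodicity`, the 60 to 400 Hz search range, the 1024-sample frame, and the two synthetic stand-in signals (a pure tone for voiced speech, white noise for a whisper-like signal) are all assumptions chosen for illustration.

```python
import math
import random


def periodicity(frame, sample_rate=16000, f_min=60.0, f_max=400.0):
    """Peak normalized autocorrelation over lags covering f_min..f_max (hypothetical helper)."""
    n = len(frame)
    mean = sum(frame) / n
    centered = [s - mean for s in frame]
    energy = sum(s * s for s in centered)
    if energy == 0.0:
        return 0.0  # silent frame: no periodicity to measure
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        best = max(best, corr / energy)
    return best


# Stand-in signals (assumptions): a 200 Hz tone and seeded Gaussian noise.
voiced = [math.sin(2 * math.pi * 200.0 * i / 16000) for i in range(1024)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]

print(round(periodicity(voiced), 3), round(periodicity(noise), 3))
```

A strongly pitched frame scores close to 1, while a noise-like frame scores close to 0, so a score of this kind could serve as a rough check that post-processed whispers no longer carry pitch.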

Figure 1: Different Dynamic Range. Whispers exhibit a significantly lower sound pressure level compared to normal speech, even when the linguistic content is identical.
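
The level gap described in the caption can be illustrated with a small sketch that compares the RMS level of two digital signals in dB relative to full scale (dBFS). This is only an analogy: true sound pressure level requires a calibrated microphone, and the 0.5 and 0.05 amplitudes below are invented values, not measurements from the paper.

```python
import math


def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        raise ValueError("samples must be non-empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def sine(freq_hz, amplitude, sample_rate=16000, duration_s=0.1):
    """Generate a test tone (stand-in for a speech signal)."""
    n = int(sample_rate * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


normal_like = sine(200.0, 0.5)                  # assumed "normal speech" level
whisper_like = [0.1 * s for s in normal_like]   # same content, 10x lower amplitude

gap_db = rms_dbfs(normal_like) - rms_dbfs(whisper_like)
print(round(gap_db, 2))  # 20.0
```

A tenfold drop in amplitude corresponds to a 20 dB drop in level, which leaves far less headroom above the recording noise floor.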

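Experimental results

Research questions

RQ1: Can a large-scale multilingual whispered-speech corpus be built from real data while respecting licensing and preserving timbre fidelity?
RQ2: Can a pitch-free DDSP-based post-processing approach combined with a modern TTS model (CosyVoice3) generate high-quality whispered speech from real data or text?
RQ3: Does training on WhispSynth improve intelligibility and naturalness compared with real whisper corpora?
RQ4: How does CosyWhisper trained on WhispSynth compare with ground-truth whispers in naturalness and intelligibility?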

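Key findings

WhispReal merges multiple whisper corpora with WhispNJU into a 118-hour dataset with standardized splits.
WhispSynth combines pitch-free DDSP processing with CosyVoice3 to generate a high-fidelity whisper corpus (about 118 hours) that preserves timbre and content.
WhispSynth improves text-to-whisper intelligibility (lower CER/WER) while remaining competitive in naturalness (DNSMOS/UTMOS) with real whisper datasets.
CosyWhisper, fine-tuned on WhispSynth, matches ground-truth whispers in naturalness and outperforms CosyVoice3 in whisper realism.
Ablations show that WhispSynth outperforms the baselines in naturalness and intelligibility, with marked gains in CER/WER and VTR control.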
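
For readers unfamiliar with the intelligibility metrics cited above, here is a minimal sketch of how WER and CER are conventionally computed from a reference transcript and an ASR hypothesis using Levenshtein edit distance. The review does not state which ASR model or text normalization the paper uses, so the example strings and the choice to strip spaces before computing CER are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumed normalization)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


reference = "please speak very quietly"   # invented example
hypothesis = "please speak quietly"       # invented ASR output with one dropped word

print(wer(reference, hypothesis))            # 0.25
print(round(cer(reference, hypothesis), 3))  # 0.182
```

Lower values mean the recognizer recovered more of the intended text. CER is typically reported for languages such as Mandarin, where words are not delimited by whitespace.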

Figure 2: Visualization of WhispSynth's generation pipeline. We apply CosyVoice3 and a DDSP model to generate pitch-free voice.


This review was generated by AI and checked by a human editor.