QUICK REVIEW

[Paper Review] WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Binbin Zhang, Hang Lv | arXiv (Cornell University) | Oct 7, 2021

Speech Recognition and Synthesis | Cited by 32

One-line summary

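The paper introduces WenetSpeech, a 22435-hour multi-domain Mandarin corpus that includes 10005 hours of strongly labeled speech, 2478 hours of weakly labeled speech, and accompanying evaluation sets. It presents OCR-based and ASR-based data collection pipelines, an end-to-end label error detection method, and baselines built with Kaldi, ESPnet, and WeNet.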

ABSTRACT

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data on its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation purpose in training, Test_Net, collected from Internet for matched test, and Test_Meeting, recorded from real meetings for more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the current largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Motivation and objectives

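Motivate the need for a large, diverse Mandarin ASR corpus that reflects real-world conditions and domain variety.
Provide a scalable pipeline for collecting, aligning, and validating audio/text segments from YouTube and Podcast data.
Release baseline benchmarks and evaluation sets for the research community.
Enable supervised and semi-supervised training through clearly annotated, confidence-based data partitions.
Release the corpus as open source with extensible metadata, under CC-BY 4.0 for non-commercial use.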

Proposed method

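Staged data collection from YouTube (OCR-based caption extraction) and Podcast (high-quality ASR transcription).
Label error detection that uses CTC-based forced alignment and a purpose-built forced decoding graph to find transcription errors.
Confidence scoring that partitions the data into Strong Label, Weak Label, and Others for use in training and validation.
Extensive metadata in JSON format, including per-segment confidence and source-domain tags.
Baseline models and results for the Kaldi, ESPnet, and WeNet toolkits.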
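
The confidence-based partition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that a segment's confidence is one minus the normalized edit distance between the candidate transcript and a decoded hypothesis, and the thresholds (0.95 and 0.6) as well as the JSON field names (`sid`, `text`, `hyp`) are hypothetical choices that this review does not state.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def confidence(ref, hyp):
    """Assumed definition: 1 - normalized edit distance, floored at 0."""
    if not ref:
        return 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


# Hypothetical thresholds, chosen only for illustration.
STRONG_THRESHOLD = 0.95
WEAK_THRESHOLD = 0.6


def partition(conf):
    """Map a confidence score to one of the three subsets."""
    if conf >= STRONG_THRESHOLD:
        return "Strong Label"
    if conf >= WEAK_THRESHOLD:
        return "Weak Label"
    return "Others"


def label_segments(metadata_json):
    """Attach a confidence and subset to each segment in a JSON list.

    Each segment is assumed to carry 'text' (candidate transcript)
    and 'hyp' (decoded hypothesis); both are compared per character.
    """
    labeled = []
    for seg in json.loads(metadata_json):
        conf = confidence(list(seg["text"]), list(seg["hyp"]))
        labeled.append({**seg, "confidence": conf, "subset": partition(conf)})
    return labeled
```

Calling `label_segments` on a JSON string of segments returns the same records with `confidence` and `subset` fields added, so strongly labeled data can be selected for supervised training and weakly labeled data kept for semi-supervised experiments.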

Fig. 1: OCR-based YouTube data collection pipeline

Experimental results

Research questions

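RQ1 How large and diverse must a Mandarin ASR corpus be to support production-like robustness?
RQ2 Can OCR and high-quality ASR transcription pipelines, combined with end-to-end label error detection, produce high-quality audio/text pairs from web data?
RQ3 What benefits does the split into strong and weak labels bring to supervised and semi-supervised training for Mandarin ASR?
RQ4 How well do the Kaldi, ESPnet, and WeNet baselines perform on the WenetSpeech evaluation sets?
RQ5 Which benchmarks and evaluation sets best reflect real-world Mandarin ASR challenges (Dev, Test_Net, Test_Meeting)?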

Key findings

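WenetSpeech contains 22435 hours of audio: 10005 hours of Strong Label data, 2478 hours of Weak Label data, and about 9952 hours classified as Others.
MER (%) improves across toolkits as the amount of training data grows, which shows the benefit of data scale (Table 5).
MER (%) on Dev, Test_Net, Test_Meeting, and AIShell-1, in that order: Kaldi 9.07, 12.83, 24.72, 5.41; ESPnet 9.70, 8.90, 15.90, 3.90; WeNet 8.88, 9.70, 15.59, 4.61 (Table 5).
Kaldi trained on subset L reaches MER (%) of 9.07, 12.83, 24.72, and 5.41 on Dev, Test_Net, Test_Meeting, and AIShell-1; this result is used to examine the effect of data scale (Table 6).
The dataset is one of the largest open-source Mandarin corpora to date and enables ASR research that generalizes better.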
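
To make the MER figures above concrete, the sketch below shows one way such a score could be computed. It assumes MER stands for mixture error rate, with each Mandarin character and each English word counted as one token; the review itself does not define the metric, and the `tokenize` and `mer` helpers are hypothetical names, not part of Kaldi, ESPnet, or WeNet.

```python
import re

# One CJK character, or a run of Latin letters/digits/apostrophes, per token.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[a-z0-9']+")


def tokenize(text):
    """Split mixed Mandarin/English text into characters and words."""
    return TOKEN_PATTERN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mer(references, hypotheses):
    """Corpus-level error rate in percent over paired utterances."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be paired")
    errors = 0
    total = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref = tokenize(ref_text)
        errors += edit_distance(ref, tokenize(hyp_text))
        total += len(ref)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total
```

Errors are summed over the whole test set before dividing by the total number of reference tokens, so long utterances weigh more than short ones. This is the usual convention for corpus-level error rates.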


This review was generated by AI and checked by a human editor.