QUICK REVIEW

[Paper Review] AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale

Jiayu Du, Xingyu Na | arXiv (Cornell University) | Aug 31, 2018

Speech Recognition and Synthesis | References: 8 | Cited by: 201

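One-Sentence Summary

AISHELL-2 provides 1000 hours of Mandarin speech recorded on iPhone, together with a Kaldi-based, end-to-end, industrial-scale ASR recipe that covers lexicon and text processing, the feature pipeline, and LF-MMI TDNN models, plus multi-channel evaluation data.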

ABSTRACT

AISHELL-1 is by far the largest open-source speech corpus available for Mandarin speech recognition research. It was released with a baseline system containing solid training and testing pipelines for Mandarin ASR. In AISHELL-2, 1000 hours of clean read-speech data from iOS is published, which is free for academic usage. On top of AISHELL-2 corpus, an improved recipe is developed and released, containing key components for industrial applications, such as Chinese word segmentation, flexible vocabulary expansion and phone set transformation etc. Pipelines support various state-of-the-art techniques, such as time-delayed neural networks and Lattice-Free MMI objective function. In addition, we also release dev and test data from other channels (Android and Mic). For research community, we hope that AISHELL-2 corpus can be a solid resource for topics like transfer learning and robust ASR. For industry, we hope AISHELL-2 recipe can be a helpful reference for building meaningful industrial systems and products.

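Research Motivation and Objectives

Provide a large-scale, open Mandarin ASR corpus for academic research and for industry-oriented baselines.
Deliver a Kaldi-based end-to-end ASR recipe that includes the lexicon, word segmentation, and language modeling.
Report performance on several acoustic channels (iOS, Android, Mic) and establish a scalable training pipeline.
Encourage transfer-learning and robustness research for Mandarin ASR in industrial settings.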

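Proposed Method

Release of the AISHELL-2 corpus: 1000 hours of Mandarin read speech recorded on iOS, along with multi-channel dev/test data.
A Mandarin word-segmentation pipeline built on the DaCiDian dictionary and the Jieba segmentation toolkit.
Initial GMM-HMM training, followed by TDNN acoustic modeling with the LF-MMI objective and i-vector conditioning.
A trigram language model in ARPA format with Kneser-Ney smoothing, trained on 5.7 million words.
A self-contained Kaldi-based baseline recipe covering data preparation, vocabulary, language model, GMM and DNN training, and evaluation.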
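
The lexicon-driven segmentation step listed above can be illustrated with a minimal sketch. The recipe relies on Jieba with the DaCiDian vocabulary; the code below does not call Jieba and instead uses plain forward maximum matching, a simpler stand-in technique, over a tiny hypothetical word list. The function name `segment`, the `max_word_len` parameter, the toy vocabulary, and the example sentence are all assumptions made for illustration.

```python
def segment(text, vocab, max_word_len=4):
    """Forward maximum matching: at each position take the longest vocab word.

    A stand-in for the recipe's Jieba-based segmentation, not its actual
    algorithm. Characters not covered by any vocab word are emitted alone.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy vocabulary standing in for a DaCiDian-style word list.
vocab = {"语音", "识别", "研究"}
before = segment("语音识别研究", vocab)  # ['语音', '识别', '研究']

# Vocabulary expansion: adding a word changes how the same text is split.
vocab.add("语音识别")
after = segment("语音识别研究", vocab)   # ['语音识别', '研究']
```

The point of the sketch is that segmentation output depends on the word list, so extending the vocabulary changes the word sequence that the lexicon and language model see.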

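Experimental Results

Research Questions

RQ1: How can a large-scale Mandarin ASR corpus advance industrial-scale system development and research?
RQ2: What performance gains does a TDNN LF-MMI system with i-vector conditioning deliver under multi-channel acoustic conditions?
RQ3: How do word segmentation, lexicon design (DaCiDian), and flexible phone mapping affect Mandarin ASR recognition accuracy?
RQ4: Can AISHELL-2 facilitate transfer-learning and robustness research for industry-oriented Mandarin ASR pipelines?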

Key Findings

Model	dev_android_CER	dev_ios_CER	dev_mic_CER	test_android_CER	test_ios_CER	test_mic_CER	train_time_hours
Mono	47.08	43.37	47.33	45.40	44.81	44.28	0.5
tri1	26.61	22.94	26.55	26.08	24.79	25.36	1
tri2	24.59	21.47	24.59	23.82	22.69	23.37	2
tri3(LDA+MLLT)	22.24	18.86	22.47	21.00	19.77	21.10	2.5
Chain-TDNN	10.43	9.10	11.84	9.59	8.81	10.87	15
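
The CER columns report character error rate in percent. As a reminder of how such a figure is usually computed, here is a minimal sketch that assumes the standard definition: character-level edit distance divided by the reference length. The function names and the toy strings are illustrative and are not taken from the recipe's scoring scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent; the reference must be non-empty."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Toy example: one substituted character out of four.
example = cer("今天天气", "今天天七")  # 25.0
```

For a whole test set, the usual practice is to sum the edit distances over all utterances and divide by the total number of reference characters, instead of averaging per-utterance rates.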

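The Chain-TDNN system achieves a substantially lower CER than the GMM-HMM systems on every channel: 10.43% (dev_android), 9.10% (dev_ios), 11.84% (dev_mic), 9.59% (test_android), 8.81% (test_ios), and 10.87% (test_mic).
Successive refinements from tri1 to tri3 (LDA+MLLT) lower CER on every channel, reaching 21.00% on test_android and 21.10% on test_mic at a listed training time of 2.5 hours.
The mono, tri1, and tri2 GMM-HMM configurations show steady accuracy gains (for example, test_ios CER falls from 44.81% to 24.79% to 22.69%); the further gains from the LDA+MLLT feature transform and from LF-MMI training appear in the tri3 and Chain-TDNN rows.
AISHELL-2 provides 1000 hours of iOS data, free for academic use, and dev/test data from the iOS, Android, and Mic channels, accompanied by a documented Kaldi recipe for reproducibility.
The iOS sets, which match the training channel, give the lowest CER for every system from tri1 onward (for Chain-TDNN on the test sets: 8.81% iOS versus 9.59% Android and 10.87% Mic), and the results support the feasibility of industrial-scale Mandarin ASR pipelines.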
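
To make the size of the Chain-TDNN gain concrete, the relative CER reduction over tri3 can be computed directly from the results table. This is a small sketch; the list layout and the helper name `relative_reduction` are illustrative choices, and the numbers are copied from the tri3(LDA+MLLT) and Chain-TDNN rows.

```python
# CER (%) copied from the tri3(LDA+MLLT) and Chain-TDNN rows of the table.
SETS = ["dev_android", "dev_ios", "dev_mic",
        "test_android", "test_ios", "test_mic"]
TRI3 = [22.24, 18.86, 22.47, 21.00, 19.77, 21.10]
CHAIN = [10.43, 9.10, 11.84, 9.59, 8.81, 10.87]


def relative_reduction(baseline, improved):
    """Relative CER reduction in percent: (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline


reductions = {
    name: relative_reduction(b, c) for name, b, c in zip(SETS, TRI3, CHAIN)
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative CER reduction")
```

With these numbers the relative reduction falls between roughly 47% and 55% on every set, with the Mic sets at the low end of that range.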


This review was generated by AI and checked by a human editor.