QUICK REVIEW

[Paper Review] Increasing Deep Neural Network Acoustic Model Size for Large Vocabulary Continuous Speech Recognition

Andrew L. Maas, Awni Hannun|arXiv (Cornell University)|Jun 30, 2014

Speech Recognition and Synthesis | References: 11 | Cited by: 20

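One-Sentence Summary

This paper studies how large vocabulary continuous speech recognition performance changes when deep neural network (DNN) acoustic models are scaled up using a distributed GPU architecture. It finds that, given sufficient training data, increasing model size significantly reduces word error rate (WER), most clearly on the 2,000-hour Fisher corpus.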

ABSTRACT

Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Part of the promise of DNNs is their ability to represent increasingly complex functions as the number of DNN parameters increases. This paper investigates the performance of DNN-based hybrid speech recognition systems as DNN model size and training data increase. Using a distributed GPU architecture, we train DNN acoustic models roughly an order of magnitude larger than those typically found in speech recognition systems. DNNs of this scale achieve substantial reductions in final system word error rate despite training with a loss function not tightly coupled to system error rate. However, training word error rate improvements do not translate to large improvements in test set word error rate for systems trained on the 300 hour Switchboard conversational speech corpus. Scaling DNN acoustic model size does prove beneficial on the Fisher 2,000 hour conversational speech corpus. Our results show that with sufficient training data, increasing DNN model size is an effective, direct path to performance improvements. Moreover, even smaller DNNs benefit from a larger training corpus.

Index Terms: speech recognition, neural networks, acoustic modeling

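Research Motivation and Objectives

Study how increasing DNN acoustic model size affects speech recognition performance.
Assess whether a loss function that is not coupled to word error rate still improves system performance as model size grows.
Determine how larger models perform on a data-limited corpus versus a data-rich one.
Examine the interaction between model size and training data size in hybrid DNN-HMM systems.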

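Proposed Method

Train DNN acoustic models on a distributed GPU architecture, making them roughly an order of magnitude larger than those in typical speech recognition systems.
Use a standard DNN training objective (not directly optimized for word error rate) to assess generalization as model capacity increases.
Compare performance on two corpora: the 300-hour Switchboard corpus and the 2,000-hour Fisher conversational speech corpus.
Measure word error rate (WER) on the test set to evaluate system-level performance as model size and training data scale.
Keep the hybrid DNN-HMM architecture for speech recognition, focusing on improvements to the acoustic model.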
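
Since WER is the evaluation metric throughout, a minimal sketch of how it is conventionally computed may help. This is the standard word-level edit distance definition, not code from the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative sketch: tokenizes on whitespace, which real scoring tools
    refine with text normalization rules.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.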

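Experimental Results

Research Questions

RQ1: Does increasing DNN acoustic model size produce a measurable reduction in word error rate for large vocabulary continuous speech recognition?
RQ2: To what extent do WER improvements on the training set translate into gains on the test set?
RQ3: To what extent does the effectiveness of model scaling depend on the amount of available training data?
RQ4: Can large DNNs still achieve better performance when trained with a loss function that is not directly tied to system-level error rate?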

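Key Findings

Scaling up the DNN substantially reduced WER even though the loss function was not directly optimized for WER.
On the 300-hour Switchboard corpus, training-set WER improvements did not translate into large test-set WER gains, indicating that limited data constrains the benefit of scaling.
On the 2,000-hour Fisher corpus, increasing model size measurably improved test-set WER, indicating that the larger data volume lets model scaling pay off.
Even smaller DNNs benefited from the larger training corpus, suggesting that data scaling and model scaling are complementary.
The results confirm that, when training data is sufficient, increasing model size is an effective and direct path to better performance.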
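
To make concrete what a loss "not directly optimized for WER" can look like, here is a minimal sketch of a frame-level cross-entropy objective. The summary above does not name the loss, so cross-entropy over per-frame state targets is an assumption; the function name `frame_cross_entropy` and the toy posteriors are hypothetical.

```python
import math


def frame_cross_entropy(posteriors, targets):
    """Mean negative log-probability of the target state, averaged over frames.

    posteriors: one list of state probabilities per frame (hypothetical DNN output).
    targets:    one target state index per frame (e.g. from a forced alignment).
    """
    if len(posteriors) != len(targets) or not targets:
        raise ValueError("need one target per frame and at least one frame")
    total = 0.0
    for probs, target in zip(posteriors, targets):
        total -= math.log(probs[target])
    return total / len(targets)


# Two frames, two states: the loss scores each frame independently and
# never looks at the decoded word sequence.
toy_posteriors = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
print(frame_cross_entropy(toy_posteriors, toy_targets))
```

An objective of this kind is computed per frame, whereas WER is computed on decoded word sequences, so a lower loss does not guarantee a proportionally lower WER.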

This review was generated by AI and reviewed by human editors.