QUICK REVIEW

[Paper Review] Kernel Approximation Methods for Speech Recognition

Avner May, Alireza Bagheri Garakani|arXiv (Cornell University)|Jan 13, 2017

Speech Recognition and Synthesis | Cited by 43

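One-Sentence Summary

This paper scales kernel methods to acoustic modeling for speech recognition using random Fourier features, and introduces new techniques such as feature selection and early stopping based on frame-level metrics. With these improvements, kernel models perform comparably to deep neural networks (DNNs) on the TIMIT, Broadcast News, and IARPA Babel datasets, significantly narrowing the gap in word/character error rate.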

ABSTRACT

We study large-scale kernel methods for acoustic modeling in speech recognition and compare their performance to deep neural networks (DNNs). We perform experiments on four speech recognition datasets, including the TIMIT and Broadcast News benchmark tasks, and compare these two types of models on frame-level performance metrics (accuracy, cross-entropy), as well as on recognition metrics (word/character error rate). In order to scale kernel methods to these large datasets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection. The method is able to explore a large number of non-linear features while maintaining a compact model more efficiently than existing approaches. Second, we present a number of frame-level metrics which correlate very strongly with recognition performance when computed on the heldout set; we take advantage of these correlations by monitoring these metrics during training in order to decide when to stop learning. This technique can noticeably improve the recognition performance of both DNN and kernel models, while narrowing the gap between them. Additionally, we show that the linear bottleneck method of Sainath et al. (2013) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Together, these three methods dramatically improve the performance of kernel acoustic models, making their performance comparable to DNNs on the tasks we explored.

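Motivation and Objectives

Address the scalability limits of kernel methods on large-scale automatic speech recognition (ASR) tasks.
Narrow the performance gap between kernel-based acoustic models and deep neural networks (DNNs) on standard ASR benchmarks.
Develop practical techniques that improve the efficiency and accuracy of kernel models without sacrificing generalization.
Show that frame-level metrics strongly correlated with recognition error can guide early stopping and thereby improve both kernel and DNN models.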

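Proposed Method

The paper uses the random Fourier feature method of Rahimi and Recht (2007) to approximate kernel functions, which makes training on large-scale ASR datasets efficient.
It proposes a new feature selection algorithm that iteratively selects informative random features based on their learned weights, reducing model size and training time.
It introduces frame-level metrics that correlate strongly with token error rate (TER) and monitors them during training to decide when to stop.
It applies the linear bottleneck technique of Sainath et al. (2013a) to kernel models, improving performance and making the models more compact.
Building on the feature selection procedure, it proposes a new kernel function that performs non-linear feature selection at the input layer.
The overall approach combines random feature approximation, feature selection, and metric-based early stopping to strengthen kernel models.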
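
The random Fourier feature approximation named in the first item above can be illustrated with a minimal sketch. It assumes a Gaussian kernel and NumPy; the function names `make_rff`, `rff_transform`, and `gaussian_kernel`, the bandwidth, and the toy data are illustrative choices, not taken from the paper.

```python
import numpy as np

def make_rff(input_dim, num_features, bandwidth, rng):
    """Draw parameters of a random Fourier feature map (hypothetical helper).

    For k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), the frequencies are
    sampled from a Gaussian with standard deviation 1 / bandwidth.
    """
    W = rng.normal(scale=1.0 / bandwidth, size=(input_dim, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return W, b

def rff_transform(X, W, b):
    """Map inputs X (n, d) to features Z (n, D) so that Z @ Z.T approximates the kernel matrix."""
    num_features = W.shape[1]
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def gaussian_kernel(X, Y, bandwidth):
    """Exact Gaussian kernel matrix, used here only to check the approximation."""
    sq_dist = (np.sum(X**2, axis=1)[:, None]
               + np.sum(Y**2, axis=1)[None, :]
               - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

# Toy data (assumption): 5 random vectors of dimension 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))
W, b = make_rff(input_dim=10, num_features=20000, bandwidth=3.0, rng=rng)
Z = rff_transform(X, W, b)

# Largest entry-wise gap between the approximate and exact kernel matrices.
max_err = np.max(np.abs(Z @ Z.T - gaussian_kernel(X, X, 3.0)))
print(f"max abs approximation error: {max_err:.4f}")
```

Once the inputs are mapped to `Z`, a linear model trained on `Z` plays the role of the kernel model, which is what lets training scale with the number of random features rather than with the number of training frames.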

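Experimental Results

Research Questions

RQ1: Can kernel methods be scaled effectively to large-scale ASR tasks through random feature approximation?
RQ2: Can feature selection over random features reduce model size and training time while maintaining or improving performance?
RQ3: Do frame-level metrics that correlate strongly with recognition error (token error rate, TER) give better early stopping than the standard cross-entropy loss?
RQ4: Does the linear bottleneck method improve kernel acoustic models as it does DNNs?
RQ5: To what extent can kernel models match DNN performance on standard ASR benchmarks?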
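
The kind of iterative feature selection that RQ2 refers to could be sketched as follows. This is a rough illustration under stated assumptions, not the paper's algorithm: a closed-form ridge regression on one-hot targets stands in for the trained acoustic model, features are scored by the L2 norm of their weight rows, and `select_random_features`, `keep_frac`, and the toy data are hypothetical.

```python
import numpy as np

def rff(X, W, b):
    """Random Fourier features for a Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def select_random_features(X, Y, pool_size, rounds, keep_frac, bandwidth, rng, ridge=1e-3):
    """Iteratively keep the random features with the largest learned weight norms.

    Assumption: ridge regression replaces the real training procedure, and the
    pool is refilled with fresh random features after every round.
    """
    d = X.shape[1]
    n_keep = max(1, int(keep_frac * pool_size))
    n_new = pool_size - n_keep
    W = rng.normal(scale=1.0 / bandwidth, size=(d, pool_size))
    b = rng.uniform(0.0, 2.0 * np.pi, size=pool_size)
    for _ in range(rounds):
        Z = rff(X, W, b)
        # Fit weights Theta with shape (pool_size, num_classes) on the current pool.
        Theta = np.linalg.solve(Z.T @ Z + ridge * np.eye(pool_size), Z.T @ Y)
        # Score each feature by the L2 norm of its row of weights.
        scores = np.linalg.norm(Theta, axis=1)
        keep = np.argsort(scores)[-n_keep:]
        # Keep the top-scoring features and replace the rest with fresh draws.
        W = np.hstack([W[:, keep], rng.normal(scale=1.0 / bandwidth, size=(d, n_new))])
        b = np.concatenate([b[keep], rng.uniform(0.0, 2.0 * np.pi, size=n_new)])
    return W, b

# Toy two-class problem (assumption): the label depends on the sign of x0 * x1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)
Y = np.eye(2)[labels]
W, b = select_random_features(X, Y, pool_size=64, rounds=3,
                              keep_frac=0.5, bandwidth=2.0, rng=rng)
```

The point of the design is that the model never holds more than `pool_size` features at once, yet over several rounds it gets to examine many more candidates than a single random draw of that size would provide.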

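Key Findings

On TIMIT, the best kernel model reaches a word error rate (WER) of 31.0%, matching the 31.0% of the best DNN.
On the Bengali (IARPA-babel103b) dataset, the kernel model reaches a character error rate (CER) of 30.0%, on par with the best DNN's 30.0%.
On the 50-hour Broadcast News (BN-50) subset, the kernel model's WER is 50.0%, versus 49.0% for the best DNN.
On the Cantonese (IARPA-babel101) dataset, the kernel model's CER is 44.0%, identical to the best DNN result.
The combination of feature selection, early stopping based on frame-level metrics, and the linear bottleneck technique narrows the WER gap between kernel and DNN models by 20% on average across datasets.
The frame-level metrics used for early stopping noticeably improve the token error rate (TER) of both kernel and DNN models, showing that they help align the training objective with the recognition objective.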
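
The metric-based early stopping described above can be sketched generically. The snippet assumes a lower-is-better held-out metric recorded once per epoch and a simple patience rule; the function name `early_stopping_epoch`, the `patience` parameter, and the metric values are hypothetical, and the paper's specific metrics and stopping schedule are not reproduced here.

```python
def early_stopping_epoch(heldout_metric_per_epoch, patience=2):
    """Return (stop_epoch, best_epoch) for a lower-is-better held-out metric.

    Training stops once the metric has failed to improve for `patience`
    consecutive epochs; the model from `best_epoch` would be kept.
    """
    best_value = float("inf")
    best_epoch = -1
    for epoch, value in enumerate(heldout_metric_per_epoch):
        if value < best_value:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(heldout_metric_per_epoch) - 1, best_epoch

# Hypothetical held-out metric values, one per epoch (not from the paper).
history = [2.10, 1.85, 1.70, 1.72, 1.74, 1.69]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=2)
print(stop_epoch, best_epoch)  # prints "4 2"
```

The choice of which metric to feed into such a rule is the substance of the finding: a metric that tracks recognition error more closely than cross-entropy should stop training at a point that is better for the final error rate.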

This review was generated by AI and checked by a human editor.