QUICK REVIEW

[Paper Review] Achieving Human Parity in Conversational Speech Recognition

Wayne Xiong, Jasha Droppo|arXiv (Cornell University)|Oct 17, 2016

Speech Recognition and Synthesis | References: 56 | Cited by: 478

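ONE-SENTENCE SUMMARY

The paper measures the human transcription error rate on the NIST 2000 CTS test set and shows that CNN- and LSTM-based acoustic models, combined with lattice-free MMI (LF-MMI) training, advanced language models, and system combination, match or slightly beat the human word error rate (WER) on the Switchboard and CallHome tasks.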

ABSTRACT

Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and LSTM acoustic model architectures, combined with a novel spatial smoothing method and lattice-free MMI acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination.

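RESEARCH MOTIVATION AND GOALS

Quantify the human transcription error rate on the Switchboard and CallHome portions of the NIST 2000 test set.
Develop and optimize CNN and LSTM acoustic models for conversational speech recognition.
Combine lattice-free MMI training with advanced language models to reduce WER.
Evaluate system combination methods to maximize complementary gains.
Compare machine performance against professional human transcription on the same test set.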
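Both the human and the machine numbers in this review are word error rates. As a minimal sketch of how that metric is usually computed (word-level edit distance divided by the reference length), the function below may help; `word_error_rate` is a hypothetical helper name, and it assumes plain whitespace tokenization rather than the text normalization rules of an official scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("talked" -> "talk") and one deletion ("the")
# against a five-word reference.
print(word_error_rate("we talked about the weather", "we talk about weather"))  # 0.4
```

Because the denominator is the reference length, insertions can push WER above 100%, and the same function applies unchanged to a human transcript or a recognizer output scored against the same reference.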

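PROPOSED METHOD

Train CNN variants (VGG, ResNet, LACE) and BLSTM/LSTM acoustic models with i-vector speaker adaptation.
Apply spatial smoothing to the acoustic model's activations as a regularizer to improve BLSTM performance.
Perform lattice-free MMI (LF-MMI) training using a mixed-history acoustic unit language model.
Rescore with a large unpruned N-gram LM and neural LMs (RNN-LM and LSTM-LM), including forward and backward models.
Combine systems via confusion networks, using greedy selection and weight optimization to maximize complementary gains.
Use CNTK for scalable multi-GPU training, with 1-bit SGD for efficient distributed optimization.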
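The 1-bit SGD item can be illustrated with a toy sketch of the general idea behind that name: each gradient value is sent as a single bit, and the quantization error is carried into the next step so that nothing is lost in the long run. This is not CNTK's implementation. The function names are hypothetical, and decoding each bit to the mean of its sign group is an assumption made for illustration.

```python
def quantize_one_bit(grad, residual):
    """Quantize a gradient to one bit per element, with error feedback.

    Returns (bits, pos_value, neg_value, new_residual).
    """
    # Add the quantization error left over from the previous step.
    corrected = [g + r for g, r in zip(grad, residual)]
    # The single bit transmitted per element: is the value non-negative?
    bits = [c >= 0.0 for c in corrected]
    pos = [c for c, b in zip(corrected, bits) if b]
    neg = [c for c, b in zip(corrected, bits) if not b]
    # Assumed decoding rule: one shared value per sign group.
    pos_value = sum(pos) / len(pos) if pos else 0.0
    neg_value = sum(neg) / len(neg) if neg else 0.0
    decoded = [pos_value if b else neg_value for b in bits]
    # Whatever the bits failed to convey is fed back into the next step.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, pos_value, neg_value, new_residual


def dequantize_one_bit(bits, pos_value, neg_value):
    """Rebuild the approximate gradient on the receiving worker."""
    return [pos_value if b else neg_value for b in bits]


residual = [0.0, 0.0, 0.0, 0.0]
bits, pos_value, neg_value, residual = quantize_one_bit(
    [0.5, -1.0, 0.25, -0.75], residual
)
print(bits, pos_value, neg_value)  # [True, False, True, False] 0.375 -0.875
```

Over many steps, the sum of the decoded gradients plus the final residual equals the sum of the true gradients, which is why the error feedback matters: the compression delays information rather than discarding it.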

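EXPERIMENTAL RESULTS

Research Questions

RQ1: What is the human transcription error rate on the Switchboard (SWB) and CallHome (CH) portions of NIST eval 2000?
RQ2: Can CNN/LSTM-based acoustic models, LF-MMI training, i-vector adaptation, and advanced language modeling surpass human-level performance on these CTS benchmarks?
RQ3: How much do spatial smoothing, i-vector conditioning, and lattice-free training each contribute to WER reduction?
RQ4: How do system combination and LM rescoring affect overall performance?
RQ5: How closely can machine performance approach human performance on conversational telephone speech (CTS) when multiple neural network architectures and rescoring strategies are used?

Key Findings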

Word error rates (%) on NIST 2000 with the N-gram LM; RNN-LM and LSTM-LM rescoring results are not listed.

Model	CH WER (%)	SWB WER (%)
ResNet, 300h training	19.2	10.0
ResNet	14.8	8.6
ResNet, GMM alignments	15.3	8.8
VGG	15.7	9.1
VGG + ResNet	14.5	8.4
LACE	15.0	8.4
BLSTM	16.5	9.0
BLSTM, spatial smoothing	15.4	8.6
BLSTM, spatial smoothing, 27k senones	15.3	8.3
BLSTM, spatial smoothing, 27k senones, alternate dictionary	14.9	8.3
BLSTM system combination	13.2	7.3
Full system combination	13.0	7.3
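Gains between rows of the table are easiest to compare as relative WER reductions. The short sketch below works this out for the plain BLSTM row versus the BLSTM row with spatial smoothing; `relative_reduction` is a hypothetical helper, and the inputs are copied from the table.

```python
def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction, in percent of the baseline WER."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# BLSTM vs. BLSTM with spatial smoothing, values taken from the table.
ch_gain = relative_reduction(16.5, 15.4)
swb_gain = relative_reduction(9.0, 8.6)
print(round(ch_gain, 1), round(swb_gain, 1))  # 6.7 4.4
```

The same absolute change counts for more when the baseline is already low, so relative figures are the fairer way to compare SWB and CH.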

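Professional transcribers have an error rate of 5.9% on the Switchboard portion of NIST 2000 and 11.3% on the CallHome portion.
The automated system reaches a WER of 5.8% on Switchboard and 11.0% on CallHome, slightly ahead of human performance.
Spatial smoothing gives a relative WER reduction of roughly 5–10% in early BLSTM experiments.
i-vector speaker adaptation together with LF-MMI training yields a further 7–10% relative WER reduction across models.
The final system combination across multiple BLSTM variants and acoustic models reaches 11.0% on CH and 5.8% on SWB, matching or exceeding the human benchmark.
The oracle WER of the 500-best ResNet hypotheses is 2.7% on SWB and 4.9% on CH, indicating further room for improvement through better decoding and search.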
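The oracle WER in the last finding is the error rate you would get if, for every utterance, the best hypothesis in its N-best list were always picked. A minimal sketch under that reading follows; `word_errors` and `oracle_wer` are hypothetical helpers, and whitespace tokenization is assumed.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            curr.append(min(
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
            ))
        prev = curr
    return prev[-1]


def oracle_wer(references, nbest_lists):
    """WER when the lowest-error hypothesis is chosen for each utterance."""
    total_errors = 0
    total_words = 0
    for reference, nbest in zip(references, nbest_lists):
        ref_words = reference.split()
        total_errors += min(word_errors(ref_words, h.split()) for h in nbest)
        total_words += len(ref_words)
    return total_errors / total_words


references = ["how are you", "fine thanks"]
nbest_lists = [
    ["how is you", "how are you"],   # the second hypothesis is error-free
    ["find thanks", "fine thank"],   # both hypotheses contain one error
]
print(oracle_wer(references, nbest_lists))  # 0.2
```

In this toy example the first-ranked hypotheses alone would score 2 errors over 5 words (0.4), so the gap to the oracle value of 0.2 is what better ranking of the existing hypotheses could recover at most.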


This review was generated by AI and reviewed by a human editor.