QUICK REVIEW

[Paper Review] Attention-Based End-to-End Speech Recognition in Mandarin.

Changhao Shan, Junbo Zhang|arXiv (Cornell University)|Jul 22, 2017

Speech Recognition and Synthesis | References: 18 | Cited by: 6

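One-Sentence Summary

This paper proposes an attention-based end-to-end model for Mandarin speech recognition that uses character embedding, together with training techniques such as L2 regularization, weight noise, and frame skipping, to address the challenges posed by Mandarin's logographic writing system and large vocabulary. On the MiTV voice search dataset, combined with a trigram language model, the model achieves a character error rate (CER) of 2.81% and a sentence error rate (SER) of 5.77%.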

ABSTRACT

Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.
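The abstract names L2 regularization and Gaussian weight noise as training tricks but gives no formulas or hyperparameters. The following is a minimal sketch of what the two terms usually mean, not the authors' implementation. The function names, the regularization strength, and the noise standard deviation are all hypothetical values chosen for illustration.

```python
import random


def l2_penalty(weights, strength=1e-5):
    """L2 regularization term: strength times the sum of squared weights.

    `strength` is a hypothetical value, not one reported in the paper.
    """
    return strength * sum(w * w for w in weights)


def add_gaussian_weight_noise(weights, std=0.05, rng=None):
    """Return a copy of the weights perturbed by zero-mean Gaussian noise.

    The original list is left untouched. `std` is a hypothetical value.
    """
    rng = rng or random.Random()
    return [w + rng.gauss(0.0, std) for w in weights]


if __name__ == "__main__":
    weights = [0.5, -1.0, 2.0]
    print("L2 penalty:", l2_penalty(weights))
    print("Noisy weights:", add_gaussian_weight_noise(weights, rng=random.Random(0)))
```

In a typical training loop the penalty would be added to the loss, and the noisy copy of the weights would be used for the forward and backward pass while the clean weights receive the update.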

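Research Motivation and Objectives

Address the difficulty of applying end-to-end attention-based models to Mandarin speech recognition, which stems from Mandarin's logographic writing system and large vocabulary.
Overcome the training challenges created by the conditional dependency in the attention mechanism.
Improve model robustness and convergence through effective training techniques.
Achieve state-of-the-art performance on a Mandarin voice search task with character-level output.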

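Proposed Method

Use character embedding to handle the large vocabulary inherent in Mandarin's logographic writing system.
Apply L2 regularization and Gaussian weight noise to stabilize training and reduce overfitting.
Use frame skipping to lower the computational load and improve training efficiency.
Compare two attention mechanisms and apply attention smoothing to better model long-range context.
Integrate a trigram language model to further improve recognition accuracy.
Use an encoder-decoder architecture with attention to map speech features directly to character sequences.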
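Two of the techniques listed above, frame skipping and attention smoothing, can be illustrated in a few lines. This summary does not say how either is implemented, so the sketch below rests on assumptions: frame skipping is shown as plain subsampling with a hypothetical `keep_every` rate, and attention smoothing is shown in one common formulation that replaces the exponential in the softmax with a logistic sigmoid, which flattens the weight distribution. The function names are illustrative.

```python
import math


def skip_frames(frames, keep_every=2):
    """Subsample the input by keeping every `keep_every`-th frame.

    The rate is a hypothetical choice, not a value taken from the paper.
    """
    return frames[::keep_every]


def softmax_attention(scores):
    """Standard attention weights: normalized exponentials of the scores."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_attention(scores):
    """Smoothed attention weights: normalized logistic sigmoids of the scores.

    Assumed formulation; the sigmoid is bounded, so no single frame can
    dominate as strongly as it can under the exponential.
    """
    sig = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sig)
    return [v / total for v in sig]


if __name__ == "__main__":
    scores = [4.0, 1.0, 0.5, 0.0]
    print("softmax: ", softmax_attention(scores))
    print("smoothed:", smoothed_attention(scores))
    print("kept frames:", skip_frames(list(range(10)), keep_every=3))
```

On the toy scores, the smoothed weights spread attention across more frames than the softmax weights do, which is the behavior the "cover long context" goal calls for.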

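Experimental Results

Research Questions

RQ1: Can an attention-based end-to-end model recognize Mandarin speech effectively despite the complex writing system and very large vocabulary?
RQ2: How do training techniques such as weight noise and frame skipping affect model convergence and performance?
RQ3: How do different attention mechanisms compare in modeling long-range dependencies in Mandarin speech recognition?
RQ4: How much does attention smoothing improve context modeling in the attention mechanism?
RQ5: How much does adding a language model reduce error rates in end-to-end Mandarin speech recognition?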

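Key Findings

On the MiTV voice search dataset, the model achieves a character error rate (CER) of 3.58% without a language model.
Adding a trigram language model lowers the CER to 2.81%, a relative reduction of about 21.5%, which demonstrates the value of language modeling.
The sentence error rate (SER) is 7.43% without a language model and drops to 5.77% with the trigram language model, a relative reduction of about 22.3%.
The combination of character embedding, L2 regularization, and frame skipping markedly improves training stability and model performance.
Attention smoothing lets the attention mechanism model long-range context dependencies more effectively.
Through its architectural and training choices, the model handles the challenges posed by Mandarin's logographic writing system and large vocabulary.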
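The CER and SER figures above are edit-distance-based metrics. As a rough illustration of how such numbers are typically computed, here is a minimal sketch using the standard Levenshtein distance. The function names and the toy transcripts are made up for this example and are not data from the paper. The only figures taken from the source are the reported error rates used in the relative-reduction check at the end.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def cer_and_ser(references, hypotheses):
    """Return (CER, SER) as fractions over paired reference/hypothesis strings."""
    pairs = list(zip(references, hypotheses))
    char_errors = sum(edit_distance(r, h) for r, h in pairs)
    total_chars = sum(len(r) for r in references)
    wrong_sentences = sum(1 for r, h in pairs if r != h)
    return char_errors / total_chars, wrong_sentences / len(references)


def relative_reduction(before, after):
    """Relative error reduction when going from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy transcripts invented for illustration only.
    refs = ["打开电视", "播放音乐"]
    hyps = ["打开电视", "播放音月"]
    print(cer_and_ser(refs, hyps))
    # Reported CER without and with the trigram language model.
    print(relative_reduction(3.58, 2.81))
```

Treating each Chinese character as one token makes CER the natural unit for Mandarin, since there are no whitespace word boundaries to count against.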


This review was generated by AI and checked by human editors.