QUICK REVIEW

[Paper Review] Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Ying Zhang, Mohammad Pezeshki|arXiv (Cornell University)|Jan 10, 2017

Speech Recognition and Synthesis|References: 27|Cited by: 72

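One-Sentence Summary

This paper proposes an end-to-end CNN-CTC framework for speech recognition that removes recurrent layers, achieves competitive phoneme recognition on TIMIT, and trains faster than LSTMs.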

ABSTRACT

Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.

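Motivation and Objectives

Advance end-to-end speech recognition without recurrent networks by combining CNNs with CTC.
Develop a deep CNN architecture that captures temporal dependencies through stacked convolutions and context windows.
Evaluate on the TIMIT phoneme recognition task and compare against LSTM-based baselines.
Identify the architectural factors (depth, filter size, activation function) that affect accuracy and training efficiency.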
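To make the idea of capturing temporal context through stacked convolutions concrete, the sketch below computes how many input frames a single output frame can see after several convolutional layers. The function name, the kernel widths, and the strides are illustrative assumptions, not values confirmed by the paper; the 10-layer depth only echoes the "10L" in the model name reported in the findings.

```python
def temporal_receptive_field(kernel_sizes, strides=None):
    """Return how many input frames one output frame depends on.

    kernel_sizes: time-axis kernel width of each stacked conv layer.
    strides: time-axis stride of each layer (defaults to 1 everywhere).
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    if len(strides) != len(kernel_sizes):
        raise ValueError("need exactly one stride per layer")

    field = 1  # before any convolution, an output sees one frame
    jump = 1   # spacing, in input frames, between adjacent outputs
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Hypothetical stack: 10 layers, time-axis kernel width 5, stride 1.
print(temporal_receptive_field([5] * 10))  # 41
# Same depth with an assumed kernel width of 3.
print(temporal_receptive_field([3] * 10))  # 21
```

With stride 1 the context grows linearly with depth, which is one way to read the claim that a sufficiently deep CNN can model temporal correlations without recurrence.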

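Proposed Method

Design a deep 2D convolutional neural network that operates on spectral acoustic features, with pooling along the frequency axis.
Apply 2D convolutions across time and frequency, using padding to preserve the sequence length.
Experiment with ReLU, PReLU, and Maxout activations, and with max pooling after the first convolutional layer.
Attach a CTC layer on top to produce output sequences without explicit alignments.
Train with Adam and fine-tune with SGD, using dropout and L2 regularization.
Use best-path decoding on the CTC outputs at test time.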
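Three of the steps above can be sketched in plain Python: pooling along frequency only, a maxout activation, and best-path CTC decoding. This is a minimal illustration on nested lists, not the authors' implementation. The function names, the pool size, the number of maxout pieces, and the choice of index 0 for the blank label are all assumptions made for the example.

```python
def max_pool_frequency(feature_map, pool=2):
    """Max-pool a [time][frequency] map along frequency only.

    The number of time frames is left unchanged.
    """
    return [
        [max(frame[i:i + pool]) for i in range(0, len(frame) - pool + 1, pool)]
        for frame in feature_map
    ]


def maxout(channels, pieces=2):
    """Maxout: take the maximum over consecutive groups of channel values."""
    if len(channels) % pieces != 0:
        raise ValueError("channel count must be divisible by pieces")
    return [max(channels[i:i + pieces]) for i in range(0, len(channels), pieces)]


def ctc_best_path(frame_scores, blank=0):
    """Best-path decoding: per-frame argmax, merge repeats, drop blanks."""
    decoded = []
    previous = None
    for scores in frame_scores:
        label = max(range(len(scores)), key=scores.__getitem__)
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Two frames, four frequency bins: time length stays 2 after pooling.
print(max_pool_frequency([[1, 3, 2, 5], [4, 0, 6, 1]]))  # [[3, 5], [4, 6]]
print(maxout([0.2, -1.0, 0.5, 0.7]))                     # [0.2, 0.7]

# Toy per-frame scores over {blank, label 1, label 2}.
scores = [
    [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7], [0.1, 0.1, 0.8], [0.9, 0.05, 0.05],
]
print(ctc_best_path(scores))  # [1, 1, 2]
```

In the decoding example, the blank frame between the two runs of label 1 is what lets the same label appear twice in a row in the output.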

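Experimental Results

Research Questions

RQ1: Can a deep CNN + CTC model with no recurrent layers achieve competitive phoneme recognition on TIMIT?
RQ2: How do architectural choices (depth, filter size, activation function) affect performance and training efficiency?
RQ3: On a phoneme-level task, does CNN-CTC train faster and more stably than RNN/LSTM-based end-to-end approaches?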

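Key Findings

The CNN-CTC model reaches an 18.2% phoneme error rate (PER) on the TIMIT core test set, competitive with LSTM and transducer baselines.
Deeper architectures and larger filter sizes improve performance; the best model, CNN-(3,5)-10L-maxout, reaches a test PER of 18.2% and the best development-set PER of 16.7%.
Maxout activations outperform ReLU and PReLU in this setting.
On TIMIT, the CNN model trains roughly 2.5 times faster than a comparable LSTM model (with no additional optimization).
Pooling only along the frequency axis, after the first layer, helps reduce spectral variation without sacrificing temporal resolution.
Regularization (dropout, weight decay) is important for generalization on small datasets such as TIMIT.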
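For readers unfamiliar with the metric behind these numbers, the sketch below shows how a phoneme error rate is commonly computed: the total edit distance between reference and predicted phoneme sequences, divided by the total number of reference phonemes. The function names and the toy phoneme labels are illustrative assumptions; this is not the scoring script used in the paper.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance counting substitutions, insertions, deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            cost = 0 if ref_symbol == hyp_symbol else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the reference
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references contain no phonemes")
    return errors / total


# Toy utterance: one substitution (ae -> ah) and one deletion (final sil).
reference = ["sil", "k", "ae", "t", "sil"]
hypothesis = ["sil", "k", "ah", "t"]
print(phoneme_error_rate([reference], [hypothesis]))  # 0.4
```

Under this definition, an 18.2% PER corresponds to roughly 18 edit operations per 100 reference phonemes.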


This review was generated by AI and reviewed by human editors.