QUICK REVIEW

[Paper Review] Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition

Jaeyoung Kim, Mostafa El‐Khamy|arXiv (Cornell University)|Jan 10, 2017

Speech Recognition and Synthesis | Cited by 34

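One-Sentence Summary

This paper proposes residual LSTM, a deep recurrent architecture that adds spatial shortcut connections between output layers to make deep LSTMs easier to train for distant speech recognition. By reusing the LSTM output gate and projection matrix instead of adding new gates, the method cuts network parameters by more than 10%, and a 10-layer network reaches the lowest word error rate (WER) among the compared models on the AMI SDM corpus, 41.0%. It outperforms plain LSTM and highway LSTM, both of which degrade as depth increases.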

ABSTRACT

In this paper, a novel architecture for a deep recurrent neural network, residual LSTM is introduced. A plain LSTM has an internal memory cell that can learn long term dependencies of sequential data. It also provides a temporal shortcut path to avoid vanishing or exploding gradients in the temporal domain. The residual LSTM provides an additional spatial shortcut path from lower layers for efficient training of deep networks with multiple LSTM layers. Compared with the previous work, highway LSTM, residual LSTM separates a spatial shortcut path with temporal one by using output layers, which can help to avoid a conflict between spatial and temporal-domain gradient flows. Furthermore, residual LSTM reuses the output projection matrix and the output gate of LSTM to control the spatial information flow instead of additional gate networks, which effectively reduces more than 10% of network parameters. An experiment for distant speech recognition on the AMI SDM corpus shows that 10-layer plain and highway LSTM networks presented 13.7% and 6.2% increase in WER over 3-layer baselines, respectively. On the contrary, 10-layer residual LSTM networks provided the lowest WER 41.0%, which corresponds to 3.3% and 2.8% WER reduction over plain and highway LSTM networks, respectively.

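Motivation and Objectives

Address the difficulty of training deep recurrent networks caused by vanishing or exploding gradients in both the temporal and spatial domains.
Improve deep LSTM performance on distant speech recognition, where long-term dependencies and model depth both matter.
Reduce model complexity by removing the redundant gate networks used in the highway LSTM architecture.
Let deeper networks (for example, 10 layers) generalize better and avoid the degradation observed in plain and highway LSTMs.
Examine whether reusing existing LSTM components (the output gate and projection matrix) for the shortcut path improves training stability and efficiency.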

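Proposed Method

Introduce a spatial shortcut path between adjacent output layers rather than through the internal memory cell, which decouples the spatial and temporal gradient flows.
Reuse the existing LSTM output gate and projection matrix to control information flow along the shortcut path, avoiding the extra parameters of additional gate networks.
Design the residual connection so that each layer learns a residual mapping relative to the shortcut, which simplifies optimization.
Apply the residual connection at the output-layer level, giving an identity-like bypass for information without new gate networks.
Keep the standard LSTM cell structure but change the shortcut connection logic so that gradients keep flowing along the depth dimension.
Train the model with standard backpropagation; the residual connections keep training stable even for 10-layer networks.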
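
The method items above can be sketched as a minimal NumPy forward pass. This is an illustrative sketch under stated assumptions, not the authors' implementation: this summary gives no equations, so the class name `ResidualLSTMLayer`, the helper `run_stack`, the weight names, the output gate sized to the projected output, and the identity shortcut for matching dimensions are all assumptions. Peephole connections and training code are omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class ResidualLSTMLayer:
    """Illustrative residual LSTM layer; all names here are hypothetical."""

    def __init__(self, input_dim, cell_dim, output_dim, rng):
        self.cell_dim, self.output_dim = cell_dim, output_dim
        concat_dim = input_dim + output_dim  # gates see [x_t, h_{t-1}]

        def init(rows, cols):
            return rng.normal(0.0, 0.1, size=(rows, cols))

        self.W_i, self.b_i = init(cell_dim, concat_dim), np.zeros(cell_dim)
        self.W_f, self.b_f = init(cell_dim, concat_dim), np.zeros(cell_dim)
        self.W_c, self.b_c = init(cell_dim, concat_dim), np.zeros(cell_dim)
        # Assumption: the output gate is sized to the projected output,
        # because here it gates the sum of projection and shortcut.
        self.W_o, self.b_o = init(output_dim, concat_dim), np.zeros(output_dim)
        self.W_p = init(output_dim, cell_dim)  # output projection matrix
        # Assumption: identity shortcut when dimensions match, else a learned map.
        self.W_s = None if input_dim == output_dim else init(output_dim, input_dim)

    def step(self, x, h_prev, c_prev):
        z = np.concatenate([x, h_prev])
        i = sigmoid(self.W_i @ z + self.b_i)
        f = sigmoid(self.W_f @ z + self.b_f)
        # Temporal path: the cell state never receives the spatial shortcut.
        c = f * c_prev + i * np.tanh(self.W_c @ z + self.b_c)
        o = sigmoid(self.W_o @ z + self.b_o)
        m = self.W_p @ np.tanh(c)  # projected cell output
        shortcut = x if self.W_s is None else self.W_s @ x
        # Spatial path: the reused output gate scales projection plus shortcut.
        h = o * (m + shortcut)
        return h, c


def run_stack(layers, inputs):
    """Run a stack of layers over a (time, feature) array, one frame at a time."""
    states = [(np.zeros(l.output_dim), np.zeros(l.cell_dim)) for l in layers]
    outputs = []
    for x in inputs:
        for k, layer in enumerate(layers):
            h, c = layer.step(x, *states[k])
            states[k] = (h, c)
            x = h  # this layer's output is the next layer's input
        outputs.append(x)
    return np.stack(outputs)


rng = np.random.default_rng(0)
layers = [ResidualLSTMLayer(8, 16, 8, rng) for _ in range(10)]
out = run_stack(layers, rng.normal(size=(5, 8)))
print(out.shape)  # (5, 8)
```

In this sketch the shortcut enters only at the layer output, so the cell state `c` carries the temporal path alone. No gate beyond the standard LSTM gates is created, which mirrors the parameter-saving argument in the list above.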

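Experimental Results

Research Questions

RQ1: Do residual connections between output layers improve training stability and performance in deep recurrent networks for distant speech recognition?
RQ2: Does reusing existing LSTM components (the output gate and projection matrix) for the shortcut path reduce model complexity while maintaining or improving performance?
RQ3: How does a deep residual LSTM compare with plain LSTM and highway LSTM in WER and training convergence?
RQ4: Does residual LSTM avoid the degradation that plain LSTM and highway LSTM show as depth increases?
RQ5: Does the residual architecture generalize better with deeper networks, especially when more training data is available?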

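Key Findings

The 10-layer residual LSTM achieved the lowest WER on the AMI SDM corpus, 41.0%, a 3.3% relative reduction over the 3-layer plain LSTM baseline.
The 10-layer residual LSTM reduced WER by 2.2% relative to the 3-layer baseline, whereas the 10-layer plain LSTM degraded by 13.7% in non-overlapped WER.
The 10-layer highway LSTM increased WER by 6.2% over its 3-layer baseline, showing that added depth caused training degradation.
By reusing existing gates instead of adding new ones, residual LSTM has more than 10% fewer network parameters than highway LSTM.
When trained on combined SDM and IHM data, the 10-layer residual LSTM reached 39.3% WER, 3.1% lower than the best 5-layer highway LSTM.
As depth increased, the residual LSTM's cross-entropy loss on validation data improved, indicating better generalization without training degradation; this was not the case for plain LSTM or highway LSTM.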
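
The relative WER figures above can be checked with a few lines of arithmetic. The sketch below assumes that the 3.3% and 2.8% reductions quoted in the abstract are relative changes; the baseline values it prints are back-computed for illustration and are not reported in this summary. The function names are hypothetical.

```python
def relative_change_pct(baseline_wer, new_wer):
    """Relative WER change in percent; negative values are improvements."""
    return 100.0 * (new_wer - baseline_wer) / baseline_wer


def implied_baseline(new_wer, relative_reduction_pct):
    """Back out the baseline WER implied by a relative reduction."""
    return new_wer / (1.0 - relative_reduction_pct / 100.0)


residual_wer = 41.0  # reported 10-layer residual LSTM WER on AMI SDM
# Assumption: 3.3% and 2.8% are relative reductions; the baselines below
# are back-computed for illustration and are not reported in this summary.
plain_baseline = implied_baseline(residual_wer, 3.3)
highway_baseline = implied_baseline(residual_wer, 2.8)
print(round(plain_baseline, 1), round(highway_baseline, 1))  # 42.4 42.2
print(round(relative_change_pct(plain_baseline, residual_wer), 1))  # -3.3
```

Under that assumption, the gaps between residual LSTM and the two baselines are roughly one WER point in absolute terms.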

This review was generated by AI and checked by human editors.