QUICK REVIEW

[Paper Review] Adding Gradient Noise Improves Learning for Very Deep Networks

Arvind Neelakantan, Luke Vilnis|arXiv (Cornell University)|Nov 21, 2015

Domain Adaptation and Few-Shot Learning | References: 31 | Cited by: 265

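One-Sentence Summary

This paper proposes adding annealed Gaussian noise to the gradient during stochastic gradient descent to improve the training of very deep neural networks. By encouraging exploration of the parameter space, the method makes it possible to train a 20-layer fully connected network from a poor initialization, yields a 72% relative reduction in error on a question-answering task, and doubles the number of accurate binary multiplication models across 7,000 random restarts.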

ABSTRACT

Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than more basic architectures. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we discuss a low-overhead and easy-to-implement technique of adding gradient noise which we find to be surprisingly effective when training these very deep architectures. The technique not only helps to avoid overfitting, but also can result in lower training loss. This method alone allows a fully-connected 20-layer deep network to be trained with standard gradient descent, even starting from a poor initialization. We see consistent improvements for many complex models, including a 72% relative reduction in error rate over a carefully-tuned baseline on a challenging question-answering task, and a doubling of the number of accurate binary multiplication models learned across 7,000 random restarts. We encourage further application of this technique to additional complex modern architectures.

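Research Motivation and Goals

Address the optimization challenges of very deep and complex neural architectures such as Neural Turing Machines and Memory Networks.
Overcome the difficulty of training deep feedforward and recurrent networks from poor initializations.
Improve generalization and robustness across random initializations and hyperparameter settings.
Explore a low-overhead, easy-to-implement technique that improves training without changing the network architecture.
Demonstrate consistent gains across a range of complex models, including algorithm learning and question answering.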

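Proposed Method

Add zero-mean Gaussian noise to the gradient during backpropagation, after gradient clipping.
Anneal the noise variance so that it decays over training time according to a predefined schedule.
Apply the noisy gradient in the standard stochastic gradient descent update step.
Keep the same optimization hyperparameters (e.g., learning rate, batch size) as the baseline model.
Implement the noise injection in a single line of code, which makes the technique practical and easy to deploy.
Experiments also use the Adam optimizer, with noise applied after gradient clipping to stabilize updates.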
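The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes the decay schedule σ_t² = η / (1 + t)^γ from the original paper, with η = 0.3 and γ = 0.55 as illustrative values that this summary does not state. The function names (`noise_std`, `clip_by_norm`, `noisy_sgd_step`), the clipping threshold, the learning rate, and the toy quadratic objective are all hypothetical choices made for the example.

```python
import math
import random


def noise_std(step, eta=0.3, gamma=0.55):
    """Standard deviation of the annealed noise at a given training step.

    Assumed schedule: variance = eta / (1 + step) ** gamma.
    """
    return math.sqrt(eta / (1.0 + step) ** gamma)


def clip_by_norm(grad, max_norm):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)


def noisy_sgd_step(params, grad, step, lr=0.1, max_norm=1.0, rng=random):
    """One SGD update: clip the gradient, add zero-mean Gaussian noise, step."""
    clipped = clip_by_norm(grad, max_norm)
    sigma = noise_std(step)
    # The "one line" of noise injection: perturb each gradient component.
    noisy = [g + rng.gauss(0.0, sigma) for g in clipped]
    return [p - lr * g for p, g in zip(params, noisy)]


# Toy usage (hypothetical objective): minimize f(w) = sum(w_i ** 2),
# whose gradient is 2 * w.
rng = random.Random(0)
w = [3.0, -2.0]
for t in range(2000):
    grad = [2.0 * wi for wi in w]
    w = noisy_sgd_step(w, grad, t, rng=rng)
```

Because the noise standard deviation shrinks as `step` grows, early updates explore more widely while later updates behave almost like plain SGD. Adding the noise after clipping, as the summary describes, means the clipping threshold does not rescale the injected noise.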

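Experimental Results

Research Questions

RQ1: Does adding gradient noise improve training stability and convergence in very deep feedforward and recurrent networks?
RQ2: Does gradient noise improve generalization and robustness across multiple random initializations?
RQ3: Can gradient noise help train deep networks from poor initializations where standard SGD fails?
RQ4: How does gradient noise perform on complex tasks such as question answering and algorithm learning?
RQ5: Does an annealed noise schedule yield measurable improvements over constant (non-annealed) noise?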

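Key Findings

Adding annealed gradient noise makes it possible to train a 20-layer fully connected rectifier (ReLU) network on MNIST with standard stochastic gradient descent, even from a poor initialization.
On a challenging question-answering task, the method achieves a 72% relative reduction in error over a carefully tuned baseline.
In a large-scale binary multiplication experiment (7,290 random restarts in total), training with gradient noise produced more than twice as many accurate models (error < 1%) as the baseline.
The method is more robust across hyperparameter settings and initializations: on the k-th element task, training succeeded in 11.3% of runs with noise versus only 1.3% without.
Gradient noise lowers training loss and improves generalization, which suggests that it helps escape poor local minima in complex loss surfaces.
The technique consistently improves performance across architectures, including fully connected networks, Neural GPUs, and question-answering models.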
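The "relative reduction in error" quoted in the findings above is a ratio, not a difference in percentage points. The short sketch below shows the arithmetic. The helper name `relative_error_reduction` and the two error rates are hypothetical, chosen only so that the ratio comes out near 72%; they are not figures reported in the paper.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error that was eliminated."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates used only to illustrate the arithmetic;
# they are NOT results from the paper.
baseline = 0.10     # 10% error without gradient noise
with_noise = 0.028  # 2.8% error with gradient noise
reduction = relative_error_reduction(baseline, with_noise)
print(f"{reduction:.0%}")
```

With these made-up numbers the absolute improvement is 7.2 percentage points, while the relative reduction is about 72% of the baseline error.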


This review was generated by AI and reviewed by human editors.