[Paper Review] EcoRNN: Efficient Computing of LSTM RNN Training on GPUs

Bojian Zheng, Abhishek Tiwari | arXiv (Cornell University) | May 22, 2018
Advanced Neural Network Applications | Cited by 5
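One-Sentence Summary

EcoRNN proposes Echo, a compiler-based optimization that reduces the GPU memory footprint of LSTM RNN training by recomputing feature maps instead of stashing them, without any changes to the training source code. By estimating both the memory benefit and the runtime overhead of each recomputation, Echo achieves footprint reductions of 1.89X on average and 3.13X at maximum, which can be converted into faster training with a larger batch size, or into more layers, under the same GPU memory budget.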

ABSTRACT

The Long-Short-Term-Memory Recurrent Neural Networks (LSTM RNNs) are a popular class of machine learning models for analyzing sequential data. Their training on modern GPUs, however, is limited by the GPU memory capacity. Our profiling results of the LSTM RNN-based Neural Machine Translation (NMT) model reveal that feature maps of the attention and RNN layers form the memory bottleneck and runtime is unevenly distributed across different layers when training on GPUs. Based on these two observations, we propose to recompute the feature maps rather than stashing them persistently in the GPU memory. While the idea of feature map recomputation has been considered before, existing solutions fail to deliver satisfactory footprint reduction, as they do not address two key challenges. For each feature map recomputation to be effective and efficient, its effect on (1) the total memory footprint, and (2) the total execution time has to be carefully estimated. To this end, we propose *Echo*, a new compiler-based optimization scheme that addresses the first challenge with a practical mechanism that estimates the memory benefits of recomputation over the entire computation graph, and the second challenge by non-conservatively estimating the recomputation overhead leveraging layer specifics. *Echo* reduces the GPU memory footprint automatically and transparently without any changes required to the training source code, and is effective for models beyond LSTM RNNs. We evaluate *Echo* on numerous state-of-the-art machine learning workloads on real systems with modern GPUs and observe footprint reduction ratios of 1.89X on average and 3.13X maximum. Such reduction can be converted into faster training with a larger batch size, savings in GPU energy consumption (e.g., training with one GPU as fast as with four), and/or an increase in the maximum number of layers under the same GPU memory budget.
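To make the central idea of the abstract concrete, here is a minimal sketch of recomputing a feature map instead of stashing it. It is a toy scalar example, not Echo's implementation: the functions `forward` and `backward`, the single `tanh` layer, and all parameter names are assumptions chosen for illustration.

```python
import math


def forward(x, w, v, stash):
    """Compute z = v * tanh(w * x); optionally keep the feature map y."""
    y = math.tanh(w * x)  # feature map that the backward pass needs
    z = v * y
    # Stashing keeps y alive until the backward pass; otherwise only x is kept.
    saved = {"x": x, "y": y} if stash else {"x": x}
    return z, saved


def backward(saved, w, v, dz=1.0):
    """Return (dz/dw, dz/dv), regenerating y when it was not stashed."""
    x = saved["x"]
    # Recomputation: trade one extra tanh evaluation for the memory of y.
    y = saved["y"] if "y" in saved else math.tanh(w * x)
    dv = dz * y
    dw = dz * v * (1.0 - y * y) * x
    return dw, dv


# Both strategies yield identical gradients; only what is held in memory differs.
_, stashed = forward(0.5, 1.2, 3.0, stash=True)
_, recomputed = forward(0.5, 1.2, 3.0, stash=False)
assert backward(stashed, 1.2, 3.0) == backward(recomputed, 1.2, 3.0)
```

In a real network the feature map is a large tensor rather than a scalar, so dropping it frees memory at the cost of rerunning part of the forward pass. That is the trade-off the abstract says must be estimated carefully.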

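Research Motivation and Goals

  • Address the GPU memory bottleneck in LSTM RNN training, particularly for models such as Neural Machine Translation (NMT).
  • Reduce the memory footprint without modifying the model's source code.
  • Under a fixed GPU memory budget, enable more efficient training with larger batch sizes, more layers, or lower energy consumption.
  • Overcome the limitation of earlier recomputation techniques, which struggle to balance memory savings against runtime overhead.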

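Proposed Method

  • Echo takes a compiler-based approach: it automatically identifies feature maps in the computation graph whose recomputation is worthwhile and applies recomputation to them.
  • A practical mechanism estimates the memory benefit of recomputation over the entire computation graph.
  • The recomputation overhead is estimated non-conservatively by leveraging layer specifics, which avoids recomputation that would cost more than it saves.
  • The method is transparent and automatic, requiring no changes to the training source code.
  • Echo is designed to be general and is effective for workloads beyond LSTM RNNs.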
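The trade-off described in the list above can be sketched as a small cost-benefit selection. This is a hypothetical illustration under a toy cost model, not Echo's actual algorithm: the `FeatureMap` fields, the greedy ranking by bytes saved per millisecond, and the overhead budget are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMap:
    name: str
    stash_bytes: int        # memory held if the feature map is stashed
    extra_stash_bytes: int  # inputs that must stay alive to recompute it (assumed)
    recompute_ms: float     # estimated extra runtime to regenerate it (assumed)

    @property
    def net_saved_bytes(self):
        # Benefit over the whole graph: bytes freed minus bytes newly kept alive.
        return self.stash_bytes - self.extra_stash_bytes


def select_for_recompute(candidates, overhead_budget_ms):
    """Greedily pick feature maps with the best net-bytes-saved-per-ms ratio."""
    worthwhile = [fm for fm in candidates if fm.net_saved_bytes > 0]
    ranked = sorted(
        worthwhile,
        key=lambda fm: fm.net_saved_bytes / max(fm.recompute_ms, 1e-9),
        reverse=True,
    )
    chosen, spent_ms = [], 0.0
    for fm in ranked:
        if spent_ms + fm.recompute_ms <= overhead_budget_ms:
            chosen.append(fm)
            spent_ms += fm.recompute_ms
    saved_bytes = sum(fm.net_saved_bytes for fm in chosen)
    return chosen, saved_bytes, spent_ms
```

The point of the sketch is that both estimates matter. A feature map whose recomputation forces other data to stay alive may save nothing, and a cheap-to-stash but slow-to-recompute one is not worth dropping.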

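Experimental Results

Research Questions

  • RQ1: How can feature map recomputation be applied effectively to reduce the GPU memory footprint of LSTM RNN training?
  • RQ2: What is the trade-off between memory savings and recomputation overhead across the different layers of an RNN model?
  • RQ3: Can a compiler-based system optimize memory usage automatically and transparently, without source code changes?
  • RQ4: How does the method compare with existing recomputation techniques in footprint reduction and performance?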

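Key Findings

  • Echo achieves an average footprint reduction ratio of 1.89X across numerous state-of-the-art machine learning workloads.
  • The maximum observed footprint reduction ratio is 3.13X, which extends what can be trained under a fixed GPU memory budget.
  • The reduction can make training with one GPU as fast as training with four.
  • Needing fewer GPUs, or training with a larger batch size, translates into savings in GPU energy consumption.
  • The method is effective for models beyond LSTM RNNs.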
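As a quick worked reading of these ratios, the sketch below converts a footprint reduction ratio into the fraction of memory saved. It assumes the ratio is defined as baseline footprint divided by optimized footprint; the helper name `fraction_saved` is hypothetical.

```python
def fraction_saved(reduction_ratio):
    """Fraction of the baseline footprint freed, assuming ratio = baseline / optimized."""
    if reduction_ratio <= 0:
        raise ValueError("reduction ratio must be positive")
    return 1.0 - 1.0 / reduction_ratio


for ratio in (1.89, 3.13):
    print(f"{ratio}X reduction -> {fraction_saved(ratio):.1%} of the baseline freed")
```

Under that assumption, a 1.89X reduction frees roughly 47% of the baseline footprint and a 3.13X reduction roughly 68%.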

This review was generated by AI and reviewed by human editors.