QUICK REVIEW

[Paper Review] Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu|arXiv (Cornell University)|Apr 21, 2016

Advanced Neural Network Applications | References: 19 | Cited by: 538

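One-Sentence Summary

The paper proposes a computation-graph-based approach for training deep networks with sublinear memory: by trading extra forward computation for memory, it needs O(sqrt(n)) memory for an n-layer network, and O(log n) in the extreme case. The approach is validated on very deep ResNets and on LSTMs over long sequences.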

ABSTRACT

We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

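Motivation and Goals

Reduce memory consumption during training by storing fewer intermediate feature maps and gradients.
Develop computation-graph-based memory optimizations, including in-place operations and memory sharing.
Introduce a controlled trade of computation for memory so that deeper networks can be trained.
Provide practical guidelines for integrating memory optimization into deep learning frameworks, along with a planned open-source release.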

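Proposed Method

Analyze the computation graph to enable in-place operations and memory sharing for intermediate results.
Construct the gradient graph with a mirror count, so that intermediate results dropped in the forward pass are recomputed during backpropagation.
Split the network into k segments and store only the segment outputs, giving an O(sqrt(n)) memory plan at the cost of an additional forward pass.
Generalize the approach to arbitrary graphs through memory-optimized gradient graph construction (Alg. 2) and memory planning with a budget (Alg. 3).
Give a recursive view showing how repeated subdivision brings memory down to O(log n) (the k=1 case).
Provide guidelines for implementing these techniques in existing DL frameworks, together with a planned open-source implementation.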
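
The segment-based plan can be illustrated with a minimal sketch on a linear chain. This is not the authors' implementation: the toy scalar layer `tanh(c * x)`, the function names, and the way stored activations are counted are all assumptions made for illustration. The forward pass keeps only the input of each segment, and the backward pass recomputes the activations inside one segment at a time.

```python
import math


def layer(x, c):
    """Toy layer (an assumption for illustration): y = tanh(c * x)."""
    return math.tanh(c * x)


def layer_grad(x, c):
    """Derivative dy/dx of the toy layer, evaluated at input x."""
    return c * (1.0 - math.tanh(c * x) ** 2)


def full_backprop(x0, coeffs):
    """Baseline: keep every layer input, so n activations are stored."""
    inputs = []
    x = x0
    for c in coeffs:
        inputs.append(x)
        x = layer(x, c)
    grad = 1.0
    for x_in, c in zip(reversed(inputs), reversed(coeffs)):
        grad *= layer_grad(x_in, c)
    return x, grad, len(inputs)


def checkpointed_backprop(x0, coeffs, seg_len):
    """Keep only segment inputs, then recompute inside each segment."""
    starts = list(range(0, len(coeffs), seg_len))
    checkpoints = []
    x = x0
    for i, c in enumerate(coeffs):
        if i % seg_len == 0:
            checkpoints.append(x)  # input of a new segment
        x = layer(x, c)
    out = x

    grad = 1.0
    peak = 0
    # Walk the segments from last to first.
    for start, x_seg in zip(reversed(starts), reversed(checkpoints)):
        seg = coeffs[start:start + seg_len]
        inputs = []
        x = x_seg
        for c in seg:  # extra forward work: recompute this segment
            inputs.append(x)
            x = layer(x, c)
        peak = max(peak, len(checkpoints) + len(inputs))
        for x_in, c in zip(reversed(inputs), reversed(seg)):
            grad *= layer_grad(x_in, c)
    return out, grad, peak


coeffs = [0.9 + 0.001 * i for i in range(100)]
_, g_full, mem_full = full_backprop(0.5, coeffs)
_, g_ckpt, mem_ckpt = checkpointed_backprop(0.5, coeffs, seg_len=10)
print(mem_full, mem_ckpt)  # stored activations: 100 vs 20
```

With 100 layers and segments of length 10, the sketch holds 10 checkpoints plus at most 10 recomputed inputs, while the gradient matches the baseline. Recomputing every segment once amounts to roughly one additional forward pass.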

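Experimental Results

Research Questions

RQ1: Can intermediate feature maps and gradients be stored in sublinear memory while keeping training correct?
RQ2: Given a network depth and a memory budget, what is the best trade-off between memory and computation?
RQ3: How can the computation graph be analyzed to enable in-place operations and memory sharing during training?
RQ4: Does the approach scale to very deep architectures (1,000+ layers) and long-sequence models (such as LSTMs) with manageable overhead?
RQ5: What are practical guidelines for integrating memory optimization into current DL frameworks?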
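
For RQ2, the trade-off on a linear chain can be worked out under simplifying assumptions that are not spelled out in this summary: every layer output costs one unit of memory, the n layers are split into segments of equal length, and k denotes the number of segments. The symbol M(k) is introduced here only for illustration.

```latex
% k stored segment outputs, plus n/k activations recomputed inside one segment
M(k) \approx k + \frac{n}{k}

% minimize over k
\frac{dM}{dk} = 1 - \frac{n}{k^{2}} = 0
\quad\Longrightarrow\quad
k = \sqrt{n},
\qquad
M(\sqrt{n}) \approx 2\sqrt{n} = O(\sqrt{n})
```

Under these assumptions, a chain with n = 1000 layers would hold roughly 63 units instead of 1000.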

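Key Findings

Memory cost drops from linear to sublinear: O(sqrt(n)) memory for an n-layer network, at the cost of only one extra forward pass per mini-batch.
In the extreme case, memory can be reduced to O(log n) with as little as O(n log n) extra forward computation.
On ImageNet, the method reduces the memory cost of a 1,000-layer ResNet from 48G to 7G.
Significant memory reduction is also observed when training complex RNNs (LSTMs) on long sequences.
Compared with the linear-memory baseline, the sublinear plan adds about 30% running time, a modest overhead for the memory saved.
The method is compatible with existing frameworks and can be combined with other memory optimization techniques.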
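
The O(log n) figure can be checked numerically with a small sketch. It assumes a recurrence of the form g(n) = k + g(n / (k + 1)) for the memory of a recursively subdivided chain; the base case g(n) = 0 for n <= 1 and both function names are assumptions chosen for illustration, so the numbers are unit counts, not measured memory.

```python
def recursive_memory_units(n: int, k: int = 1) -> int:
    """Memory units under g(n) = k + g(n // (k + 1)), with g(n) = 0 for n <= 1.

    The base case is an assumption, chosen so that the result equals
    k * log_{k+1}(n) exactly when n is a power of k + 1.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    units = 0
    while n > 1:
        units += k
        n //= k + 1
    return units


def recursion_levels(n: int, k: int = 1) -> int:
    """Number of subdivision levels, about log_{k+1}(n)."""
    return recursive_memory_units(n, k) // k


for n in (16, 1024, 1_048_576):
    levels = recursion_levels(n, k=1)
    # Rough extra forward work: about one pass over n layers per level.
    print(n, recursive_memory_units(n, k=1), n * levels)
```

For k = 1 the unit count grows as log2(n), and the rough forward-work estimate of one pass per level grows as n log n, which is consistent with the extreme-case finding above.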

This review was generated by AI and reviewed by human editors.