QUICK REVIEW

[Paper Review] Memory-Efficient Implementation of DenseNets

Geoff Pleiss, Danlu Chen|arXiv (Cornell University)|2017. 07. 21.

Advanced Neural Network Applications|References: 12|Citations: 86

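One-Line Summary

This paper introduces recomputation and memory-sharing strategies that reduce feature-map memory from quadratic to linear in network depth, making it possible to train DenseNets hundreds of layers deep with a relatively small time overhead.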

ABSTRACT

The DenseNet architecture is highly computationally efficient as a result of feature reuse. However, a naive DenseNet implementation can require a significant amount of GPU memory: If not properly managed, pre-activation batch normalization and contiguous convolution operations can produce feature maps that grow quadratically with network depth. In this technical report, we introduce strategies to reduce the memory consumption of DenseNets during training. By strategically using shared memory allocations, we reduce the memory cost for storing feature maps from quadratic to linear. Without the GPU memory bottleneck, it is now possible to train extremely deep DenseNets. Networks with 14M parameters can be trained on a single GPU, up from 4M. A 264-layer DenseNet (73M parameters), which previously would have been infeasible to train, can now be trained on a single workstation with 8 NVIDIA Tesla M40 GPUs. On the ImageNet ILSVRC classification dataset, this large DenseNet obtains a state-of-the-art single-crop top-1 error of 20.26%.

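Motivation and Objectives

Motivates DenseNets for high-capacity models through their parameter efficiency and feature reuse.
Identifies the quadratic memory bottleneck in standard DenseNet training.
Proposes memory-sharing strategies that reduce training memory from quadratic to linear.
Demonstrates training of very deep DenseNets under a fixed memory budget, with competitive ImageNet performance.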

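Proposed Method

Identifies two sources of quadratic memory growth in DenseNets: pre-activation batch normalization and contiguous concatenation.
Introduces Shared Memory Storage 1 for concatenation outputs and Shared Memory Storage 2 for batch normalization outputs.
Recomputes the concatenation and batch normalization during back-propagation to refill the shared storage, instead of storing every intermediate output.
Shares gradient storage across layers to avoid quadratic growth in gradient memory.
Measures memory and time overhead, showing substantial memory savings at a cost of roughly 15-20% additional training time.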
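
The recomputation idea can be illustrated with a toy dense block in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: each "feature map" is a single float, `act` is a cheap stand-in for batch normalization plus ReLU, and the names `forward`, `backward`, `buf`, and `saved` are hypothetical. The naive path keeps a private copy of every layer's normalized input, while the efficient path overwrites one shared buffer in the forward pass and refills it by recomputation in the backward pass.

```python
def act(x):
    # Cheap elementwise op standing in for batch norm + ReLU (assumption).
    return x if x > 0.0 else 0.0


def forward(x0, weights, buf, saved=None):
    """Toy dense block: layer l reads features 0..l and emits feature l+1."""
    feats = [x0]
    for w in weights:
        n = len(feats)
        for j in range(n):
            buf[j] = act(feats[j])      # overwrite the shared buffer
        if saved is not None:
            saved.append(buf[:n])       # naive path: keep a private copy
        feats.append(sum(w[j] * buf[j] for j in range(n)))
    return feats


def backward(feats, weights, buf, saved=None):
    """Gradient of the last feature with respect to every weight."""
    g = [0.0] * len(feats)              # gradient w.r.t. each feature
    g[-1] = 1.0
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        n = l + 1
        if saved is not None:
            a = saved[l]                # naive path: read the stored copy
        else:
            for j in range(n):
                buf[j] = act(feats[j])  # efficient path: recompute
            a = buf
        g_out = g[l + 1]
        grads[l] = [g_out * a[j] for j in range(n)]
        for j in range(n):
            if feats[j] > 0.0:          # derivative of act is 1 for x > 0
                g[j] += g_out * weights[l][j]
    return grads


weights = [[0.5], [1.0, -3.0], [0.3, 0.7, 1.5]]
buf = [0.0] * len(weights)              # sized for the widest layer input

saved = []
feats = forward(2.0, weights, buf, saved)
naive_grads = backward(feats, weights, buf, saved)

feats = forward(2.0, weights, buf)      # nothing saved per layer
shared_grads = backward(feats, weights, buf)

print(naive_grads == shared_grads)
print(sum(len(s) for s in saved), "stored values vs", len(buf), "shared slots")
```

Both paths produce the same gradients; the difference is that the naive path holds one saved copy per layer, whereas the shared buffer only ever needs room for the widest layer input. The price is the extra `act` evaluations in the backward pass, which corresponds to the additional training time the review reports.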

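Experimental Results

Research Questions

RQ1: Can DenseNets be trained effectively when memory is reduced through shared-storage reuse and recomputation?
RQ2: How much memory can be saved (quadratic to linear), and at what computational cost?
RQ3: What are the practical limits on depth and parameter count achievable on ImageNet with memory-efficient DenseNets?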

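Key Results

With the proposed shared-memory strategies, memory consumption grows linearly with depth.
In LuaTorch, a 160-layer model uses about 22% of the memory of the naive implementation, which makes it possible to train a model of roughly 340 layers within a 12 GB budget.
In PyTorch, very deep DenseNets of nearly 500 layers can be trained on a single GPU.
A DenseNet trained with the efficient implementation reaches a top-1 error of 20.26% on ImageNet at 264 layers (k=48, 73M parameters).
The deepest cosine DenseNet surpasses the previous state of the art with a top-1 error of 20.26%.
Sharing gradient storage is beneficial at no time cost; adding the shared BN/concat storage introduces roughly 15-20% time overhead.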
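
The quadratic-versus-linear scaling in the results above can be checked with simple arithmetic. The sketch below is a simplified counting model, not the paper's measurement: it counts channels only (ignoring batch size, spatial resolution, bottleneck layers, and gradient storage), and the function names and the parameters `growth_rate` and `init_channels` are chosen here for illustration.

```python
def naive_feature_maps(num_layers, growth_rate, init_channels):
    """Channels kept when every concatenation and BN output is stored."""
    total = init_channels + num_layers * growth_rate  # block input + layer outputs
    for layer in range(num_layers):
        in_channels = init_channels + layer * growth_rate
        total += 2 * in_channels  # one concat copy + one BN output per layer
    return total


def shared_feature_maps(num_layers, growth_rate, init_channels):
    """Channels kept when concat and BN outputs live in two shared buffers."""
    total = init_channels + num_layers * growth_rate
    widest = init_channels + (num_layers - 1) * growth_rate
    return total + 2 * widest     # both buffers sized for the widest layer


for depth in (8, 16, 32, 64):
    naive = naive_feature_maps(depth, growth_rate=12, init_channels=24)
    shared = shared_feature_maps(depth, growth_rate=12, init_channels=24)
    print(depth, naive, shared, round(shared / naive, 3))
```

In this model, doubling the depth multiplies the naive count by more than three while the shared count stays below double, which is the qualitative behavior the review attributes to the shared-storage strategy.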


This review was generated by AI and reviewed by a human editor.