QUICK REVIEW

[Paper Review] Training Large Neural Networks with Constant Memory using a New Execution Algorithm

Bharadwaj Pudipeddi, Maral Mesmakhosroshahi|arXiv (Cornell University)|2020. 02. 13.

Ferroelectric and Negative Capacitance Devices|References 15|Citations 24

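One-line summary

This paper introduces L2L (layer-to-layer), a new execution algorithm that trains large neural networks with constant memory by offloading the whole model to a CPU-based eager param-server (EPS). Only the parameters and activations of the currently executing layer are kept in GPU memory. The method reports 45% lower memory use and 40% higher throughput than the state-of-the-art baseline, and it fits models of up to 50 billion parameters on a single 16GB V100 GPU with 512GB of CPU memory, without model partitioning.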

ABSTRACT

Widely popular transformer-based NLP models such as BERT and Turing-NLG have enormous capacity trending to billions of parameters. Current execution methods demand brute-force resources such as HBM devices and high speed interconnectivity for data parallelism. In this paper, we introduce a new relay-style execution technique called L2L (layer-to-layer) where at any given moment, the device memory is primarily populated only with the executing layer(s)'s footprint. The model resides in the DRAM memory attached to either a CPU or an FPGA as an entity we call eager param-server (EPS). To overcome the bandwidth issues of shuttling parameters to and from EPS, the model is executed a layer at a time across many micro-batches instead of the conventional method of minibatches over whole model. L2L is implemented using 16GB V100 devices for BERT-Large running it with a device batch size of up to 256. Our results show 45% reduction in memory and 40% increase in the throughput compared to the state-of-the-art baseline. L2L is also able to fit models up to 50 Billion parameters on a machine with a single 16GB V100 and 512GB CPU memory and without requiring any model partitioning. L2L scales to arbitrary depth allowing researchers to develop on affordable devices which is a big step toward democratizing AI. By running the optimizer in the host EPS, we show a new form of mixed precision for faster throughput and convergence. In addition, the EPS enables dynamic neural architecture approaches by varying layers across iterations. Finally, we also propose and demonstrate a constant memory variation of L2L and we propose future enhancements. This work has been performed on GPUs first, but also targeted towards all high TFLOPS/Watt accelerators.
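The abstract's point about running "a layer at a time across many micro-batches" can be illustrated by counting parameter loads. The following is a minimal sketch, assuming a device that holds exactly one layer at a time; the function name `count_layer_loads` and the two schedule labels are hypothetical and do not come from the paper.

```python
def count_layer_loads(num_layers: int, num_microbatches: int, order: str) -> int:
    """Count host-to-device layer loads in one forward pass.

    Assumption: the device keeps a single layer resident, so a load is
    needed whenever the next unit of work uses a different layer.
    """
    if order == "layer_major":
        # L2L-style: finish every micro-batch on a layer, then move on.
        schedule = [layer for layer in range(num_layers)
                    for _ in range(num_microbatches)]
    elif order == "microbatch_major":
        # Conventional: push one micro-batch through all layers, then the next.
        schedule = [layer for _ in range(num_microbatches)
                    for layer in range(num_layers)]
    else:
        raise ValueError(f"unknown order: {order}")

    loads = 0
    resident = None  # layer currently held on the device
    for layer in schedule:
        if layer != resident:
            loads += 1
            resident = layer
    return loads


layer_major_loads = count_layer_loads(24, 8, "layer_major")
microbatch_major_loads = count_layer_loads(24, 8, "microbatch_major")
```

With 24 layers and 8 micro-batches, the layer-major schedule loads each layer once, while the micro-batch-major schedule reloads every layer for every micro-batch. This is why the layer-at-a-time ordering eases the bandwidth cost of shuttling parameters to and from the EPS.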

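Motivation and goals

To address the growing memory and compute demands of large transformer models (e.g., BERT, GPT-3) that exceed the capacity of a standard GPU.
To make it possible to train multi-billion-parameter models on low-cost hardware, without high-bandwidth memory (HBM) devices or model partitioning.
To develop a constant-memory execution method that scales to arbitrary model depth.
To relieve memory pressure by moving model weights and optimizer state to a CPU-based eager param-server (EPS), and to improve throughput by executing the model one layer at a time.
To enable mixed-precision training and efficient data parallelism through a new low-overhead parameter transfer mechanism between GPU and CPU.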

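Proposed method

L2L uses relay-style execution: only the current layer's parameters and activations are held in GPU memory, while the full model stays in the eager param-server (EPS), which resides in DRAM attached to a CPU or an FPGA.
The EPS preloads and transfers the next layer's parameters before that layer executes, which minimizes idle time, and an inner loop over micro-batches reduces how often parameters have to be transferred.
Instead of running a whole minibatch through the whole model, the method runs the micro-batches through one layer at a time, which shrinks the memory profile and keeps memory use constant regardless of model depth.
The EPS handles gradient reduction and weight updates in parallel with GPU computation, which enables a new form of mixed-precision training aimed at faster convergence.
L2Lp, a future extension, performs gradient reduction and weight updates on the EPS fully in parallel and uses the high-speed NVLink only to load the next layer, minimizing the dependence on bandwidth.
Because layers execute independently and can be modified on the fly, the method supports dynamic neural architecture search.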
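The relay described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: each "layer" is a single scalar weight computing y = w * x, the loss is mean squared error, the optimizer is plain SGD, and the names `EagerParamServer`, `fetch`, `update`, and `l2l_step` are hypothetical. The sketch does not model asynchronous prefetch, mixed precision, or where the activation stash lives.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EagerParamServer:
    """Hypothetical host-side store: owns every layer's weight and runs the optimizer."""
    weights: List[float]
    lr: float = 0.02

    def fetch(self, layer: int) -> float:
        # Stands in for a host-to-device copy of one layer's parameters.
        return self.weights[layer]

    def update(self, layer: int, grad: float) -> None:
        # Plain SGD on the host, applied as soon as a layer's gradient is ready.
        self.weights[layer] -= self.lr * grad


def l2l_step(eps: EagerParamServer,
             microbatches: List[List[float]],
             targets: List[List[float]]) -> float:
    """Run one training step layer by layer and return the mean loss."""
    num_layers = len(eps.weights)
    n = sum(len(mb) for mb in microbatches)

    # stash[l] holds the inputs of layer l for every micro-batch.
    stash = [microbatches]
    for layer in range(num_layers):
        w = eps.fetch(layer)  # only this layer's weight is "on device"
        stash.append([[w * x for x in mb] for mb in stash[layer]])

    outputs = stash[-1]
    loss = sum(0.5 * (y - t) ** 2
               for out, tgt in zip(outputs, targets)
               for y, t in zip(out, tgt)) / n
    grads = [[(y - t) / n for y, t in zip(out, tgt)]
             for out, tgt in zip(outputs, targets)]

    # Backward relay: visit layers in reverse, all micro-batches per layer.
    for layer in reversed(range(num_layers)):
        w = eps.fetch(layer)
        grad_w = 0.0
        grads_in = []
        for mb_in, mb_grad in zip(stash[layer], grads):
            grad_w += sum(x * g for x, g in zip(mb_in, mb_grad))
            grads_in.append([w * g for g in mb_grad])
        eps.update(layer, grad_w)  # the optimizer runs on the host side
        grads = grads_in
    return loss


eps = EagerParamServer(weights=[0.5, 0.5])
inputs = [[1.0, 2.0], [3.0, 4.0]]    # two micro-batches
targets = [[2.0, 4.0], [6.0, 8.0]]   # target function: y = 2x
losses = [l2l_step(eps, inputs, targets) for _ in range(50)]
```

Because the step only ever asks the server for one layer at a time, replacing or reordering entries in `eps.weights` between steps changes the network without touching the loop, which is the property the last point in the list relies on.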

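Experimental results

Research questions

RQ1: Can large transformer models be trained with constant memory on a standard GPU by offloading the model to a CPU-based param-server?
RQ2: How does micro-batch-based, layer-by-layer execution compare with conventional minibatch training in memory use and throughput?
RQ3: Does L2L scale to very deep models, such as 384 layers, on a single 16GB V100 without running out of memory?
RQ4: How much faster convergence and higher throughput does the EPS-based architecture enable through optimized parameter transfer and mixed-precision training?
RQ5: Can L2L support dynamic neural architecture search by allowing layers to be modified at each iteration, without recompilation or reconfiguration?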

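Key results

When training BERT-Large on a single 16GB V100 GPU, L2L uses 45% less GPU memory than the state-of-the-art baseline.
It also delivers 40% higher training throughput than the baseline while reducing memory pressure.
L2L trained BERT-Large on a single 16GB V100 with a device batch size of up to 256, whereas the baseline struggles at a batch size of 2.
Without model partitioning, the method fits models of up to 50 billion parameters on a single 16GB V100 with 512GB of CPU memory, without out-of-memory errors.
L2L keeps memory use constant regardless of model depth and can train models of up to 384 layers without memory overflow.
Validation curves show L2L converging faster than the baseline in both FP32 and mixed-precision modes, suggesting improved training efficiency.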
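A back-of-the-envelope check helps put the 50-billion-parameter figure in context. The sketch below counts weights only, assumes 4 bytes per FP32 parameter and 2 bytes per FP16 parameter, and uses decimal gigabytes (1 GB = 10^9 bytes). Gradients, optimizer state, and activations are not modelled, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 50e9        # 50 billion parameters, as in the results above
DEVICE_MEMORY_GB = 16    # one V100
HOST_MEMORY_GB = 512     # CPU memory of the machine

fp32_gb = weight_memory_gb(NUM_PARAMS, 4)
fp16_gb = weight_memory_gb(NUM_PARAMS, 2)

print(f"FP32 weights: {fp32_gb:.0f} GB, FP16 weights: {fp16_gb:.0f} GB")
print("fits on device:", fp16_gb <= DEVICE_MEMORY_GB)
print("fits on host:", fp32_gb <= HOST_MEMORY_GB)
```

Under these assumptions, even the half-precision weights are several times larger than the device memory, while the full-precision weights stay within the host memory. This is consistent with keeping the model on the host and streaming one layer at a time to the GPU.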

This review was generated by AI and reviewed by a human editor.