QUICK REVIEW

[Paper Review] Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

Weijie Zhao, Deping Xie|arXiv (Cornell University)|2020. 03. 12.

Advanced Image and Video Retrieval Techniques | References: 49 | Citations: 76

One-Line Summary

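The paper presents a distributed hierarchical GPU parameter server (HBM-PS, MEM-PS, SSD-PS) for training terabyte-scale sparse CTR models. It trains 1.8-4.8x faster than an MPI cluster and improves the price-performance ratio by 4-9x.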

ABSTRACT

Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in online advertising industries can have terabyte-scale parameters that do not fit in the GPU memory nor the CPU main memory on a computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than an MPI-cluster solution.

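Motivation and Objectives

Establishes the need to train ultra-large CTR models whose parameters exceed both the GPU memory and the CPU main memory of a single node.
Proposes a three-tier hierarchical storage design (HBM, main memory, SSD) that enables GPU-centric training of large sparse models.
Develops efficient intra-node and inter-node GPU parameter synchronization to speed up training.
Evaluates scalability on real-world ads datasets and compares against a standard MPI-cluster baseline.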

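Proposed Method

Designs a 4-stage pipeline that overlaps data transfer, parameter loading, and GPU computation.
Implements a multi-GPU distributed hash table in HBM, spread across GPUs, that stores the working parameters and applies updates atomically.
Uses RDMA for inter-node GPU parameter synchronization via all-reduce operations.
Clusters parameters into files on SSD, with file-level parameter management and background compaction to handle stale data.
Partitions parameters across GPUs and nodes with modulo hashing, which maps each key to its storage location.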
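
The modulo-hashing partition mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `locate` and `partition`, and the specific choice of taking the node from `key % num_nodes` and the GPU from the quotient, are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def locate(key: int, num_nodes: int, gpus_per_node: int) -> Tuple[int, int]:
    """Map a sparse-feature key to a (node, gpu) pair with modulo hashing.

    Assumed layout for illustration: the node comes from the key modulo the
    node count, and the GPU from the remaining quotient modulo the GPU count.
    """
    node = key % num_nodes
    gpu = (key // num_nodes) % gpus_per_node
    return node, gpu


def partition(
    keys: Iterable[int], num_nodes: int, gpus_per_node: int
) -> Dict[Tuple[int, int], List[int]]:
    """Group keys by the (node, gpu) shard that owns them."""
    shards: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for key in keys:
        shards[locate(key, num_nodes, gpus_per_node)].append(key)
    return dict(shards)


if __name__ == "__main__":
    # 4 nodes with 8 GPUs each, matching the cluster size reported in the review.
    shards = partition(range(64), num_nodes=4, gpus_per_node=8)
    for shard, owned in sorted(shards.items()):
        print(shard, owned)
```

Because the owner of a key is a pure function of the key, any worker can compute where a parameter lives without consulting a directory; the trade-off is that changing the number of nodes or GPUs remaps most keys.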

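Experimental Results

Research Questions

RQ1: Can a hierarchical GPU parameter server train terabyte-scale CTR models efficiently without hurting accuracy?
RQ2: What are the performance and cost benefits of integrating HBM-PS, MEM-PS, and SSD-PS compared with traditional MPI-based training?
RQ3: How do data transfer, caching, and I/O strategies affect overall training throughput on real-world ads data?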

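Key Results

A 4-node hierarchical GPU parameter server achieves a 1.8-4.8x training speedup over the MPI-cluster baseline across 5 CTR models.
The cost-normalized speedup over the MPI solution ranges from 4.4x to 9.0x.
The hierarchical system's relative AUC is within 0.1% of the MPI baseline, and some models slightly exceed it, suggesting lossless training.
In HBM-PS, pull/push time scales with the number of nonzero features, while training time scales with the number of dense parameters.
MEM-PS and SSD-PS reduce the impact of SSD I/O through caching and file-level parameter management, enabling training of models that do not fit in main memory.
The experiments use 4 GPU nodes (8 × 32 GB HBM per node) and 5 CTR models (8e9 to 1e11 sparse parameters) to demonstrate scalability and efficiency.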
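
To make the three-tier idea concrete, the following toy sketch stages a mini-batch's parameters from an "SSD" dictionary through an in-memory cache into an "HBM" working set. It is an illustration under stated assumptions, not the authors' system: the class `TieredStore`, the plain LRU eviction policy, and the use of Python dictionaries in place of GPU hash tables and SSD files are all hypothetical simplifications.

```python
from collections import OrderedDict
from typing import Dict, Iterable


class TieredStore:
    """Toy three-tier parameter store: 'hbm' working set, LRU 'mem' cache, 'ssd' dict."""

    def __init__(self, mem_capacity: int, ssd: Dict[int, float]) -> None:
        self.hbm: Dict[int, float] = {}   # parameters of the current mini-batch
        self.mem: OrderedDict = OrderedDict()  # LRU cache (assumed policy)
        self.mem_capacity = mem_capacity
        self.ssd = ssd                     # stands in for parameter files on SSD
        self.ssd_reads = 0                 # counts cache misses that hit "SSD"

    def _put_mem(self, key: int, value: float) -> None:
        """Insert into the memory tier, writing evicted entries back to 'SSD'."""
        self.mem[key] = value
        self.mem.move_to_end(key)
        while len(self.mem) > self.mem_capacity:
            old_key, old_value = self.mem.popitem(last=False)
            self.ssd[old_key] = old_value

    def _get(self, key: int) -> float:
        """Read through the memory tier, falling back to 'SSD' on a miss."""
        if key in self.mem:
            self.mem.move_to_end(key)
            return self.mem[key]
        self.ssd_reads += 1
        value = self.ssd.get(key, 0.0)  # unseen parameters start at 0.0
        self._put_mem(key, value)
        return value

    def pull(self, keys: Iterable[int]) -> Dict[int, float]:
        """Stage the parameters referenced by a mini-batch into 'HBM'."""
        self.hbm = {key: self._get(key) for key in keys}
        return self.hbm

    def push(self, updates: Dict[int, float]) -> None:
        """Apply updated values to 'HBM' and propagate them to the memory tier."""
        for key, value in updates.items():
            self.hbm[key] = value
            self._put_mem(key, value)


if __name__ == "__main__":
    store = TieredStore(mem_capacity=2, ssd={1: 0.1, 2: 0.2, 3: 0.3})
    store.pull([1, 2])
    store.pull([1, 2])      # served from the memory tier, no extra "SSD" reads
    print(store.ssd_reads)
```

Repeated pulls of the same keys are served from the memory tier, which is the effect the review attributes to caching in MEM-PS: only keys that miss the cache incur SSD reads.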

This review was generated by AI and reviewed by a human editor.