QUICK REVIEW

[Paper Review] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Wei Gao, Qinghao Hu | arXiv (Cornell University) | 2022-05-24

Cloud Computing and Resource Management | Citations: 21

One-Line Summary

This survey analyzes DL workload scheduling in GPU datacenters, classifies training and inference schedulers, outlines the open challenges, and proposes future directions.

ABSTRACT

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU datacenter. An efficient scheduler design for such GPU datacenter is crucially important to reduce the operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads can not support DL workloads to fully utilize the GPU resources. Recently, substantial schedulers are proposed to tailor for DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features. Finally, we prospect several promising future research directions. More detailed summary with the surveyed paper and code links can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers

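Research Motivation and Objectives

Identify the DL workload characteristics that affect scheduling in GPU datacenters.
Survey existing schedulers for DL training and inference across categories of scheduling objectives.
Analyze the mechanisms used to solve DL-specific scheduling problems.
Point out limitations and suggest directions for future scheduler design.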

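Approach

Classifies scheduling solutions along two axes: objectives (efficiency, fairness, latency) and resource usage (GPU heterogeneity, sharing, memory, interconnect).
Summarizes representative DL training and inference schedulers from 2017 to 2022 and maps each to the DL-specific challenges it addresses.
Analyzes design considerations such as placement, preemption, profiling, and elasticity.
Discusses enabling techniques such as performance modeling, trace analysis, and workload characterization.
Contrasts DL scheduling with traditional HPC and big-data schedulers to identify its unique requirements.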
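
The two-axis classification described above can be pictured as a small tagging scheme. The following is a minimal sketch, not the survey's own tooling: the `SchedulerEntry` type, the `group_by` helper, and the entries `sched-a`, `sched-b`, and `sched-c` are all hypothetical placeholders and do not correspond to any scheduler surveyed in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerEntry:
    """One surveyed scheduler, tagged along the two taxonomy axes (hypothetical type)."""
    name: str
    workload: str                 # "training" or "inference"
    objectives: frozenset         # e.g. {"efficiency", "fairness", "latency"}
    resource_features: frozenset  # e.g. {"heterogeneity", "gpu_sharing", "memory"}


def group_by(entries, axis):
    """Group scheduler names by (workload, tag) along one axis of the taxonomy."""
    groups = defaultdict(list)
    for entry in entries:
        for tag in sorted(getattr(entry, axis)):
            groups[(entry.workload, tag)].append(entry.name)
    return dict(groups)


# Placeholder entries for illustration only; not real schedulers.
ENTRIES = [
    SchedulerEntry("sched-a", "training",
                   frozenset({"efficiency"}), frozenset({"gpu_sharing"})),
    SchedulerEntry("sched-b", "training",
                   frozenset({"efficiency", "fairness"}), frozenset({"heterogeneity"})),
    SchedulerEntry("sched-c", "inference",
                   frozenset({"latency"}), frozenset({"gpu_sharing", "memory"})),
]

by_objective = group_by(ENTRIES, "objectives")
by_resource = group_by(ENTRIES, "resource_features")
```

Keeping workload type in the grouping key reflects the review's point that training and inference are classified separately, since their objectives and resource needs differ.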

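Results

Research Questions

RQ1: What are the main challenges in scheduling DL workloads in GPU datacenters?
RQ2: Do existing schedulers share common strategies for achieving their objectives?
RQ3: How should schedulers be improved to keep pace with the rapid development of DL techniques?
RQ4: What are the main design trade-offs between training and inference scheduling in DL datacenters?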

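Key Findings

DL training and inference each have distinct objectives and resource needs that shape scheduler design.
Many schedulers use performance modeling, profiling, and workload traces to improve their decisions.
DL-specific challenges include intensive resource use, heterogeneous affinity, and preemption overhead for training; for inference, they include low utilization and the latency-accuracy-cost trade-off.
Existing solutions are often ad hoc and tailored to a single objective, and unified approaches that span DL workloads remain limited.
The survey proposes future directions for handling the complexity of DL workloads in GPU datacenters.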
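
To make the latency-accuracy-cost trade-off for inference concrete, here is a minimal sketch of one possible policy: among model variants that satisfy a latency SLO and a cost budget, serve the most accurate one. This is an illustrative assumption, not a method taken from the survey; the `Variant` type, the `pick_variant` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Variant:
    """A servable model variant with hypothetical profiled characteristics."""
    name: str
    latency_ms: float    # per-request latency
    accuracy: float      # validation accuracy in [0, 1]
    cost_per_1k: float   # serving cost per 1000 requests (arbitrary units)


def pick_variant(variants: Sequence[Variant],
                 latency_slo_ms: float,
                 budget_per_1k: float) -> Optional[Variant]:
    """Return the most accurate variant within the SLO and budget, or None."""
    feasible = [v for v in variants
                if v.latency_ms <= latency_slo_ms and v.cost_per_1k <= budget_per_1k]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties by lower cost.
    return max(feasible, key=lambda v: (v.accuracy, -v.cost_per_1k))


# Made-up numbers for illustration only.
VARIANTS = [
    Variant("small", latency_ms=5.0, accuracy=0.70, cost_per_1k=0.10),
    Variant("medium", latency_ms=12.0, accuracy=0.76, cost_per_1k=0.25),
    Variant("large", latency_ms=30.0, accuracy=0.80, cost_per_1k=0.60),
]

choice = pick_variant(VARIANTS, latency_slo_ms=15.0, budget_per_1k=1.0)
```

Tightening either constraint pushes the choice toward a smaller, less accurate variant, which is the tension the finding above describes.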


This review was generated by AI and checked by a human editor.