QUICK REVIEW

[Paper Review] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Aurick Qiao, Sang Keun Choe | arXiv (Cornell University) | 2020-08-27

Stochastic Gradient Optimization Techniques | Citations: 30

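One-Line Summary

Pollux co-optimizes per-job configuration (batch size, learning rate, gradient accumulation) with cluster-wide resource allocation, modeling goodput (system throughput × statistical efficiency) to improve overall DL training performance and fairness.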

ABSTRACT

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize the provided resources. Pollux simultaneously considers both aspects. By monitoring the status of each job during training, Pollux models how their goodput (a novel metric we introduce that combines system throughput with statistical efficiency) would change by adding or removing resources. Leveraging this information, Pollux dynamically (re-)assigns resources to improve cluster-wide goodput, while respecting fairness and continually optimizing each DL job to better utilize those resources. In experiments with real DL jobs and with trace-driven simulations, Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers, even when they are provided with ideal resource and training configurations for every job. Pollux promotes fairness among DL jobs competing for resources based on a more meaningful measure of useful job progress, and reveals a new opportunity for reducing DL cost in cloud environments. Pollux is implemented and publicly available as part of an open-source project at https://github.com/petuum/adaptdl.

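Motivation and Objectives

Present the challenge of efficiently scheduling DL workloads on shared clusters.
Propose a co-adaptive scheduling framework that jointly optimizes each job's training parameters and its resource allocation.
Define goodput as a metric that captures both the system throughput and the statistical efficiency of DL training.
Model throughput and efficiency to enable predictive scheduling and dynamic resource reallocation.
Demonstrate substantially shorter completion times and improved fairness compared with state-of-the-art schedulers.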

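Proposed Method

Define goodput for DL training as the product of system throughput and statistical efficiency.
Model statistical efficiency through the pre-conditioned gradient noise scale (PGNS) and use it to predict efficiency across batch sizes.
Model system throughput as a function of per-GPU batch size, resource allocation, and gradient accumulation steps, including the overlap between gradient computation and synchronization.
Implement a two-level scheduler: a per-job PolluxAgent that fits the throughput and efficiency models and tunes local training parameters, and a cluster-wide PolluxSched that reallocates resources to maximize cluster goodput while accounting for fairness and reallocation cost.
Provide a plug-in LR scaling interface that accommodates rules such as AdaScale, linear scaling, and square-root scaling.
Validate the models on real DL jobs and in trace-driven simulations, and report improvements over Tiresias and Optimus as well as cost savings in cloud environments.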
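
The goodput idea summarized above can be made concrete with a small sketch. This is an illustration only, not the Pollux or AdaptDL implementation. It assumes that statistical efficiency takes the form (phi + M0) / (phi + M), where phi is the gradient noise scale, M0 the job's original batch size, and M the total batch size, and that compute/synchronization overlap is captured by an exponent gamma. The class ThroughputParams, every function name, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ThroughputParams:
    """Hypothetical fitted parameters for one job on one fixed allocation."""
    alpha_grad: float  # fixed cost of one gradient pass (seconds)
    beta_grad: float   # extra cost per example in the per-GPU batch (seconds)
    t_sync: float      # gradient synchronization time (seconds)
    gamma: float       # overlap exponent: 1 = no overlap, large = near-full overlap


def iteration_time(p, per_gpu_batch, accum_steps):
    """Time for one optimizer step with `accum_steps` extra accumulation passes."""
    t_grad = p.alpha_grad + p.beta_grad * per_gpu_batch
    # Only the final pass is synchronized, so only it can overlap with t_sync.
    overlapped = (t_grad ** p.gamma + p.t_sync ** p.gamma) ** (1.0 / p.gamma)
    return accum_steps * t_grad + overlapped


def total_batch_size(num_gpus, per_gpu_batch, accum_steps):
    return num_gpus * per_gpu_batch * (accum_steps + 1)


def throughput(p, num_gpus, per_gpu_batch, accum_steps):
    """Examples processed per second."""
    batch = total_batch_size(num_gpus, per_gpu_batch, accum_steps)
    return batch / iteration_time(p, per_gpu_batch, accum_steps)


def efficiency(noise_scale, base_batch, total_batch):
    """Assumed form: progress per example relative to training at base_batch."""
    return (noise_scale + base_batch) / (noise_scale + total_batch)


def goodput(p, num_gpus, per_gpu_batch, accum_steps, noise_scale, base_batch):
    """Goodput = throughput x statistical efficiency."""
    batch = total_batch_size(num_gpus, per_gpu_batch, accum_steps)
    return (throughput(p, num_gpus, per_gpu_batch, accum_steps)
            * efficiency(noise_scale, base_batch, batch))


def best_config(p, num_gpus, noise_scale, base_batch, batch_options, accum_options):
    """Pick the (per_gpu_batch, accum_steps) pair with the highest goodput.

    Only total batch sizes at or above base_batch are considered; raises
    ValueError if no candidate qualifies.
    """
    candidates = [
        (m, s) for m, s in product(batch_options, accum_options)
        if total_batch_size(num_gpus, m, s) >= base_batch
    ]
    return max(
        candidates,
        key=lambda ms: goodput(p, num_gpus, ms[0], ms[1], noise_scale, base_batch),
    )


if __name__ == "__main__":
    params = ThroughputParams(alpha_grad=0.1, beta_grad=0.01, t_sync=0.2, gamma=2.0)
    choice = best_config(params, num_gpus=4, noise_scale=1000.0, base_batch=128,
                         batch_options=[32, 64, 128], accum_options=[0, 1])
    print("best (per_gpu_batch, accum_steps):", choice)
```

Under these assumptions, a larger batch raises throughput but lowers efficiency, so the best configuration shifts as the noise scale or the allocation changes. That trade-off is what a per-job agent would re-evaluate whenever the scheduler changes its resources.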

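Experimental Results

Research Questions

RQ1: How can goodput, as a combined measure of throughput and efficiency, be defined and predicted for DL training under varying resource allocations?
RQ2: Can a co-adaptive architecture that tunes each job's training parameters and reallocates cluster resources outperform existing DL schedulers in completion time and fairness?
RQ3: How accurately can Pollux model the throughput and efficiency of DL jobs across different batch sizes and resource configurations?
RQ4: How do gradient accumulation and learning-rate scaling affect goodput and overall cluster performance?
RQ5: What cost savings could goodput-based autoscaling deliver in cloud environments?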

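Key Results

Pollux reduces average job completion time by 37-50% relative to state-of-the-art schedulers, even when those baselines are given ideally tuned configurations.
In experiments, Pollux reduces average completion time by up to 73% relative to Tiresias and Optimus.
Pollux improves finish-time fairness by 1.5× to 5.4×.
Pollux dynamically reallocates resources according to its measured goodput models and demonstrates improved cluster-wide goodput.
A cloud case study suggests that goodput-driven autoscaling could cut the cost of large-model training by up to 25%.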


This review was generated by AI and reviewed by a human editor.