QUICK REVIEW

[Paper Review] OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

Jinhui Yuan, Xinqi Li | arXiv (Cornell University) | October 28, 2021

Advanced Neural Network Applications | References: 40 | Citations: 34

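One-line summary

OneFlow introduces the SBP (split, broadcast, partial-value) abstraction and an actor-model runtime to support data, model, and pipeline parallelism flexibly in distributed deep learning, making execution simpler and more efficient than in existing frameworks.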

ABSTRACT

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning. We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks. The code of OneFlow is available at: https://github.com/Oneflow-Inc/oneflow.

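Motivation and objectives

Motivate the need for a general, simpler distributed DL framework that can support diverse parallelism strategies automatically.
Propose SBP (split, broadcast, partial-value) as a unified abstraction for tensor and operator parallelism.
Introduce an actor-model runtime with explicit resource-dependency handling for stable distributed execution.
Provide a compiler that transforms the logical graph into a physical graph with SBP-driven parallelism.
Demonstrate general applicability and efficiency through extensive experiments against state-of-the-art systems.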

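Proposed method

Defines SBP as a multi-dimensional mapping from a global tensor to local tensors across devices and nodes (S, B, and P signatures).
Infers SBP signatures for operator inputs and outputs, which determines the parallelism in use (for example, data parallelism versus model parallelism).
Introduces boxing ops that convert between SBP signatures, so data can be routed between operators whose signatures differ.
Adopts the actor model at runtime, turning each operator into an actor with explicit input/output registers and a message-based dependency mechanism.
Implements explicit resource-dependency counters (in, out, and reference counters) that enable compile-time planning and runtime back pressure.
Uses a unified actor-based message bus to route both inter-node and intra-node communication (inter-node data transfer is pull-based).
Provides a programming interface that keeps the single-device and distributed APIs consistent and relies on placement/SBP annotations rather than low-level communication primitives.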
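
To make the S, B, and P signatures and the role of boxing more concrete, here is a minimal sketch that simulates them on nested Python lists. It is not the OneFlow API: the helper names (`split`, `broadcast`, `gather`, `reduce_partial`) and the two-device setup are assumptions made only for illustration. It shows that one logical matrix multiplication gives the same global result under three different SBP signatures.

```python
# Illustrative simulation of SBP on plain Python lists; NOT the OneFlow API.
# All helper names below are hypothetical.

def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split(t, axis, n):
    """S(axis): cut the global tensor into n equal slices along `axis`."""
    if axis == 0:
        step = len(t) // n
        return [t[i * step:(i + 1) * step] for i in range(n)]
    step = len(t[0]) // n
    return [[row[i * step:(i + 1) * step] for row in t] for i in range(n)]

def broadcast(t, n):
    """B: every device holds a full copy of the global tensor."""
    return [[row[:] for row in t] for _ in range(n)]

def gather(parts, axis):
    """Boxing S(axis) -> B: concatenate the local slices (an all-gather)."""
    if axis == 0:
        return [row for p in parts for row in p]
    return [sum((p[r] for p in parts), []) for r in range(len(parts[0]))]

def reduce_partial(parts):
    """Boxing P(sum) -> B: elementwise sum of the local tensors (an all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]

N = 2  # number of simulated devices
X = [[1, 2, 3, 4], [5, 6, 7, 8]]       # global input, shape 2x4
W = [[1, 0], [0, 1], [1, 1], [2, 0]]   # global weight, shape 4x2
expected = matmul(X, W)                # single-device reference

# Data parallelism: X is S(0), W is B, so Y is S(0).
y_data = gather([matmul(x, w) for x, w in zip(split(X, 0, N), broadcast(W, N))], 0)

# Model parallelism: X is B, W is S(1), so Y is S(1).
y_model = gather([matmul(x, w) for x, w in zip(broadcast(X, N), split(W, 1, N))], 1)

# Split on the contraction axis: X is S(1), W is S(0), so Y is P(sum).
y_partial = reduce_partial([matmul(x, w) for x, w in zip(split(X, 1, N), split(W, 0, N))])
```

In each case the local results only become the global tensor after a conversion step (`gather` or `reduce_partial`), which is the kind of work the review attributes to boxing ops. A real system would run these conversions as collective communication across devices rather than as in-process list operations.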

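Experimental results

Research questions

RQ1: Can SBP provide a unified, flexible abstraction for expressing data, model, and pipeline parallelism across heterogeneous hardware?
RQ2: Can an actor-based runtime handle complex dependencies and resource constraints more robustly than traditional schedulers?
RQ3: Can a compiler and runtime designed from scratch achieve performance competitive with customized libraries built on top of existing frameworks?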

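Key results

OneFlow matches or slightly outperforms leading customized libraries built on state-of-the-art frameworks in representative large-model scenarios.
SBP, together with a compiler that generates the physical graph automatically, makes hybrid (data + model) parallelism easier to program than in existing frameworks.
The actor runtime, with explicit resource counters and back pressure, supports stable execution and natural pipelining across devices and nodes.
Data loading and pipelining reach near-ideal throughput on both synthetic and real data without an extra plugin such as DALI.
In the experiments, OneFlow outperforms the official TensorFlow, PyTorch, and MXNet implementations on data-parallel ResNet and BERT in both FP32 and FP16.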
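
The back pressure that the review attributes to the actor runtime can be illustrated with a small simulation. This is a simplified sketch under assumptions, not OneFlow's runtime: the `Actor` class, the `run` function, and the tick-based scheduling are hypothetical, only an "in" and an "out" counter are modeled (the reference counter is omitted), and a register is returned to its producer as soon as the consumer reads it.

```python
from collections import deque

class Actor:
    """Hypothetical actor: it fires only when an input is readable AND an
    output register is free (simplified in/out counters)."""

    def __init__(self, name, out_registers, work_items=0):
        self.name = name
        self.free_out = out_registers  # "out counter": free output registers
        self.inbox = deque()           # "in counter" is len(inbox)
        self.pending = work_items      # only meaningful for the source actor
        self.upstream = None
        self.downstream = None
        self.done = []

    def ready(self):
        has_input = self.pending > 0 if self.upstream is None else bool(self.inbox)
        has_output = self.downstream is None or self.free_out > 0
        return has_input and has_output

    def act(self):
        if self.upstream is None:
            self.pending -= 1
            item = len(self.done)
        else:
            item = self.inbox.popleft()
            self.upstream.free_out += 1         # "ack": register goes back to the producer
        self.done.append(item)
        if self.downstream is not None:
            self.free_out -= 1                  # occupy one output register
            self.downstream.inbox.append(item)  # "req": tell the consumer data is ready

def run(num_items=5, registers=2, sink_period=3):
    """Fast producer, slow consumer (acts every `sink_period` ticks)."""
    src = Actor("producer", registers, work_items=num_items)
    sink = Actor("consumer", 0)
    src.downstream, sink.upstream = sink, src
    max_in_flight, tick = 0, 0
    while len(sink.done) < num_items:
        tick += 1
        if src.ready():
            src.act()
        if tick % sink_period == 0 and sink.ready():
            sink.act()
        max_in_flight = max(max_in_flight, len(sink.inbox))
    return sink.done, max_in_flight

done, peak = run()
```

With two output registers the producer stalls whenever both are occupied, so the number of in-flight items never exceeds the register quota regardless of how slow the consumer is. Raising `registers` lets the producer run further ahead, which is the trade-off between memory use and pipelining that a fixed register quota makes explicit.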


This review was generated by AI and reviewed by a human editor.