QUICK REVIEW

[Paper Review] Device Placement Optimization with Reinforcement Learning

Azalia Mirhoseini, Hieu Pham|arXiv (Cornell University)|2017. 06. 13.

Industrial Vision Systems and Defect Detection | References: 47 | Citations: 220

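One-Line Summary

This paper learns to optimize the TensorFlow device placement of neural networks with a sequence-to-sequence policy trained by REINFORCE, and finds placements that run faster than hand-crafted heuristics and the Scotch baseline.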

ABSTRACT

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM, for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.

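Motivation and Goals

Reduce training and inference cost through better device placement on heterogeneous hardware.
Propose a learned strategy for placing graph operations on devices so that execution time is minimized.
Demonstrate improvements over human-designed placements and traditional graph-partitioning methods on several models.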

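Proposed Method

Formulates device placement as a discrete optimization problem over a policy π(P|G;θ) that assigns the operations of a TensorFlow graph G to devices.
Uses an attention-based sequence-to-sequence model that predicts a device for each operation in the graph.
Trains the policy with the REINFORCE policy gradient and a moving-average baseline, using R(P)=sqrt(r(P)) as the reward signal, where r(P) is the measured execution time of placement P.
Merges operations into co-location groups to shorten the operation sequence and keep large graphs manageable.
Samples and evaluates placements with asynchronous distributed training across multiple controllers and workers.
Runs each placement on real hardware, measures its execution time, and uses that measurement as the reward.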
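
The training loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention-based seq2seq controller is replaced by independent per-operation softmax logits, and the hardware measurement is replaced by a hypothetical cost model (`simulated_runtime`, which treats the operations as a chain and charges a fixed cost per cross-device edge). All function names, parameters, and default values are illustrative.

```python
import numpy as np


def simulated_runtime(placement, op_cost, comm_cost, num_devices):
    """Hypothetical stand-in for timing a step on real hardware.

    Ops are assumed to form a chain: runtime is the busiest device's
    load plus a fixed penalty for every edge that crosses devices.
    """
    load = np.bincount(placement, weights=op_cost, minlength=num_devices)
    cuts = np.count_nonzero(placement[1:] != placement[:-1])
    return load.max() + comm_cost * cuts


def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_placement(op_cost, num_devices=2, comm_cost=0.5,
                    steps=2000, lr=0.1, decay=0.9, seed=0):
    """REINFORCE with R(P) = sqrt(r(P)) and a moving-average baseline."""
    rng = np.random.default_rng(seed)
    op_cost = np.asarray(op_cost, dtype=float)
    n = len(op_cost)
    logits = np.zeros((n, num_devices))  # simplified policy parameters
    baseline = None
    best_time, best_placement = np.inf, None

    for _ in range(steps):
        probs = softmax(logits)
        # Sample one device per operation from the current policy.
        placement = np.array([rng.choice(num_devices, p=p) for p in probs])

        runtime = simulated_runtime(placement, op_cost, comm_cost, num_devices)
        reward = np.sqrt(runtime)  # lower is better
        if baseline is None:
            baseline = reward

        # Gradient of log pi(P) w.r.t. the logits: one_hot(choice) - probs.
        grad_logp = -probs
        grad_logp[np.arange(n), placement] += 1.0

        # Descend on the expected reward, since the reward is a running time.
        logits -= lr * (reward - baseline) * grad_logp
        baseline = decay * baseline + (1.0 - decay) * reward

        if runtime < best_time:
            best_time, best_placement = runtime, placement.copy()

    return best_placement, best_time
```

For example, `train_placement([1.0] * 8)` searches placements of eight equal-cost operations on two devices. Subtracting the moving-average baseline leaves the gradient estimate unbiased while reducing its variance, which is the usual reason for including one.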

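Experimental Results

Research Questions

RQ1: Does the learned policy outperform hand-designed placements and the Scotch-based baseline on TensorFlow graphs?
RQ2: What computation/communication trade-offs do the learned placements exploit across models (Inception-V3, NMT, RNNLM)?
RQ3: How do end-to-end training time and per-step latency under RL-based placement compare with expert-designed placements?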

Key Results

Running time in seconds (lower is better).
Model	Single CPU	Single GPU	#GPUs	Scotch	MinCut	Expert	RL-based	Speedup
RNNLM	6.89	1.57	2	13.43	11.94	3.81	1.57	0.0%
NMT	10.72	OOM	2	14.19	11.54	4.99	4.04	23.5%
Inception-V3	26.21	4.60	2	25.24	22.88	11.22	4.60	0.0%
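
The Speedup column can be reproduced from the other columns. The sketch below assumes that speedup is the reduction in running time relative to the fastest non-RL entry in the row, divided by the RL-based time; this definition is inferred from the numbers in the table and is not stated in this review. The variable and function names are illustrative.

```python
# Running times copied from the table rows above (assumed to be seconds).
# The OOM entry for NMT on a single GPU is omitted from its baselines.
rows = {
    "RNNLM": {"baselines": [6.89, 1.57, 13.43, 11.94, 3.81], "rl": 1.57},
    "NMT": {"baselines": [10.72, 14.19, 11.54, 4.99], "rl": 4.04},
    "Inception-V3": {"baselines": [26.21, 4.60, 25.24, 22.88, 11.22], "rl": 4.60},
}


def speedup_percent(baselines, rl_time):
    """Improvement of the RL-based time over the fastest baseline, in percent."""
    fastest = min(baselines)
    return 100.0 * (fastest - rl_time) / rl_time


speedups = {
    name: round(speedup_percent(row["baselines"], row["rl"]), 1)
    for name, row in rows.items()
}
```

Under this assumption, the NMT entry is (4.99 - 4.04) / 4.04, which rounds to 23.5%. RNNLM and Inception-V3 come out at 0.0% because the RL-based time equals the single-GPU time in those rows.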

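The RL approach finds non-trivial placements that outperform hand-designed placements and the Scotch baseline on several models.
Single-step execution with RL placements is up to 3.5x faster than the baselines.
End-to-end training with RL-based placement is up to about 28% faster on NMT and about 20% faster on Inception-V3 than with expert-designed placements.
For NMT, the RL placement balances the computational load more evenly across devices, which reduces bottlenecks during backpropagation.
For Inception-V3, the RL placement co-locates parameters with the operations that consume them, which reduces cross-device data copies and yields faster per-step times in the multi-GPU setting.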

This review was generated by AI and reviewed by a human editor.