QUICK REVIEW

[Paper Review] Device Placement Optimization with Reinforcement Learning

Azalia Mirhoseini, Hieu Pham|arXiv (Cornell University)|Jun 13, 2017

Topic: Industrial Vision Systems and Defect Detection | References: 47 | Cited by: 220

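One-sentence summary

The paper learns a sequence-to-sequence policy, trained with REINFORCE, that assigns the operations of a TensorFlow computational graph to devices; the resulting placements run faster than hand-crafted heuristics and the Scotch baseline.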

ABSTRACT

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM, for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.

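Motivation and objectives

Reduce training and inference cost through better device placement on heterogeneous hardware.
Propose a learned policy that assigns the operations in a graph to devices so as to minimize execution time.
Demonstrate improvements over human-designed placements and traditional graph-partitioning methods on several models, including Inception-V3, NMT, and RNNLM.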

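Proposed method

Formulate device placement as a discrete optimization over the operations of a TF graph G, using a policy π(P|G;θ) over placements P.
Use an attentional seq2seq model to predict a device for each operation in the graph.
Train with policy gradients (REINFORCE), using R(P)=sqrt(r(P)) as the reward signal, where r(P) is the measured running time of placement P, together with a moving-average baseline.
Introduce co-location groups to shorten the input sequence and keep large graphs manageable.
Implement asynchronous distributed training, with multiple controllers and workers that sample and evaluate placements.
Measure running time by executing each placement on real hardware, and use those times as the reward.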
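The training loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the attentional seq2seq policy is replaced by independent per-group softmax logits, and the hardware measurement is replaced by a hypothetical `simulated_runtime` function with made-up compute and communication costs. Only the reward shape R(P)=sqrt(r(P)), the moving-average baseline, and the REINFORCE update come from the description above.

```python
import math
import random

# Hypothetical toy problem (not from the paper): 4 operation groups, 2 devices.
COMPUTE_COST = [4.0, 3.0, 2.0, 1.0]  # made-up compute time per group
EDGES = [(0, 1), (1, 2), (2, 3)]     # made-up data dependencies between groups
COMM_COST = 0.5                      # made-up cost of each cross-device edge


def simulated_runtime(placement, num_devices=2):
    """Stand-in for r(P); the paper measures this on real hardware instead."""
    load = [0.0] * num_devices
    for group, device in enumerate(placement):
        load[device] += COMPUTE_COST[group]
    transfers = sum(1 for a, b in EDGES if placement[a] != placement[b])
    return max(load) + COMM_COST * transfers


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, samples=4, lr=0.1, decay=0.9, num_devices=2, seed=0):
    """REINFORCE with a moving-average baseline on the toy problem."""
    rng = random.Random(seed)
    # Assumption: one independent logit row per group replaces the seq2seq model.
    theta = [[0.0] * num_devices for _ in COMPUTE_COST]
    baseline = None
    for _ in range(steps):
        probs = [softmax(row) for row in theta]
        batch = []
        for _ in range(samples):
            placement = [rng.choices(range(num_devices), weights=p)[0]
                         for p in probs]
            # R(P) = sqrt(r(P)); lower is better.
            reward = math.sqrt(simulated_runtime(placement, num_devices))
            batch.append((placement, reward))
        mean_reward = sum(r for _, r in batch) / samples
        if baseline is None:
            baseline = mean_reward
        # R is minimized, so step against (R - B) * grad log pi(P).
        for placement, reward in batch:
            advantage = reward - baseline
            for g, d in enumerate(placement):
                for k in range(num_devices):
                    grad_logp = (1.0 if k == d else 0.0) - probs[g][k]
                    theta[g][k] -= lr * advantage * grad_logp / samples
        # Update the moving-average baseline B.
        baseline = decay * baseline + (1.0 - decay) * mean_reward
    # Return the most likely device for each group.
    return [max(range(num_devices), key=lambda k: row[k]) for row in theta]


if __name__ == "__main__":
    learned = train()
    print(learned, simulated_runtime(learned))
```

In this toy cost model, putting every group on one device is the slowest option, and the fastest placements split the load while paying for some cross-device edges. That is a small-scale version of the computation versus communication trade-off that the real reward signal captures.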

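Experimental results

Research questions

RQ1: Can the learned policy outperform hand-crafted and Scotch-based baselines for device placement on TF graphs?
RQ2: What trade-off between computation and communication do the learned placements make across models such as Inception-V3, NMT, and RNNLM?
RQ3: How do end-to-end training time and per-step latency change with RL-based placements compared with expert-designed placements?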

Key findings

Per-step running time in seconds (lower is better):

Model	Single CPU	Single GPU	#GPUs	Scotch	MinCut	Expert	RL-based	Speedup
RNNLM	6.89	1.57	2	13.43	11.94	3.81	1.57	0.0%
NMT	10.72	OOM	2	14.19	11.54	4.99	4.04	23.5%
Inception-V3	26.21	4.60	2	25.24	22.88	11.22	4.60	0.0%
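The speedup column is not defined in this summary. One reading that reproduces all three values is the reduction relative to the fastest baseline, divided by the RL-based time. The sketch below checks that assumed definition against the table; the dictionary keys and the `speedup_percent` helper are illustrative names, not anything from the paper.

```python
# Per-step running times copied from the table above.
# None marks the OOM entry for NMT on a single GPU.
ROWS = {
    "RNNLM": {"Single CPU": 6.89, "Single GPU": 1.57, "Scotch": 13.43,
              "MinCut": 11.94, "Expert": 3.81, "RL-based": 1.57},
    "NMT": {"Single CPU": 10.72, "Single GPU": None, "Scotch": 14.19,
            "MinCut": 11.54, "Expert": 4.99, "RL-based": 4.04},
    "Inception-V3": {"Single CPU": 26.21, "Single GPU": 4.60, "Scotch": 25.24,
                     "MinCut": 22.88, "Expert": 11.22, "RL-based": 4.60},
}


def speedup_percent(row):
    """Assumed definition: (fastest baseline - RL-based) / RL-based, in percent."""
    baselines = [t for name, t in row.items()
                 if name != "RL-based" and t is not None]
    fastest = min(baselines)
    return 100.0 * (fastest - row["RL-based"]) / row["RL-based"]


if __name__ == "__main__":
    for model, row in ROWS.items():
        print(model, round(speedup_percent(row), 1))
```

Under this reading, the 0.0% entries for RNNLM and Inception-V3 mean that the RL-based placement matches the single-GPU time on two GPUs, while for NMT it beats the expert-designed placement (4.04 vs 4.99).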

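RL-based placement finds non-trivial configurations on several models that outperform the hand-crafted baselines and Scotch.
Single-step running time with RL-based placements is up to 3.5x faster than the baselines.
End-to-end training with RL-based placements is about 28% faster than the expert-designed placement on NMT and about 20% faster on Inception-V3.
For NMT, the RL-based placement balances the computational load across devices better than the expert placement, reducing bottlenecks during backpropagation.
For Inception-V3, the RL-based placement puts parameters on the same device as their consumers, which reduces cross-device data copies and yields a faster per-step time in multi-GPU settings.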


This review was generated by AI and reviewed by a human editor.