QUICK REVIEW

[Paper Review] Tune: A Research Platform for Distributed Model Selection and Training

Richard Liaw, Eric Liang | arXiv (Cornell University) | Jul 13, 2018

Machine Learning and Data Classification | References: 11 | Cited by: 615

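One-Sentence Summary

Tune provides a unified, open-source API and scheduling framework for distributed model selection and training, making it straightforward to integrate a range of hyperparameter search algorithms on top of Ray. It decouples training scripts from search logic so that experiments can scale across a cluster.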
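The decoupling described above can be illustrated with a minimal sketch in plain Python. This is not Tune's API: the names `train_script`, `random_search`, and the `reporter` callback are hypothetical, and trials run one after another here rather than on a cluster. The point is only that the training code never sees how its configuration was chosen, and the search code never sees what the training code computes.

```python
import random
from typing import Callable, Dict, List

# Hypothetical callback type: the training script calls it with intermediate metrics.
Reporter = Callable[..., None]


def train_script(config: Dict[str, float], reporter: Reporter) -> None:
    """Toy training loop; it does not know how `config` was selected."""
    loss = 1.0
    for step in range(1, 6):
        loss *= 1.0 - config["lr"]  # toy dynamics: a larger lr shrinks the loss faster
        reporter(step=step, loss=loss)


def random_search(trainable, num_trials: int, seed: int = 0) -> List[dict]:
    """Toy search logic; it does not know what `trainable` computes."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        config = {"lr": rng.uniform(0.01, 0.5)}
        history: List[dict] = []
        # The reporter is the only channel between the script and the search.
        trainable(config, lambda **metrics: history.append(metrics))
        results.append({"config": config, "final_loss": history[-1]["loss"]})
    return results


results = random_search(train_script, num_trials=4)
best = min(results, key=lambda r: r["final_loss"])
print(best)
```

Swapping `random_search` for another strategy would not require touching `train_script`, which is the property the summary attributes to Tune's interface.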

ABSTRACT

Modern machine learning algorithms are increasingly computationally demanding, requiring specialized hardware and distributed computation to achieve high performance in a reasonable time frame. Many hyperparameter search algorithms have been proposed for improving the efficiency of model selection, however their adaptation to the distributed compute environment is often ad-hoc. We propose Tune, a unified framework for model selection and training that provides a narrow-waist interface between training scripts and search algorithms. We show that this interface meets the requirements for a broad range of hyperparameter search algorithms, allows straightforward scaling of search to large clusters, and simplifies algorithm implementation. We demonstrate the implementation of several state-of-the-art hyperparameter search algorithms in Tune. Tune is available at http://ray.readthedocs.io/en/latest/tune.html.

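Motivation and Goals

Motivate the need for scalable, reproducible distributed model selection and training.
Introduce Tune as a narrow-waist API between training scripts and hyperparameter search algorithms.
Show that Tune supports a broad range of search strategies and integrates easily across frameworks.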

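Proposed Method

Proposes two API designs: a user API for training scripts and a scheduling API for search algorithms.
Lets training code interact with Tune during training through either a cooperative control model or direct control of trials.
Provides a TrialScheduler interface, with on_result and choose_trial_to_run, for managing parallel trials.
Builds on the Ray framework to handle distributed execution, resource management, and data handling across trials.
Implements and integrates several state-of-the-art hyperparameter search algorithms (e.g., HyperBand variants, Median Stopping Rule, HyperOpt, Population-Based Training).
Provides a minimal example and a DSL for defining initial trial configurations.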
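The scheduling API above can be sketched as follows. Only the names TrialScheduler, on_result, and choose_trial_to_run come from the summary; the method signatures, the `Trial` record, the `CONTINUE`/`STOP` values, and the `run` driver are assumptions made for illustration. The early-stopping logic is a simplified take on the median stopping idea, not Tune's implementation, and trials run sequentially here instead of in parallel.

```python
from statistics import median
from typing import Dict, List, Optional

CONTINUE, STOP = "CONTINUE", "STOP"  # hypothetical decision values


class Trial:
    """Hypothetical record for one hyperparameter configuration."""

    def __init__(self, name: str):
        self.name = name
        self.status = "PENDING"
        self.losses: List[float] = []


class TrialScheduler:
    """Method names follow the summary; the signatures are assumed."""

    def on_result(self, trial: Trial, result: Dict[str, float]) -> str:
        raise NotImplementedError

    def choose_trial_to_run(self, trials: List[Trial]) -> Optional[Trial]:
        raise NotImplementedError


class MedianStoppingSketch(TrialScheduler):
    """Stop a trial whose best loss so far is worse than the median of the
    other trials' average loss over the same number of steps."""

    def __init__(self, grace_steps: int = 2):
        self.grace_steps = grace_steps
        self.seen: List[Trial] = []

    def on_result(self, trial, result):
        trial.losses.append(result["loss"])
        if trial not in self.seen:
            self.seen.append(trial)
        step = len(trial.losses)
        if step < self.grace_steps:
            return CONTINUE
        # Running averages of every other trial that has reached this step.
        others = [sum(t.losses[:step]) / step for t in self.seen
                  if t is not trial and len(t.losses) >= step]
        if others and min(trial.losses) > median(others):
            return STOP
        return CONTINUE

    def choose_trial_to_run(self, trials):
        # Simplest possible policy: first pending trial, or None when done.
        return next((t for t in trials if t.status == "PENDING"), None)


def run(scheduler: TrialScheduler, trials: List[Trial],
        curves: Dict[str, List[float]]) -> None:
    """Hypothetical sequential driver that replays fixed loss curves."""
    while (trial := scheduler.choose_trial_to_run(trials)) is not None:
        trial.status = "RUNNING"
        for loss in curves[trial.name]:
            if scheduler.on_result(trial, {"loss": loss}) == STOP:
                trial.status = "STOPPED"
                break
        else:
            trial.status = "DONE"


curves = {
    "a": [0.9, 0.6, 0.4, 0.3],
    "b": [0.8, 0.5, 0.3, 0.2],
    "c": [1.0, 0.95, 0.9, 0.9],
}
trials = [Trial(name) for name in curves]
run(MedianStoppingSketch(), trials, curves)
print({t.name: t.status for t in trials})
```

With these made-up curves, trial "c" is cut off at its second result because its best loss is above the median of the earlier trials' averages. A different algorithm would only need to override the same two methods.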

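Experimental Results

Research Questions

RQ1: Can Tune support a broad range of hyperparameter optimization algorithms through a single common API?
RQ2: Does the Ray-based implementation enable scalable distributed execution of many concurrent trials?
RQ3: Can intermediate trial results be used effectively for dynamic scheduling decisions and hyperparameter adaptation?
RQ4: Is the user experience simple enough to integrate easily into existing training scripts while preserving reproducibility?
RQ5: How does Tune facilitate reproduction, visualization, and comparison of AutoML experiments across different schedulers?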

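Key Findings

Tune provides narrow-waist user and scheduling APIs that make it easy to integrate different hyperparameter search algorithms.
The framework supports irregular and heterogeneous trial workloads, as well as scheduling decisions driven by intermediate results.
Several algorithms are implemented or integrated in Tune, including asynchronous HyperBand, HyperBand, the Median Stopping Rule, HyperOpt, and Population-Based Training.
Trials run as Ray tasks or actors, with Ray handling resource management and data handling, which enables nested distributed computation.
Tune provides a minimal example and a DSL for grid/search configuration, and logs progress through the console and TensorBoard integration.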
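To make the idea of a grid configuration DSL concrete, here is a small self-contained sketch of how a specification might be expanded into individual trial configurations. The `grid_search` marker and `expand` function below are hypothetical stand-ins written for this illustration, not Tune's own code, and only flat (non-nested) specifications are handled.

```python
import itertools
from typing import Any, Dict, Iterator, List


def grid_search(values: List[Any]) -> Dict[str, List[Any]]:
    """Hypothetical marker: wrap a list of values to be tried exhaustively."""
    return {"grid_search": list(values)}


def expand(spec: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one concrete config per combination of grid values."""
    grid_keys = [k for k, v in spec.items()
                 if isinstance(v, dict) and set(v) == {"grid_search"}]
    grids = [spec[k]["grid_search"] for k in grid_keys]
    for combo in itertools.product(*grids):
        # Start from the constant entries, then fill in this combination.
        config = {k: v for k, v in spec.items() if k not in grid_keys}
        config.update(zip(grid_keys, combo))
        yield config


spec = {
    "lr": grid_search([0.01, 0.1]),
    "momentum": grid_search([0.8, 0.9]),
    "batch_size": 32,  # constant shared by every trial
}
configs = list(expand(spec))
for config in configs:
    print(config)
```

Two grid axes with two values each yield four trial configurations, each of which could then be handed to a training script as its initial configuration.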

This review was generated by AI and reviewed by a human editor.