[Paper Review] Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning

Jian Wu, Saul Toscano-Palmerin|arXiv (Cornell University)|Mar 12, 2019
Advanced Multi-Objective Optimization Algorithms | Cited by 56
One-Sentence Summary

Introduces taKG and taKG_empty, trace-aware knowledge-gradient acquisition functions for multi-fidelity Bayesian optimization in hyperparameter tuning, leveraging trace observations and multiple continuous fidelity controls to improve efficiency.

ABSTRACT

Bayesian optimization is popular for optimizing time-consuming black-box objectives. Nonetheless, for hyperparameter tuning in deep neural networks, the time required to evaluate the validation error for even a few hyperparameter settings remains a bottleneck. Multi-fidelity optimization promises relief using cheaper proxies to such objectives --- for example, validation error for a network trained using a subset of the training points or fewer iterations than required for convergence. We propose a highly flexible and practical approach to multi-fidelity Bayesian optimization, focused on efficiently optimizing hyperparameters for iteratively trained supervised learning models. We introduce a new acquisition function, the trace-aware knowledge-gradient, which efficiently leverages both multiple continuous fidelity controls and trace observations --- values of the objective at a sequence of fidelities, available when varying fidelity using training iterations. We provide a provably convergent method for optimizing our acquisition function and show it outperforms state-of-the-art alternatives for hyperparameter tuning of deep neural networks and large-scale kernel learning.
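
The notion of trace observations from the abstract can be illustrated with a minimal sketch. The `validation_error` curve and the `train_with_trace` helper below are hypothetical stand-ins, not code from the paper; they only show that a single training run up to some iteration count also yields the objective at every earlier checkpoint at no extra cost.

```python
import math

def validation_error(learning_rate: float, iteration: int) -> float:
    """Hypothetical validation-error curve; a stand-in for real training."""
    # The error floor is lowest at learning_rate = 1e-2 in this toy model.
    floor = 0.1 + 0.05 * (math.log10(learning_rate) + 2.0) ** 2
    # The error decays toward the floor as training proceeds.
    return floor + 0.8 * math.exp(-iteration / 20.0)

def train_with_trace(learning_rate: float, max_iterations: int, every: int = 10):
    """Run one training job and record the error at every checkpoint on the way."""
    trace = []
    for iteration in range(every, max_iterations + 1, every):
        trace.append((iteration, validation_error(learning_rate, iteration)))
    return trace

# One run to 50 iterations also yields observations at 10, 20, 30 and 40.
trace = train_with_trace(learning_rate=0.01, max_iterations=50)
for iteration, error in trace:
    print(f"iterations={iteration:3d}  validation_error={error:.3f}")
```

A fidelity such as training-set size does not behave this way: evaluating at one subset size says nothing directly about smaller sizes, which is why the summary distinguishes trace from non-trace fidelities.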

Motivation and Objectives

  • Aim to reduce the computational burden of hyperparameter tuning by using cheaper low-fidelity evaluations.
  • Develop a flexible acquisition function that leverages trace information across training iterations and other fidelity controls.
  • Provide a provably convergent optimization method and demonstrate improvements over state-of-the-art baselines.
  • Offer variants that avoid overemphasizing very low fidelities and support batch and derivative-enabled settings.

Proposed Method

  • Define a multi-fidelity GP model g(x,s) with x as hyperparameters and s as fidelity controls, including trace fidelities and non-trace fidelities.
  • Introduce L_n, the expected post-observation loss, to quantify improvement from observing g at a set of fidelities S for a given x.
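  • Propose taKG, an acquisition function that selects the point x and fidelity set S (of limited cardinality) with the largest value of information per unit cost, where VOI_n(x,S) = L_n(∅) - L_n(x,S) and the acquisition value is VOI_n(x,S) divided by the evaluation cost.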
  • Provide taKG_empty as a zero-avoiding variant to mitigate sampling at near-zero fidelities where information value vanishes.
  • Develop an unbiased stochastic gradient estimator for the gradient of L_n and use multistart stochastic gradient ascent to optimize taKG and taKG_empty.
  • Describe warm-starting for trace fidelities, and a cost model via a separate GP to account for evaluation costs.
  • Extend to batch and derivative-enabled settings, and discuss efficient optimization despite lack of closed-form acquisition values.
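
The value-of-information-per-cost idea in the list above can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the product squared-exponential kernel, the constants `AMPLITUDE` and `NOISE_VAR`, the hand-picked mean `mu`, the `cost` function, and the restriction to a single fidelity per evaluation (|S| = 1). The prior covariance is also used in place of a fitted posterior. The sketch estimates L_n(∅) - L_n(x, {s}) by Monte Carlo and divides by cost; it is not the authors' implementation.

```python
import math
import random

AMPLITUDE = 0.01   # assumed signal variance of g
NOISE_VAR = 1e-4   # assumed observation noise variance

def rbf(a: float, b: float, length: float) -> float:
    """Squared-exponential kernel (an assumed choice for this sketch)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cov(x1: float, s1: float, x2: float, s2: float) -> float:
    """Product kernel over hyperparameter x and fidelity s."""
    return AMPLITUDE * rbf(x1, x2, 0.3) * rbf(s1, s2, 0.5)

def voi(mu, xs, x, s, zs):
    """Monte Carlo estimate of L_n(empty) - L_n(x, {s}) for one observation of g(x, s)."""
    obs_std = math.sqrt(cov(x, s, x, s) + NOISE_VAR)
    # Shift of the full-fidelity (s = 1) mean at each candidate per unit of
    # standardized surprise in the new observation.
    sigma = [cov(xp, 1.0, x, s) / obs_std for xp in xs]
    loss_now = min(mu)
    loss_after = sum(min(m + sg * z for m, sg in zip(mu, sigma)) for z in zs) / len(zs)
    return loss_now - loss_after

def cost(s: float) -> float:
    """Hypothetical cost model: fixed overhead plus time proportional to fidelity."""
    return 0.05 + s

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
mu = [0.30, 0.25, 0.28, 0.35, 0.40]   # assumed current mean of g(x, 1)
rng = random.Random(0)
half = [rng.gauss(0.0, 1.0) for _ in range(200)]
zs = half + [-z for z in half]        # antithetic pairs reduce Monte Carlo noise

# Acquisition value: value of information divided by evaluation cost.
ratios = {(x, s): voi(mu, xs, x, s, zs) / cost(s)
          for x in xs for s in (0.1, 0.25, 0.5, 1.0)}
best = max(ratios, key=ratios.get)
print("best (x, s):", best, "VOI per unit cost:", round(ratios[best], 5))
```

In this toy model a lower fidelity is less correlated with the full-fidelity target, so its value of information is smaller, but it is also cheaper. The ratio is what trades the two off. A grid search replaces the gradient-based optimizer only to keep the sketch short.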

Experimental Results

Research Questions

  • RQ1: How can trace information across training iterations be effectively incorporated into multi-fidelity Bayesian optimization for hyperparameter tuning?
  • RQ2: Can a provably convergent acquisition function be designed to efficiently balance information gain and cost across multiple continuous fidelities?
  • RQ3: How do warm-starting and cost modeling impact the practical performance of multi-fidelity Bayesian optimization for neural networks and kernel learning?
  • RQ4: Do taKG and taKG_empty outperform existing multi-fidelity and single-fidelity Bayesian optimization methods on neural networks and large-scale kernel learning?
  • RQ5: Can the methods accommodate batch evaluations and derivative information to further improve efficiency?

Key Findings

  • taKG and taKG_empty outperform state-of-the-art baselines such as FABOLAS, Hyperband, and BOCA in neural network hyperparameter tuning and large-scale kernel learning.
  • Using multiple fidelities and trace observations yields significant efficiency gains in sequential and batch settings.
  • The proposed stochastic-gradient-based optimization of the acquisition function converges to stationary points under suitable conditions.
  • The 0-avoiding variant taKG_empty mitigates excessive sampling at near-zero fidelities without requiring manual cost tuning.
  • The approach remains applicable to problems without trace observations while retaining strong performance when continuous fidelities are used.
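
The optimization scheme mentioned in the findings, multistart stochastic gradient ascent driven by an unbiased gradient estimator, can be sketched on a toy problem. The `objective`, the noise model in `noisy_gradient`, the step-size schedule, and the start points are all assumptions chosen for illustration; the real acquisition function has no closed form, so the final selection step would use a Monte Carlo estimate instead of the exact `objective` used here.

```python
import math
import random

def objective(x: float) -> float:
    """Toy multimodal objective standing in for an acquisition function."""
    return math.sin(3.0 * x) - 0.1 * x * x

def noisy_gradient(x: float, rng: random.Random) -> float:
    """Unbiased gradient estimate: the true derivative plus zero-mean noise."""
    return 3.0 * math.cos(3.0 * x) - 0.2 * x + rng.gauss(0.0, 1.0)

def stochastic_gradient_ascent(x0, rng, steps=2000, lower=-3.0, upper=3.0):
    """Ascend from x0 with decaying steps, projecting back onto [lower, upper]."""
    x = x0
    for t in range(steps):
        # Step sizes sum to infinity while their squares stay summable.
        step = 0.1 / (1.0 + t) ** 0.7
        x = x + step * noisy_gradient(x, rng)
        x = min(upper, max(lower, x))
    return x

rng = random.Random(0)
starts = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
finals = [stochastic_gradient_ascent(x0, rng) for x0 in starts]

# Each run only finds a stationary point near its start; keep the best one.
best_x = max(finals, key=objective)
print("best x:", round(best_x, 3), "objective:", round(objective(best_x), 3))
```

Individual runs settle at different local maxima depending on where they start, which is why several starts are combined: the convergence result concerns stationary points, not the global maximum.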


This review was generated by AI and checked by a human editor.