[Paper Summary] Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning
Introduces taKG and taKG_empty, trace-aware knowledge-gradient acquisition functions for multi-fidelity Bayesian optimization in hyperparameter tuning, leveraging trace observations and multiple continuous fidelity controls to improve efficiency.
Bayesian optimization is popular for optimizing time-consuming black-box objectives. Nonetheless, for hyperparameter tuning in deep neural networks, the time required to evaluate the validation error for even a few hyperparameter settings remains a bottleneck. Multi-fidelity optimization promises relief using cheaper proxies to such objectives, for example, validation error for a network trained using a subset of the training points or fewer iterations than required for convergence. We propose a highly flexible and practical approach to multi-fidelity Bayesian optimization, focused on efficiently optimizing hyperparameters for iteratively trained supervised learning models. We introduce a new acquisition function, the trace-aware knowledge-gradient, which efficiently leverages both multiple continuous fidelity controls and trace observations: values of the objective at a sequence of fidelities, available when varying fidelity using training iterations. We provide a provably convergent method for optimizing our acquisition function and show it outperforms state-of-the-art alternatives for hyperparameter tuning of deep neural networks and large-scale kernel learning.
Research Motivation and Objectives
- Aim to reduce the computational burden of hyperparameter tuning by using cheaper low-fidelity evaluations.
- Develop a flexible acquisition function that leverages trace information across training iterations and other fidelity controls.
- Provide a provably convergent optimization method and demonstrate improvements over state-of-the-art baselines.
- Offer variants that avoid overemphasizing very low fidelities and support batch and derivative-enabled settings.
Proposed Method
- Define a multi-fidelity GP model g(x,s) with x as hyperparameters and s as fidelity controls, including trace fidelities and non-trace fidelities.
- Introduce L_n, the expected post-observation loss, to quantify improvement from observing g at a set of fidelities S for a given x.
- Propose taKG, an acquisition function that maximizes the value of information per unit cost: the ratio VOI_n(x,S) / cost, where VOI_n(x,S) = L_n(empty) - L_n(x,S) and S has limited cardinality.
- Provide taKG_empty, a 0-avoiding variant that mitigates sampling at near-zero fidelities, where the information value vanishes.
- Develop an unbiased stochastic gradient estimator for the gradient of L_n and use multistart stochastic gradient ascent to optimize taKG and taKG_empty.
- Describe warm-starting for trace fidelities, and a cost model via a separate GP to account for evaluation costs.
- Extend to batch and derivative-enabled settings, and discuss efficient optimization despite lack of closed-form acquisition values.
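The quantities in the list above can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a finite set of candidate hyperparameters, a single trace fidelity s in (0, 1], a product RBF kernel, a hypothetical prior mean, and a cost equal to the largest fidelity in S (so lower trace fidelities come for free). It works with the GP prior (no data yet) and estimates L_n by Monte Carlo with antithetic samples, rather than using the stochastic gradient ascent described above. All function names and parameter values are illustrative.

```python
import numpy as np


def rbf(a, b, lengthscale):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def kernel(p, q):
    """Assumed product kernel over (x, s) pairs; rows of p and q are [x, s]."""
    return rbf(p[:, 0], q[:, 0], 0.3) * rbf(p[:, 1], q[:, 1], 0.5)


def prior_mean(p):
    """Hypothetical prior mean: validation error shrinks as fidelity s grows."""
    return 1.0 - 0.5 * p[:, 1]


def takg(x, S, candidates, noise_var=1e-2, n_pairs=500, seed=0):
    """Monte Carlo estimate of (VOI, VOI / cost) for observing g(x, s), s in S."""
    rng = np.random.default_rng(seed)
    obs = np.array([[x, s] for s in S])                  # points to observe
    tgt = np.column_stack([candidates, np.ones_like(candidates)])  # (x', 1)

    mu_obs, mu_tgt = prior_mean(obs), prior_mean(tgt)
    K_oo = kernel(obs, obs) + noise_var * np.eye(len(S))
    K_to = kernel(tgt, obs)

    # L_n(empty): best current mean at full fidelity, with no new observation.
    loss_empty = mu_tgt.min()

    # Simulate observations y ~ N(mu_obs, K_oo) with antithetic pairs (z, -z).
    chol = np.linalg.cholesky(K_oo)
    z = rng.standard_normal((n_pairs, len(S)))
    z = np.vstack([z, -z])
    y = mu_obs + z @ chol.T

    # Updated mean at every (x', 1) for each simulated observation.
    gain = np.linalg.solve(K_oo, K_to.T).T               # K_to @ inv(K_oo)
    post_mean = mu_tgt + (y - mu_obs) @ gain.T           # (2 * n_pairs, n_candidates)

    # L_n(x, S): expected best mean at full fidelity after the observation.
    loss_after = post_mean.min(axis=1).mean()

    voi = loss_empty - loss_after
    cost = max(S)  # assumption: training up to max(S) yields the whole trace
    return voi, voi / cost


candidates = np.linspace(0.0, 1.0, 5)
for S in ([0.25], [1.0], [0.25, 0.5, 1.0]):
    voi, ratio = takg(0.5, S, candidates)
    print(f"S={S}: VOI={voi:.4f}, VOI/cost={ratio:.4f}")
```

Comparing the printed ratios for a cheap low-fidelity observation against a full-fidelity trace shows the trade-off the acquisition function balances. In this toy setup the cost is a fixed function of fidelity; the method summarized above instead models cost with a separate GP.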
Experimental Results
Research Questions
- RQ1: How can trace information across training iterations be effectively incorporated into multi-fidelity Bayesian optimization for hyperparameter tuning?
- RQ2: Can a provably convergent acquisition function be designed to efficiently balance information gain and cost across multiple continuous fidelities?
- RQ3: How do warm-starting and cost modeling impact the practical performance of multi-fidelity Bayesian optimization for neural networks and kernel learning?
- RQ4: Do taKG and taKG_empty outperform existing multi-fidelity and single-fidelity Bayesian optimization methods on neural networks and large-scale kernel learning?
- RQ5: Can the methods accommodate batch evaluations and derivative information to further improve efficiency?
Key Findings
- taKG and taKG_empty demonstrate improved performance over state-of-the-art baselines such as FaBOLAS, Hyperband, and BOCA in neural network hyperparameter tuning and large-scale kernel learning.
- Using multiple fidelities and trace observations yields significant efficiency gains in sequential and batch settings.
- The proposed stochastic-gradient-based optimization of the acquisition function converges to stationary points under suitable conditions.
- The 0-avoiding variant taKG_empty mitigates excessive sampling at near-zero fidelities without requiring manual cost tuning.
- The approach remains applicable to problems without trace observations while retaining strong performance when continuous fidelities are used.
This summary was generated by AI and reviewed by a human editor.