QUICK REVIEW

[Paper Review] Tabular Benchmarks for Joint Architecture and Hyperparameter Optimization

Aaron Klein, Frank Hutter|arXiv (Cornell University)|May 13, 2019

Machine Learning and Data Classification|25 references|33 citations

TL;DR

This paper provides cheap tabular benchmarks for joint neural architecture and hyperparameter optimization by exhaustively evaluating a grid of configurations of a two-layer feedforward network on four regression datasets, enabling robust, reproducible comparisons of HPO methods.

ABSTRACT

Due to the high computational demands, executing a rigorous comparison between hyperparameter optimization (HPO) methods is often cumbersome. The goal of this paper is to facilitate a better empirical evaluation of HPO methods by providing benchmarks that are cheap to evaluate, but still represent realistic use cases. We believe these benchmarks provide an easy and efficient way to conduct reproducible experiments for neural hyperparameter search. Our benchmarks consist of a large grid of configurations of a feed forward neural network on four different regression datasets including architectural hyperparameters and hyperparameters concerning the training pipeline. Based on this data, we first performed an in-depth analysis to gain a better understanding of the properties of the optimization problem, as well as of the importance of different types of hyperparameters. Second, we exhaustively compared various state-of-the-art methods from the hyperparameter optimization literature on these benchmarks in terms of performance and robustness.

Motivation & Objective

Facilitate empirical evaluation of HPO methods with realistic yet inexpensive benchmarks.
Characterize the properties of the optimization problem across a large grid of configurations.
Assess the importance of architectural vs. training hyperparameters in neural network tuning.
Compare a range of state-of-the-art HPO methods on standardized benchmarks.
Provide data and code to enable reproducible experiments in neural HPO/NAS research.

Proposed method

Construct a large grid of configurations for a two-layer feedforward neural network with four architectural hyperparameters and five training hyperparameters, yielding 62,208 configurations after discretization.
Train each configuration on four UCI regression datasets (protein, slice, naval, Parkinson) with 60/20/20 train/val/test splits, normalizing features and targets.
Repeat each configuration four times with different seeds and record training/validation/test errors, training time, and parameter counts across epochs.
Analyze dataset properties and hyperparameter importance using ECDFs, Spearman correlations across budgets, and fANOVA for global importance and pairwise interactions.
Benchmark multiple HPO methods (random search, SMAC, TPE, Bohamiann, Regularized Evolution, Hyperband/BOHB, RL) using 500 independent runs per method, reporting regret and robustness.
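
The grid construction and table-lookup evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the hyperparameter names and value lists are assumptions recalled from the paper's search-space table (this review only states the total of 62,208), the table is filled with random placeholder numbers rather than real errors, and the regret here is simply the gap between the best value seen so far and the best value in the table.

```python
import itertools
import random

# ASSUMPTION: names and value lists are recalled from the paper's search-space
# table and should be checked against it. This review only gives the total.
SEARCH_SPACE = {
    "n_units_1": [16, 32, 64, 128, 256, 512],
    "n_units_2": [16, 32, 64, 128, 256, 512],
    "activation_fn_1": ["relu", "tanh"],
    "activation_fn_2": ["relu", "tanh"],
    "dropout_1": [0.0, 0.3, 0.6],
    "dropout_2": [0.0, 0.3, 0.6],
    "init_lr": [5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1],
    "lr_schedule": ["cosine", "fix"],
    "batch_size": [8, 16, 32, 64],
}


def enumerate_grid(space):
    """Yield every configuration in the Cartesian product of the value lists."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def config_key(config):
    """Turn a configuration dict into a hashable, order-independent key."""
    return tuple(sorted(config.items()))


def build_placeholder_table(space, seed=0):
    """Stand-in for the precomputed results: random numbers, NOT real errors."""
    rng = random.Random(seed)
    return {config_key(c): rng.random() for c in enumerate_grid(space)}


def random_search_regret(table, space, n_iters, seed=0):
    """Run random search against the table and return the regret trajectory."""
    rng = random.Random(seed)
    best_possible = min(table.values())
    incumbent = float("inf")
    trajectory = []
    for _ in range(n_iters):
        config = {name: rng.choice(values) for name, values in space.items()}
        # Evaluating a configuration is a dictionary lookup, not a training run.
        incumbent = min(incumbent, table[config_key(config)])
        trajectory.append(incumbent - best_possible)
    return trajectory
```

Because each evaluation is a lookup, repeating a run many times with different seeds (the review reports 500 runs per method) costs seconds rather than GPU hours. Calling `random_search_regret(build_placeholder_table(SEARCH_SPACE), SEARCH_SPACE, 100)` gives one such trajectory on the placeholder data.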

Experimental results

Research questions

RQ1: What are the empirical properties and difficulty characteristics of the HPO/NAS search space captured by the benchmark datasets?
RQ2: Which hyperparameters (and interactions) most influence final performance across datasets?
RQ3: How do different HPO methods perform and how robust are they on these tabular benchmarks?
RQ4: Do rankings of configurations remain stable across budgets and datasets, enabling effective multi-fidelity optimization?
RQ5: Can these benchmarks support reproducible evaluation and fair comparison of HPO methods?

Key findings

There is substantial variability in final error across configurations, with some achieving low MSE and many outliers with much higher errors.
Initial learning rate is a highly important hyperparameter on average, but higher-order interactions dominate in parts of the space.
The incumbent configuration is fragile to some single-hyperparameter changes; the choice of activation function (relu vs. tanh) is notably impactful.
Best configurations vary modestly across datasets, but some parameters (e.g., initial LR) remain consistently effective across all datasets.
Bayesian optimization methods and the multi-fidelity method BOHB outperform random search early on, and their later convergence differs depending on the internal model. Reinforcement learning can reach top final performance but is less sample-efficient, and both RL and Bohamiann show robustness trade-offs.
Rankings of configurations correlate across datasets when considering all configurations, but correlations weaken for only the top performers, suggesting value in multi-task data use.
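
The rank-correlation comparison in the last finding (all configurations versus only the top performers) can be illustrated with a small, standard-library sketch. Everything here is an assumption made for illustration: the two error vectors are synthetic (a log-normal sample and a noisy copy of it), the helper names are hypothetical, and the ranking ignores ties. It shows how such a comparison could be computed, not the paper's numbers.

```python
import random


def ranks(values):
    """Return 0-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank
    return out


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def top_k_spearman(a, b, k):
    """Spearman correlation restricted to the k lowest-error entries of `a`."""
    idx = sorted(range(len(a)), key=lambda i: a[i])[:k]
    return spearman([a[i] for i in idx], [b[i] for i in idx])


# SYNTHETIC data: errors on "dataset A" and a noisy copy standing in for
# "dataset B". These are not results from the benchmark.
rng = random.Random(0)
err_a = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
err_b = [e + rng.gauss(0.0, 0.3) for e in err_a]

rho_all = spearman(err_a, err_b)               # over all configurations
rho_top = top_k_spearman(err_a, err_b, k=100)  # over the best 100 only
```

On this synthetic data the correlation over the top 100 entries is weaker than over all entries because the best configurations differ by less than the added noise. The same computation applies unchanged to rankings at two training budgets instead of two datasets.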

This review was created by AI and reviewed by human editors.