QUICK REVIEW

[Paper Review] Probabilistic Matrix Factorization for Automated Machine Learning

Nicolò Fusi, Rishit Sheth|arXiv (Cornell University)|May 15, 2017

Machine Learning and Data Classification|References: 20|Cited by: 64

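One-Sentence Summary

The paper formulates AutoML as a probabilistic matrix factorization problem, using a Gaussian process latent variable model to predict pipeline performance across datasets and to guide pipeline exploration driven by Bayesian optimization.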

ABSTRACT

In order to achieve state-of-the-art performance, modern machine learning techniques require careful data pre-processing and hyperparameter tuning. Moreover, given the ever increasing number of machine learning models being developed, model selection is becoming increasingly important. Automating the selection and tuning of machine learning pipelines consisting of data pre-processing methods and machine learning models, has long been one of the goals of the machine learning community. In this paper, we tackle this meta-learning task by combining ideas from collaborative filtering and Bayesian optimization. Using probabilistic matrix factorization techniques and acquisition functions from Bayesian optimization, we exploit experiments performed in hundreds of different datasets to guide the exploration of the space of possible pipelines. In our experiments, we show that our approach quickly identifies high-performing pipelines across a wide range of datasets, significantly outperforming the current state-of-the-art.

Motivation and Objectives

Automate the selection and tuning of ML pipelines, including data pre-processing and model choices.
Leverage cross-dataset experiment data to predict pipeline performance on new datasets.
Address high-dimensional, mixed (continuous/discrete/categorical) pipeline spaces by working with a discrete set of instantiated pipelines.
Integrate collaborative filtering with Bayesian optimization to guide pipeline exploration.

Proposed Method

Model the pipeline-dataset performance matrix Y with probabilistic matrix factorization: Y ≈ XW.
Place Gaussian process priors over nonlinear mappings f_d(x) to capture nonlinearity in pipeline performance.
Use a squared exponential kernel with ARD for the GP priors to model latent function smoothness.
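Handle missing data by marginalizing the missing entries out of the GP likelihood (for a Gaussian likelihood this is equivalent to using only the observed entries), and fit X, θ, and σ^2 with stochastic gradient updates.
Compute predictions for a new dataset d via the standard GP regression formulas with C_d = K(X_e(d), X_e(d)) + σ^2 I, where X_e(d) appears to denote the latent embeddings of the pipelines already evaluated on d.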
Use acquisition functions, notably Expected Improvement (EI), to select the next pipeline to evaluate.
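
The prediction and selection steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the latent embeddings X have already been learned (here they are random placeholders), treats each dataset independently, and maximizes accuracy. The function names, the toy scores, the unit length scales, and the centering of y are all assumptions made for the example.

```python
import math
import numpy as np

def se_ard_kernel(A, B, lengthscales, variance=1.0):
    """Squared exponential kernel with one length scale per latent dimension (ARD)."""
    A = np.asarray(A, dtype=float) / lengthscales
    B = np.asarray(B, dtype=float) / lengthscales
    sq_dist = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0))

def gp_predict(X_obs, y_obs, X_new, lengthscales, noise_var, variance=1.0):
    """Zero-mean GP posterior mean and latent variance at X_new for one dataset."""
    # C = K(X_obs, X_obs) + sigma^2 I, as in the prediction step above
    C = se_ard_kernel(X_obs, X_obs, lengthscales, variance)
    C = C + noise_var * np.eye(len(X_obs))
    K_s = se_ard_kernel(X_new, X_obs, lengthscales, variance)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, np.asarray(y_obs, dtype=float)))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = variance - (v**2).sum(axis=0)
    return mean, np.maximum(var, 1e-12)

def expected_improvement(mean, var, best):
    """EI for maximization: E[max(f - best, 0)] with f ~ N(mean, var)."""
    std = np.sqrt(var)
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

# Toy setup (all values are made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))        # placeholder embeddings: 30 pipelines, 3 latent dims
evaluated = [0, 5, 12]              # pipelines already run on the new dataset
y = np.array([0.71, 0.64, 0.80])    # their (hypothetical) accuracies
candidates = [i for i in range(len(X)) if i not in evaluated]

# Center the scores because the sketch uses a zero-mean GP.
y_mean = y.mean()
mean, var = gp_predict(X[evaluated], y - y_mean, X[candidates],
                       lengthscales=np.ones(3), noise_var=1e-2)
ei = expected_improvement(mean + y_mean, var, best=y.max())
next_pipeline = candidates[int(np.argmax(ei))]
```

In a full loop, next_pipeline would be evaluated on the new dataset, appended to evaluated and y, and the prediction and EI steps repeated until the budget is spent.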

Experimental Results

Research Questions

RQ1: Can performance across datasets be captured via a low-dimensional latent space of pipelines to predict outcomes on new datasets?
RQ2: Does treating pipeline evaluation as a probabilistic matrix factorization task improve AutoML pipeline selection over baselines?
RQ3: Can acquisition functions from Bayesian optimization effectively guide the exploration of a discrete/instantiated pipeline space?
RQ4: How robust is the approach to missing evaluations in the pipeline-dataset performance matrix?
RQ5: Is including explicit pipeline metadata necessary when sufficient data is available?

Key Findings

PMF-based AutoML consistently achieves the best average rank across 89 held-out datasets as the number of iterations increases.
The method outperforms auto-sklearn and random-search baselines in average rank and in gap to the best pipeline on held-out datasets.
The approach remains robust even when 90% of the entries of Y are missing, still outperforming competitors.
Latent embeddings (dimensionality 20) effectively capture model structure and hyperparameters across pipelines.
Including pipeline metadata does not improve performance when enough experimental data is available, as the model learns from Y alone.
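
The two metrics named in these findings, average rank and gap to the best pipeline, can be computed as in the following sketch. It is an illustration under assumptions, not the paper's evaluation code: the method names and scores are invented, higher scores are taken to be better, and ties share the average of the ranks they span.

```python
def average_ranks(scores):
    """scores[method] holds the best score found per dataset (higher is better).

    Returns the mean rank of each method across datasets, where rank 1 is best
    and tied methods share the average of the ranks they occupy.
    """
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        col = {m: scores[m][d] for m in methods}
        for m in methods:
            better = sum(col[o] > col[m] for o in methods)
            tied = sum(col[o] == col[m] for o in methods)  # counts m itself
            totals[m] += better + (tied + 1) / 2.0
    return {m: totals[m] / n_datasets for m in methods}

def mean_gap(found, best_possible):
    """Mean difference between the best achievable score and the score found."""
    return sum(b - f for f, b in zip(found, best_possible)) / len(found)

# Invented scores for three hypothetical methods on three datasets.
toy_scores = {
    "method_a": [0.90, 0.80, 0.75],
    "method_b": [0.85, 0.80, 0.70],
    "method_c": [0.88, 0.78, 0.75],
}
ranks = average_ranks(toy_scores)
gap_a = mean_gap(toy_scores["method_a"], best_possible=[0.92, 0.80, 0.78])
```

A lower average rank and a smaller gap both indicate a better search method; the rank depends only on the ordering of methods per dataset, while the gap also reflects how far each method is from the best known pipeline.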

This review was generated by AI and reviewed by a human editor.