QUICK REVIEW

[Paper Review] HyperImpute: Generalized Iterative Imputation with Automatic Model Selection

Daniel Jarrett, Bogdan Cebere|arXiv (Cornell University)|Jun 15, 2022

Machine Learning in Healthcare | Cited by 23

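One-Sentence Summary

HyperImpute introduces a generalized iterative imputation framework that automatically configures column-wise models and their hyperparameters by integrating AutoML-based model selection into the iterative imputation loop. Under MAR settings, it shows significant empirical gains over conventional benchmarks.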

ABSTRACT

Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.

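Research Motivation and Objectives

Motivate and formalize the imputation problem under MCAR/MAR settings, highlighting the limitations of existing methods.
Propose a generalized iterative imputation method that automatically selects column-wise models and hyperparameters.
Provide a practical, extensible implementation with out-of-the-box learners, optimizers, simulators, and interfaces.
Empirically evaluate HyperImpute against strong benchmarks across diverse datasets and missingness mechanisms.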
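The MCAR/MAR distinction in the objectives above can be made concrete with a small mask simulator. The sketch below is illustrative only and is not the simulator shipped with HyperImpute: the function names `mcar_mask` and `mar_mask`, the logistic missingness model, and the choice of always-observed "driver" columns are all assumptions made for this example. Under MCAR every entry is dropped with the same probability; under MAR the probability of dropping an entry depends only on values that are always observed.

```python
import numpy as np


def mcar_mask(X, p_miss, rng):
    """MCAR: every entry is missing independently with probability p_miss."""
    return rng.random(X.shape) < p_miss


def mar_mask(X, p_miss, rng, n_observed=1):
    """MAR (hypothetical logistic model): the first n_observed columns are
    never masked, and missingness in the other columns depends only on them."""
    n, d = X.shape
    mask = np.zeros((n, d), dtype=bool)
    driver = X[:, :n_observed]
    driver = (driver - driver.mean(axis=0)) / (driver.std(axis=0) + 1e-12)
    for j in range(n_observed, d):
        logits = driver @ rng.normal(size=n_observed)
        # Bisection on the intercept so the average missing rate is about p_miss.
        lo, hi = -50.0, 50.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if np.mean(1 / (1 + np.exp(-(logits + mid)))) < p_miss:
                lo = mid
            else:
                hi = mid
        prob = 1 / (1 + np.exp(-(logits + lo)))
        mask[:, j] = rng.random(n) < prob
    return mask


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
m_mcar = mcar_mask(X, 0.3, rng)
m_mar = mar_mask(X, 0.3, rng)
print("MCAR missing rate:", m_mcar.mean())
print("MAR missing rate (masked columns):", m_mar[:, 1:].mean())
```

In the MAR mask, rows with a large value in the driver column are systematically more (or less) likely to lose their other entries, which is the dependence an imputer has to cope with in the MAR experiments.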

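Proposed Method

Formalize incomplete data and the imputation problem using missingness masks.
Introduce generalized iterative imputation that searches over the space of univariate models and hyperparameters for each column.
Develop automatic model selection (AutoML) that chooses a model and hyperparameters for each column inside the iterative loop (Inside-Out Search).
Provide a practical implementation with plug-and-play learners, optimizers (such as Hyperband), and imputers compatible with sklearn pipelines.
Run comprehensive experiments on UCI datasets under MAR (with other settings in the appendix), comparing against state-of-the-art baselines including ICE, MissForest, GAIN, MIWAE, Sinkhorn, and MIRACLE.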
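The core idea in the list above, selecting a model per column inside the iterative loop, can be sketched in a few lines. This is a deliberately simplified illustration and not the HyperImpute implementation: the candidate pool (a mean predictor and closed-form ridge regressions), the hold-out selection rule, and every function name here are assumptions chosen to keep the example dependency-light. The real framework searches a much richer space of learners with optimizers such as Hyperband.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return lambda Z: (Z - x_mean) @ w + y_mean


def fit_mean(X, y):
    """Baseline learner that ignores the features."""
    y_mean = y.mean()
    return lambda Z: np.full(Z.shape[0], y_mean)


# Hypothetical candidate pool, standing in for a real model/hyperparameter space.
CANDIDATES = {
    "mean": fit_mean,
    "ridge_0.01": lambda X, y: fit_ridge(X, y, 0.01),
    "ridge_1": lambda X, y: fit_ridge(X, y, 1.0),
    "ridge_100": lambda X, y: fit_ridge(X, y, 100.0),
}


def select_model(X, y, rng, val_frac=0.25):
    """Pick the candidate with the lowest hold-out RMSE, then refit on all rows."""
    idx = rng.permutation(len(y))
    n_val = max(1, int(val_frac * len(y)))
    val, train = idx[:n_val], idx[n_val:]
    scores = {}
    for name, fit in CANDIDATES.items():
        predict = fit(X[train], y[train])
        scores[name] = np.sqrt(np.mean((predict(X[val]) - y[val]) ** 2))
    best = min(scores, key=scores.get)
    return best, CANDIDATES[best](X, y)


def iterative_impute(X_missing, n_iters=5, seed=0):
    """Column-wise iterative imputation with per-column model selection."""
    rng = np.random.default_rng(seed)
    mask = np.isnan(X_missing)                       # True where a value is missing
    X = X_missing.copy()
    col_means = np.nanmean(X_missing, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])  # initial mean fill
    chosen = {}
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            miss = mask[:, j]
            if not miss.any() or (~miss).sum() < 4:
                continue                             # nothing to impute or too few rows
            others = np.delete(X, j, axis=1)
            # Model selection happens here, inside the loop, once per column.
            name, predict = select_model(others[~miss], X[~miss, j], rng)
            X[miss, j] = predict(others[miss])
            chosen[j] = name
    return X, chosen


# Toy data with strongly correlated columns; entries are dropped at random.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
X_true = np.column_stack(
    [Z[:, 0], Z[:, 0] + 0.1 * Z[:, 1], Z[:, 0] - Z[:, 1] + 0.1 * Z[:, 2]]
)
X_miss = X_true.copy()
X_miss[rng.random(X_true.shape) < 0.3] = np.nan
X_imputed, chosen = iterative_impute(X_miss)
print("model chosen per column in the last iteration:", chosen)
```

Because the selection is redone for every column at every iteration, the chosen learner can change as the other columns' imputations improve, which is the adaptive behavior the summary attributes to the framework.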

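Experimental Results

Research Questions

RQ1: Under MAR settings, can iterative imputation with automatic model selection outperform complex generative models?
RQ2: Can column-wise modeling with adaptive automatic selection improve imputation accuracy and distributional fidelity?
RQ3: What are the sources of HyperImpute's performance gains (column-wise specification, model selection, adaptivity, base learners)?
RQ4: How does HyperImpute converge and perform across iterations and across datasets?
RQ5: Is HyperImpute robust across missingness mechanisms (MCAR/MAR, with some MNAR analysis in the appendix) and across dataset characteristics?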

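Key Findings

On 10 of 12 UCI datasets, under MAR with a 30% missingness rate, HyperImpute outperforms the benchmarks in both RMSE and Wasserstein distance.
In the sensitivity analyses, HyperImpute's performance advantage grows as the number of samples and the number of features increase.
Under MAR settings, HyperImpute achieves lower Wasserstein distance than the baselines, indicating better distributional fidelity.
Model selection picks diverse learners across datasets and iterations, demonstrating adaptive column-wise configuration.
The Inside-Out search strategy makes automatic model selection feasible without excessive computational cost, preserving the advantages of iterative imputation.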
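The two metrics named in the findings measure different things: RMSE scores the imputed entries point by point, while Wasserstein distance compares the imputed and true data as distributions. The sketch below shows one simple way to compute both. It is an assumption-laden simplification: the function names are hypothetical, and the Wasserstein term here is an average of per-column one-dimensional distances, whereas the paper's evaluation may use a different (for example, multivariate) formulation.

```python
import numpy as np


def rmse_on_missing(X_true, X_imputed, mask):
    """RMSE computed only over the entries that were missing (mask == True)."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def mean_columnwise_wasserstein(X_true, X_imputed):
    """Average 1-D Wasserstein-1 distance over columns.

    For two empirical samples of equal size, the 1-D Wasserstein-1 distance
    equals the mean absolute difference between their sorted values.
    """
    a = np.sort(X_true, axis=0)
    b = np.sort(X_imputed, axis=0)
    return float(np.mean(np.abs(a - b)))


# Tiny hand-checkable example: two entries were missing and then imputed.
X_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
mask = np.zeros_like(X_true, dtype=bool)
mask[1, 0] = True
mask[2, 1] = True
X_imputed = X_true.copy()
X_imputed[1, 0] = 5.0    # true value 2.0, error 3
X_imputed[2, 1] = 34.0   # true value 30.0, error 4

rmse = rmse_on_missing(X_true, X_imputed, mask)
wd = mean_columnwise_wasserstein(X_true, X_imputed)
print(rmse)  # sqrt((9 + 16) / 2)
print(wd)    # 0.875
```

An imputer can have a low RMSE while still distorting the shape of a column's distribution (mean imputation is the classic case), which is why the two metrics are reported side by side.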

This review was generated by AI and reviewed by a human editor.