QUICK REVIEW

[Paper Review] HyperImpute: Generalized Iterative Imputation with Automatic Model Selection

Daniel Jarrett, Bogdan Cebere|arXiv (Cornell University)|2022. 06. 15.

Machine Learning in Healthcare|Citations: 23

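One-Line Summary

HyperImpute introduces a generalized iterative imputation framework that automatically configures column-wise models and their hyperparameters, integrating AutoML to automate model selection inside the iterative imputation loop. Under MAR settings, it shows strong empirical gains over established benchmarks.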

ABSTRACT

Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.

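Motivation and Objectives

Motivates the imputation problem under MCAR/MAR settings and highlights the limitations of existing methods.
Proposes generalized iterative imputation that automatically selects a model and hyperparameters for each column.
Provides a practical, extensible implementation with out-of-the-box learners, optimizers, simulators, and interfaces.
Empirically evaluates HyperImpute against strong benchmarks across a variety of datasets and missingness mechanisms.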

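Proposed Method

Formalizes incomplete data with a missingness mask and the imputation problem.
Introduces generalized iterative imputation that searches, for each column, over a space of univariate (column-wise) models and hyperparameters.
Develops automatic model selection (AutoML) that chooses a model and hyperparameters per column inside the loop (Inside-Out Search).
Provides a practical implementation with plug-and-play learners, optimizers (such as Hyperband), and imputers that are compatible with sklearn pipelines.
Runs extensive experiments on UCI datasets under MAR and compares against recent benchmarks including ICE, MissForest, GAIN, MIWAE, Sinkhorn, and MIRACLE (additional settings are in the appendix).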
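
The idea of selecting a model per column inside the imputation loop can be illustrated with a minimal sketch. This is not the HyperImpute implementation: the candidate pool (a mean predictor and two ridge regressors), the K-fold selection rule, and the names `CANDIDATES`, `select_model`, and `iterative_impute` are all assumptions chosen for brevity, and only NumPy is used. The real framework searches a much larger space of learners with dedicated optimizers.

```python
import numpy as np

def fit_mean(X, y):
    """Baseline learner: always predict the mean of the observed targets."""
    m = y.mean()
    return lambda Z: np.full(Z.shape[0], m)

def fit_ridge(X, y, alpha):
    """Ridge regression with an unpenalised intercept, solved in closed form."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = alpha * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # do not shrink the intercept
    w = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ w

# Hypothetical candidate pool: each entry maps (X, y) to a predictor function.
CANDIDATES = {
    "mean": fit_mean,
    "ridge(0.1)": lambda X, y: fit_ridge(X, y, 0.1),
    "ridge(10)": lambda X, y: fit_ridge(X, y, 10.0),
}

def select_model(X, y, rng, n_folds=3):
    """Return the name of the candidate with the lowest K-fold validation MSE."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_name, best_err = None, np.inf
    for name, fit in CANDIDATES.items():
        err = 0.0
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            pred = fit(X[trn], y[trn])(X[val])
            err += np.mean((pred - y[val]) ** 2)
        if err < best_err:
            best_name, best_err = name, err
    return best_name

def iterative_impute(X_missing, n_iters=5, seed=0):
    """Round-robin imputation that re-selects a model for each column on every pass."""
    rng = np.random.default_rng(seed)
    mask = np.isnan(X_missing)                       # True where a value is missing
    X = X_missing.copy()
    col_means = np.nanmean(X_missing, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])  # initial fill: column means
    chosen = {}
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            miss = mask[:, j]
            if not miss.any():
                continue
            others = np.delete(X, j, axis=1)         # all other columns as features
            name = select_model(others[~miss], X[~miss, j], rng)
            model = CANDIDATES[name](others[~miss], X[~miss, j])
            X[miss, j] = model(others[miss])         # overwrite only missing cells
            chosen[j] = name
    return X, chosen
```

Calling `iterative_impute(X)` on an array whose missing entries are `np.nan` returns the completed array together with the learner chosen for each column on the final pass. Because selection is repeated on every pass, the chosen learner for a column can change as the other columns' imputations improve.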

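Experimental Results

Research Questions

RQ1: Can iterative imputation with automatic model selection outperform complex generative models under MAR settings?
RQ2: Does adaptive automatic selection improve imputation accuracy and distributional fidelity through column-wise modeling?
RQ3: What are the sources of HyperImpute's performance gains (column-wise configuration, model selection, adaptivity, base learners)?
RQ4: What convergence properties does HyperImpute exhibit, and how does it behave across iterations and datasets?
RQ5: Is HyperImpute robust across MCAR/MAR (including some MNAR analysis in the appendix) and varied dataset characteristics?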
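
To make the MCAR/MAR distinction in these questions concrete, the following sketch generates both kinds of missingness mask. It is a hypothetical simulator, not the one shipped with the paper: the logistic dependence on a set of always-observed "driver" columns, and the names `mcar_mask`, `mar_mask`, and `n_observed_cols`, are assumptions made for illustration.

```python
import numpy as np

def mcar_mask(shape, p_miss, rng):
    """MCAR: every cell is missing with the same probability, independent of the data."""
    return rng.random(shape) < p_miss

def mar_mask(X, p_miss, rng, n_observed_cols=1):
    """MAR: missingness in the remaining columns depends only on always-observed columns."""
    n, d = X.shape
    driver = X[:, :n_observed_cols].sum(axis=1)
    z = (driver - driver.mean()) / (driver.std() + 1e-12)
    prob = 1.0 / (1.0 + np.exp(-z))          # logistic link on the observed driver
    prob = prob * p_miss / prob.mean()       # rescale so the average rate is p_miss
    prob = np.clip(prob, 0.0, 1.0)
    mask = np.zeros((n, d), dtype=bool)      # driver columns stay fully observed
    mask[:, n_observed_cols:] = rng.random((n, d - n_observed_cols)) < prob[:, None]
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
mask = mar_mask(X, p_miss=0.3, rng=rng)
X_missing = np.where(mask, np.nan, X)        # apply the mask to obtain incomplete data
```

Under this MAR mask, rows with a large value in the first column lose more of their other entries, whereas under MCAR the missing rate is the same everywhere.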

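Key Findings

Under MAR with 30% missingness, HyperImpute outperforms the benchmarks on 10 of 12 UCI datasets in both RMSE and Wasserstein distance.
Across the sensitivity analyses, HyperImpute's performance advantage grows with more samples and more features.
In MAR settings it achieves lower Wasserstein distance than the baselines, suggesting better distributional fidelity.
Model selection picks different learners across datasets and iterations, demonstrating adaptive column-wise configuration.
The Inside-Out search strategy enables automatic model selection while keeping computational cost from becoming excessive and preserving the benefits of iterative imputation.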
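
The two metrics named above measure different things, which a short sketch can show. The code below is an illustration under assumptions, not the paper's evaluation code: RMSE is computed over the masked cells only, and the Wasserstein distance is simplified to a per-column one-dimensional W1 averaged over columns (the paper may use a different, multivariate formulation). The function names are hypothetical.

```python
import numpy as np

def imputation_rmse(X_true, X_imputed, mask):
    """Point-wise error on the cells that were missing (mask == True)."""
    diff = (X_imputed - X_true)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def wasserstein_1d(a, b):
    """W1 between two equal-sized 1-D samples: mean gap between sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def mean_columnwise_wd(X_true, X_imputed, mask):
    """Average the 1-D distance over every column that has missing cells."""
    dists = [
        wasserstein_1d(X_true[mask[:, j], j], X_imputed[mask[:, j], j])
        for j in range(X_true.shape[1])
        if mask[:, j].any()
    ]
    return float(np.mean(dists))

# Imputed values that match the true distribution but land on the wrong rows:
X_true = np.array([[0.0], [1.0], [2.0]])
X_imputed = np.array([[2.0], [1.0], [0.0]])
mask = np.ones_like(X_true, dtype=bool)
rmse = imputation_rmse(X_true, X_imputed, mask)   # nonzero: rows are mismatched
wd = mean_columnwise_wd(X_true, X_imputed, mask)  # 0.0: same set of values
```

In this toy case the distributional distance is zero while the RMSE is not, which is why the review reports both: RMSE rewards per-cell accuracy, and Wasserstein distance rewards recovering the overall distribution.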


This review was generated by AI and reviewed by a human editor.