QUICK REVIEW

[Paper Review] Classification of datasets with imputed missing values: does imputation quality matter?

Tolou Shadbahr, Michael Roberts | arXiv (Cornell University) | 2022-06-16

Machine Learning and Data Classification | References: 58 | Citations: 7

One-line summary

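This paper investigates whether the quality of data imputation affects downstream classification performance in machine learning. It shows that conventional metrics such as RMSE are poor predictors of model performance and, to assess imputation fidelity more reliably, proposes a new class of discrepancy scores based on the sliced Wasserstein distance. Surprisingly, high-performing classifiers can still be obtained from poorly imputed data, but such models assign feature importance incorrectly and are less interpretable.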

ABSTRACT

BACKGROUND: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance.

METHODS: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.

RESULTS: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.

CONCLUSIONS: It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.

Research motivation and objectives

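Investigate whether imputation quality affects downstream classification performance in machine learning.
Assess the limitations of standard imputation quality metrics such as RMSE, MAE, and R², and check whether they fail to reflect true fidelity to the data distribution.
Develop and validate a new class of discrepancy scores based on the sliced Wasserstein distance for more accurate assessment of imputation quality.
Examine the link between imputation quality and model interpretability, in particular with respect to incorrect feature importance assignment.
Provide a public codebase for reproducible benchmarking of imputation and classification pipelines.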

Proposed method

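Proposes a new class of discrepancy scores based on the sliced Wasserstein distance to assess how well an imputation reconstructs the full feature distribution.
Uses multi-factor analysis of variance (ANOVA) to quantify how the imputation method, the choice of classifier, and the missingness rate affect downstream classification AUC.
Evaluates imputation methods on simulated data with controlled missingness and on real-world clinical datasets (Simulated, Breast Cancer, MIMIC-III, NHSX COVID-19).
Compares quality across imputation methods using the new distribution-based discrepancy scores alongside the standard metrics (RMSE, MAE, R²).
Assesses feature importance in models trained on imputed data through an interpretability analysis based on SHAP values.
Provides a public codebase and benchmarking framework for reproducible evaluation of imputation and classification performance.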
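
To make the idea of a sliced Wasserstein comparison concrete, the following is a minimal sketch and not the paper's exact discrepancy score, whose definition is not given in this review. It assumes two equally sized samples (for example, the original complete data and its imputed counterpart), uses the 1-Wasserstein distance on each random one-dimensional projection, and averages over projections. The function name `sliced_wasserstein`, the parameter `n_projections`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=200, seed=0):
    """Monte Carlo estimate of a sliced 1-Wasserstein distance.

    X and Y are (n_samples, n_features) arrays of identical shape,
    e.g. original data and imputed data (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("this sketch assumes X and Y have the same shape")
    rng = np.random.default_rng(seed)
    # Random unit vectors: directions onto which both samples are projected.
    directions = rng.normal(size=(n_projections, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Project and sort each 1-D projection. For equal-sized samples, the
    # 1-D Wasserstein-1 distance is the mean absolute gap between sorted values.
    proj_x = np.sort(X @ directions.T, axis=0)
    proj_y = np.sort(Y @ directions.T, axis=0)
    per_direction = np.abs(proj_x - proj_y).mean(axis=0)
    return float(per_direction.mean())

# Synthetic illustration (not data from the paper).
rng = np.random.default_rng(42)
original = rng.normal(size=(500, 5))
same_distribution = rng.normal(size=(500, 5))
shifted = original + 1.0  # every feature shifted by one unit

swd_same = sliced_wasserstein(original, same_distribution)
swd_shifted = sliced_wasserstein(original, shifted)
print(f"same distribution: {swd_same:.3f}, shifted: {swd_shifted:.3f}")
```

A sample drawn from the same distribution should score much closer to zero than the shifted copy. Because each projection reduces to a sort, a score of this kind stays cheap to compute even when the number of features is large.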

Experimental results

Research questions

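RQ1: How does the choice of imputation method affect downstream classification performance across different datasets and missingness rates?
RQ2: How strongly do standard imputation quality metrics (RMSE, MAE, etc.) correlate with actual downstream classification performance?
RQ3: Can the new class of discrepancy scores based on the sliced Wasserstein distance capture imputation quality better than existing metrics?
RQ4: Does low imputation quality lead to incorrect or spurious feature importance assignments in the trained classifier?
RQ5: How stable are deep-learning-based imputation methods (GAIN, MIWAE, etc.) across repeated runs, and how does this affect performance?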

Key findings

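The proposed sliced-Wasserstein-based discrepancy scores correlated more strongly with downstream classification performance than the existing metrics (RMSE, MAE).
Even with low imputation quality, strong classifiers such as XGBoost and neural networks could reach a high AUC (e.g., up to 0.88 on the Simulated dataset), which suggests robustness to noise.
Imputation methods with neural network components (GAIN, MIWAE) showed high variability across repeated runs, which suggests sensitivity to local minima.
Classifiers trained on poorly imputed data assign incorrect importance to features, reducing the interpretability and trustworthiness of the model.
Standard metrics such as RMSE and MAE showed no correlation with downstream performance, whereas distribution-based discrepancy scores (e.g., feature-wise KL, KS, and Wasserstein) showed a significant correlation.
The interaction between imputation method and classifier choice strongly affects performance; NGBoost and XGBoost performed especially well on well-imputed data, particularly with MIWAE and MICE.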
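
One way to see why a pointwise error such as RMSE and a distribution-based score can rank the same imputations differently is a small toy comparison. The sketch below is not an experiment from the paper: the single Gaussian feature, the 30% missing-completely-at-random mask, the two imputers (observed-mean fill and random draw from observed values), and the helper names `rmse` and `wasserstein_1d` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=5000)  # one synthetic feature
missing = rng.random(values.size) < 0.3              # about 30% missing at random
observed = values[~missing]
truth = values[missing]                              # held-out ground truth

# Imputer A: replace every missing entry with the observed mean.
mean_imputed = np.full(truth.size, observed.mean())
# Imputer B: replace each missing entry with a randomly drawn observed value.
draw_imputed = rng.choice(observed, size=truth.size, replace=True)

def rmse(a, b):
    """Root mean squared error between paired values."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance for equal-sized samples:
    the mean absolute gap between sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rmse_mean = rmse(truth, mean_imputed)
rmse_draw = rmse(truth, draw_imputed)
w_mean = wasserstein_1d(truth, mean_imputed)
w_draw = wasserstein_1d(truth, draw_imputed)

print(f"mean fill:   RMSE={rmse_mean:.3f}  W1={w_mean:.3f}")
print(f"random draw: RMSE={rmse_draw:.3f}  W1={w_draw:.3f}")
```

In this setup the mean fill obtains the lower RMSE, yet it collapses all imputed values onto a single point, so its distance to the true distribution is the larger of the two. The random draw shows the opposite pattern: a higher RMSE but a much closer match to the distribution.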

This review was generated by AI and reviewed by a human editor.