QUICK REVIEW

[Paper Review] Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms

Arthur Dantas Mangussi, Ricardo Cardoso Pereira | arXiv (Cornell University) | 2026-03-20

Machine Learning in Healthcare · Citations: 0

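One-line summary

This paper benchmarks five LLMs against six traditional imputation baselines on 29 datasets (real-world and synthetic) under MCAR, MAR, and MNAR, and finds that LLMs perform better on real-world data but can hallucinate and cost more, with their performance tied to prior domain knowledge.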

ABSTRACT

Data imputation is a cornerstone technique for handling missing values in real-world datasets, which are often plagued by missingness. Despite recent progress, prior studies on Large Language Models-based imputation remain limited by scalability challenges, restricted cross-model comparisons, and evaluations conducted on small or domain-specific datasets. Furthermore, heterogeneous experimental protocols and inconsistent treatment of missingness mechanisms (MCAR, MAR, and MNAR) hinder systematic benchmarking across methods. This work investigates the robustness of Large Language Models for missing data imputation in tabular datasets using a zero-shot prompt engineering approach. To this end, we present a comprehensive benchmarking study comparing five widely used LLMs against six state-of-the-art imputation baselines. The experimental design evaluates these methods across 29 datasets (including nine synthetic datasets) under MCAR, MAR, and MNAR mechanisms, with missing rates of up to 20%. The results demonstrate that leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently achieve superior performance on real-world open-source datasets compared to traditional methods. However, this advantage appears to be closely tied to the models' prior exposure to domain-specific patterns learned during pre-training on internet-scale corpora. In contrast, on synthetic datasets, traditional methods such as MICE outperform LLMs, suggesting that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. Furthermore, we identify a clear trade-off: while LLMs excel in imputation quality, they incur significantly higher computational time and monetary costs. Overall, this study provides a large-scale comparative analysis, positioning LLMs as promising semantics-driven imputers for complex tabular data.

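Research motivation and goals

Evaluate the robustness of multiple LLMs for missing data imputation in tabular datasets using zero-shot prompt engineering.
Determine whether LLMs' pre-trained knowledge improves imputation performance over traditional baselines on open real-world datasets.
Investigate hallucination risk and the role of semantic context in LLM-based imputation.
Provide a scalable, reproducible evaluation framework with standardized missingness mechanisms.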

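Proposed method

Imputes missing values in 29 datasets (9 synthetic, 20 open-source) using five LLMs and six traditional baselines.
Introduces a batch prompt-construction strategy with a system persona, constraints, and a strict output format, intended to make imputation robust.
Applies stratified 5-fold cross-validation with missing values injected under MCAR, MAR, and MNAR at rates of 5%, 10%, and 20%.
Evaluates with normalized RMSE (NRMSE) and analyzes computational cost (tokens, time, monetary cost).
Feeds the LLM 40x10 subsets of the data through a sliding-window batching approach, with retries where needed and mean imputation as a fallback.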
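The batching step with retries and a mean-imputation fallback can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `call_llm` is a hypothetical callable standing in for the prompt-and-parse step, the windows here are non-overlapping and split only by rows (40 per batch), and the validity check and retry count are assumptions.

```python
import math
from typing import Callable, List, Optional

Row = List[Optional[float]]  # None marks a missing cell


def column_means(rows: List[Row]) -> List[float]:
    """Mean of the observed values in each column (0.0 if none are observed)."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def is_valid(batch: List[Row], answer: object) -> bool:
    """Accept an answer only if it matches the batch shape and fills every gap with a finite number."""
    if not isinstance(answer, list) or len(answer) != len(batch):
        return False
    for src, out in zip(batch, answer):
        if not isinstance(out, list) or len(out) != len(src):
            return False
        for s, o in zip(src, out):
            if s is None and not (isinstance(o, (int, float)) and math.isfinite(o)):
                return False
    return True


def impute_in_batches(
    rows: List[Row],
    call_llm: Callable[[List[Row]], object],  # hypothetical: returns a filled batch
    batch_rows: int = 40,                     # assumed to match the 40-row windows
    max_retries: int = 3,                     # assumed retry budget
) -> List[List[float]]:
    means = column_means(rows)
    result = []
    for start in range(0, len(rows), batch_rows):
        batch = rows[start:start + batch_rows]
        answer = None
        for _ in range(max_retries):
            candidate = call_llm(batch)
            if is_valid(batch, candidate):
                answer = candidate
                break
        for i, row in enumerate(batch):
            # Observed cells are always kept; gaps take the LLM value,
            # or the column mean when every retry failed.
            result.append([
                (answer[i][j] if answer is not None else means[j]) if v is None else v
                for j, v in enumerate(row)
            ])
    return result
```

Keeping the observed cells from the input, instead of trusting the model to echo them back, is one simple way to stop a malformed or hallucinated response from overwriting known values.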

Figure 1: Overview of methodology applied in this work.
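The evaluation loop in the method (hide known values, impute, score with NRMSE) can be illustrated with a small sketch. Two points here are assumptions, since the review does not specify them: the MCAR mask is drawn uniformly over all cells, and NRMSE is normalized by each column's min-max range in the ground truth. MAR and MNAR masks depend on observed or unobserved values and are not shown.

```python
import math
import random
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row index, column index)


def mcar_mask(n_rows: int, n_cols: int, rate: float, seed: int = 0) -> Set[Cell]:
    """Choose round(rate * n_cells) cells uniformly at random to hide (MCAR)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    return set(rng.sample(cells, round(rate * len(cells))))


def nrmse(truth: List[List[float]], imputed: List[List[float]], mask: Set[Cell]) -> float:
    """RMSE over the hidden cells, with each error scaled by its column's range.

    The range normalization is an assumption; other conventions divide by
    the standard deviation or the mean instead.
    """
    if not mask:
        return 0.0
    spans = []
    for j in range(len(truth[0])):
        col = [row[j] for row in truth]
        spans.append((max(col) - min(col)) or 1.0)  # avoid dividing by zero
    squared = [((imputed[i][j] - truth[i][j]) / spans[j]) ** 2 for i, j in mask]
    return math.sqrt(sum(squared) / len(squared))
```

For example, `mcar_mask(10, 10, 0.2)` hides 20 of 100 cells, and a perfect imputation scores an NRMSE of 0.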

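Experimental results

Research questions

RQ1: Can missing data be imputed with prompt engineering alone, or does this introduce bias?
RQ2: Does the background knowledge LLMs acquire from internet-scale corpora improve imputation performance?
RQ3: Do hallucinations occur more often in unfamiliar imputation contexts?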

Key results

MD mechanism	5% MNAR	10% MNAR	20% MNAR	5% MCAR	10% MCAR	20% MCAR	5% MAR	10% MAR	20% MAR
SoftImpute	0.654	0.644	0.649	0.273	0.294	0.320	0.311	0.325	0.351
kNN	0.485	0.496	0.509	0.203	0.228	0.256	0.236	0.249	0.284
missForest	0.418	0.440	0.453	0.192	0.218	0.242	0.233	0.242	0.283
MICE	0.426	0.439	0.475	0.174	0.212	0.292	0.211	0.227	0.298
SAEI	0.518	0.482	0.418	0.295	0.313	0.320	0.330	0.333	0.335
TabPFN	0.621	0.683	0.710	0.219	0.276	0.437	0.317	0.354	0.411
Xiaomi: MiMo-V2-Flash	0.439	0.435	0.416	0.207	0.236	0.249	0.204	0.221	0.225
Mistral: Devstral 2 2512	0.435	0.424	0.389	0.210	0.229	0.236	0.207	0.218	0.235
Gemini 3.0 Flash	0.333	0.325	0.308	0.150	0.172	0.185	0.211	0.234	0.200
Claude 4.5 Sonnet	0.369	0.361	0.345	0.153	0.175	0.188	0.168	0.182	0.196
GPT-4.1-Nano	0.432	0.405	0.425	0.221	0.234	0.252	0.221	0.232	0.240

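Gemini 3.0 Flash and Claude 4.5 Sonnet achieve better imputation performance than the classical baselines on real-world open datasets, as measured by NRMSE.
On synthetic datasets, traditional methods (MICE, missForest, and others) can outperform LLMs, which suggests that semantic context is what helps LLMs on real-world data.
LLMs deliver higher imputation quality but incur greater computational time and monetary cost.
MNAR remains challenging for ML-based methods, and under it LLMs also benefit from semantic context.
Differences among the LLMs suggest that training cutoff dates and pre-training data affect performance.
Post-hoc analysis shows that the overall performance difference between Gemini 3.0 Flash and Claude 4.5 Sonnet is not statistically significant.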
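One quick way to read the results table is to find the lowest NRMSE in each of the nine settings. The sketch below simply re-enters the values from the table above and counts the winners; it assumes lower NRMSE is better and makes no claim about which datasets the table averages over, since the review does not say.

```python
from collections import Counter

SETTINGS = [
    "5% MNAR", "10% MNAR", "20% MNAR",
    "5% MCAR", "10% MCAR", "20% MCAR",
    "5% MAR", "10% MAR", "20% MAR",
]

# NRMSE values copied from the results table (lower is better).
NRMSE = {
    "SoftImpute":               [0.654, 0.644, 0.649, 0.273, 0.294, 0.320, 0.311, 0.325, 0.351],
    "kNN":                      [0.485, 0.496, 0.509, 0.203, 0.228, 0.256, 0.236, 0.249, 0.284],
    "missForest":               [0.418, 0.440, 0.453, 0.192, 0.218, 0.242, 0.233, 0.242, 0.283],
    "MICE":                     [0.426, 0.439, 0.475, 0.174, 0.212, 0.292, 0.211, 0.227, 0.298],
    "SAEI":                     [0.518, 0.482, 0.418, 0.295, 0.313, 0.320, 0.330, 0.333, 0.335],
    "TabPFN":                   [0.621, 0.683, 0.710, 0.219, 0.276, 0.437, 0.317, 0.354, 0.411],
    "Xiaomi: MiMo-V2-Flash":    [0.439, 0.435, 0.416, 0.207, 0.236, 0.249, 0.204, 0.221, 0.225],
    "Mistral: Devstral 2 2512": [0.435, 0.424, 0.389, 0.210, 0.229, 0.236, 0.207, 0.218, 0.235],
    "Gemini 3.0 Flash":         [0.333, 0.325, 0.308, 0.150, 0.172, 0.185, 0.211, 0.234, 0.200],
    "Claude 4.5 Sonnet":        [0.369, 0.361, 0.345, 0.153, 0.175, 0.188, 0.168, 0.182, 0.196],
    "GPT-4.1-Nano":             [0.432, 0.405, 0.425, 0.221, 0.234, 0.252, 0.221, 0.232, 0.240],
}


def best_per_setting(table, settings):
    """Map each setting to the method with the lowest NRMSE in that column."""
    return {s: min(table, key=lambda m: table[m][k]) for k, s in enumerate(settings)}


best = best_per_setting(NRMSE, SETTINGS)
wins = Counter(best.values())
for setting, method in best.items():
    print(f"{setting}: {method}")
print(dict(wins))
```

On these numbers, Gemini 3.0 Flash has the lowest NRMSE in all six MNAR and MCAR settings and Claude 4.5 Sonnet in all three MAR settings, which fits the finding that neither model is significantly better overall.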

Figure 2: Illustration of the complete prompt structure used to perform data imputation via prompt engineering.

This review was generated by AI and reviewed by a human editor.