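[Paper Explained] Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms
The paper compares five LLMs against six traditional imputation baselines on 29 datasets (real and synthetic) under MCAR, MAR, and MNAR. On real-world data the LLMs perform strongly, but they can hallucinate and cost more to run, and their performance is tied to prior domain knowledge.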
Data imputation is a cornerstone technique for handling missing values in real-world datasets, which are often plagued by missingness. Despite recent progress, prior studies on Large Language Models-based imputation remain limited by scalability challenges, restricted cross-model comparisons, and evaluations conducted on small or domain-specific datasets. Furthermore, heterogeneous experimental protocols and inconsistent treatment of missingness mechanisms (MCAR, MAR, and MNAR) hinder systematic benchmarking across methods. This work investigates the robustness of Large Language Models for missing data imputation in tabular datasets using a zero-shot prompt engineering approach. To this end, we present a comprehensive benchmarking study comparing five widely used LLMs against six state-of-the-art imputation baselines. The experimental design evaluates these methods across 29 datasets (including nine synthetic datasets) under MCAR, MAR, and MNAR mechanisms, with missing rates of up to 20%. The results demonstrate that leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently achieve superior performance on real-world open-source datasets compared to traditional methods. However, this advantage appears to be closely tied to the models' prior exposure to domain-specific patterns learned during pre-training on internet-scale corpora. In contrast, on synthetic datasets, traditional methods such as MICE outperform LLMs, suggesting that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. Furthermore, we identify a clear trade-off: while LLMs excel in imputation quality, they incur significantly higher computational time and monetary costs. Overall, this study provides a large-scale comparative analysis, positioning LLMs as promising semantics-driven imputers for complex tabular data.
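Motivation and Goals
- Evaluate how robust zero-shot prompt engineering is for imputing missing values in tabular data.
- Determine whether an LLM's pre-training knowledge improves imputation on open real-world datasets relative to traditional baselines.
- Study the risk of hallucination and the role of semantic context in LLM-based imputation.
- Provide a scalable, reproducible evaluation framework that standardizes missingness mechanisms.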
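Proposed Method
- Impute missing values with five LLMs and six traditional baselines on 29 datasets (9 synthetic, 20 open-source).
- Introduce a batched prompt-construction strategy with a system role, constraints, and a strict output format to ensure robust imputation.
- Apply stratified five-fold cross-validation under MCAR, MAR, and MNAR at missing rates of 5%, 10%, and 20%.
- Evaluate with normalized root mean squared error (NRMSE) and analyze computational cost (tokens, time, and money).
- Feed the LLM 40x10 subsets through sliding-window batching, allowing retries and falling back to mean imputation when needed.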
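
The batching, retry, and fallback steps listed above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes that "40x10" means 40 rows by 10 columns, that windows do not overlap, that the fallback uses whole-table column means, and that NRMSE is normalized by the standard deviation of the true values. None of these details are stated in this summary. `llm_impute` is a hypothetical callable standing in for the prompt-and-parse step, and only MCAR masking is shown.

```python
import math
import random
from statistics import fmean, pstdev

def inject_mcar(table, rate, seed=0):
    """Copy `table`, setting each cell to None with probability `rate` (MCAR)."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in table]

def column_means(table):
    """Mean of the observed values in each column (0.0 if none are observed)."""
    means = []
    for col in zip(*table):
        observed = [v for v in col if v is not None]
        means.append(fmean(observed) if observed else 0.0)
    return means

def tiles(table, n_rows=40, n_cols=10):
    """Yield (row_start, col_start, window) for non-overlapping windows."""
    for r in range(0, len(table), n_rows):
        for c in range(0, len(table[0]), n_cols):
            yield r, c, [row[c:c + n_cols] for row in table[r:r + n_rows]]

def is_valid(filled, window):
    """Accept only a complete, same-shape answer that keeps observed cells."""
    if len(filled) != len(window):
        return False
    for f_row, w_row in zip(filled, window):
        if len(f_row) != len(w_row):
            return False
        for f, w in zip(f_row, w_row):
            if f is None or (w is not None and f != w):
                return False
    return True

def impute(table, llm_impute, max_retries=2):
    """Impute window by window; use column means if every attempt fails."""
    means = column_means(table)
    out = [list(row) for row in table]
    for r, c, window in tiles(table):
        filled = None
        for _ in range(max_retries + 1):
            try:
                candidate = llm_impute(window)  # hypothetical LLM call
                ok = is_valid(candidate, window)
            except Exception:
                ok = False  # API or parsing errors count as a failed attempt
            if ok:
                filled = candidate
                break
        if filled is None:  # mean-imputation fallback
            filled = [[means[c + j] if v is None else v
                       for j, v in enumerate(row)] for row in window]
        for i, row in enumerate(filled):
            out[r + i][c:c + len(row)] = row
    return out

def nrmse(y_true, y_pred):
    """RMSE divided by the std of the true values (assumed normalization)."""
    rmse = math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))
    return rmse / pstdev(y_true)
```

Passing a stub that always raises exercises the fallback path: every missing cell ends up filled with its column mean while observed cells stay untouched. Validating each response before accepting it is one simple way to keep malformed or hallucinated outputs from overwriting known values.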

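Experimental Results
Research Questions
- RQ1: Can LLMs impute missing data robustly through prompt engineering alone, or do they introduce bias?
- RQ2: Does the background knowledge LLMs acquire from internet-scale corpora improve imputation performance?
- RQ3: Are hallucinations more likely in unfamiliar imputation scenarios?
Key Findings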
| Method / MD mechanism | 5% MNAR | 10% MNAR | 20% MNAR | 5% MCAR | 10% MCAR | 20% MCAR | 5% MAR | 10% MAR | 20% MAR |
|---|---|---|---|---|---|---|---|---|---|
| SoftImpute | 0.654 | 0.644 | 0.649 | 0.273 | 0.294 | 0.320 | 0.311 | 0.325 | 0.351 |
| kNN | 0.485 | 0.496 | 0.509 | 0.203 | 0.228 | 0.256 | 0.236 | 0.249 | 0.284 |
| missForest | 0.418 | 0.440 | 0.453 | 0.192 | 0.218 | 0.242 | 0.233 | 0.242 | 0.283 |
| MICE | 0.426 | 0.439 | 0.475 | 0.174 | 0.212 | 0.292 | 0.211 | 0.227 | 0.298 |
| SAEI | 0.518 | 0.482 | 0.418 | 0.295 | 0.313 | 0.320 | 0.330 | 0.333 | 0.335 |
| TabPFN | 0.621 | 0.683 | 0.710 | 0.219 | 0.276 | 0.437 | 0.317 | 0.354 | 0.411 |
| Xiaomi: MiMo-V2-Flash | 0.439 | 0.435 | 0.416 | 0.207 | 0.236 | 0.249 | 0.204 | 0.221 | 0.225 |
| Mistral: Devstral 2 2512 | 0.435 | 0.424 | 0.389 | 0.210 | 0.229 | 0.236 | 0.207 | 0.218 | 0.235 |
| Gemini 3.0 Flash | 0.333 | 0.325 | 0.308 | 0.150 | 0.172 | 0.185 | 0.211 | 0.234 | 0.200 |
| Claude 4.5 Sonnet | 0.369 | 0.361 | 0.345 | 0.153 | 0.175 | 0.188 | 0.168 | 0.182 | 0.196 |
| GPT-4.1-Nano | 0.432 | 0.405 | 0.425 | 0.221 | 0.234 | 0.252 | 0.221 | 0.232 | 0.240 |
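- Gemini 3.0 Flash and Claude 4.5 Sonnet outperform the traditional baselines in imputation quality (NRMSE) on open real-world datasets.
- On synthetic datasets, traditional methods such as MICE and missForest can outperform LLMs, which suggests that semantic context is what helps LLMs on real-data tasks.
- LLMs deliver high imputation quality, but at the cost of more computation time and money.
- MNAR remains challenging for ML-based methods, while LLMs still benefit from semantic context.
- Differences between LLMs suggest that training cutoff dates and pre-training data affect performance.
- Post-hoc analysis shows no significant difference in overall performance between Gemini 3.0 Flash and Claude 4.5 Sonnet.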
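
As a rough way to read the table, the sketch below averages the reported NRMSE values over the three missing rates for a few methods and picks the lowest mean per mechanism. The numbers are copied from the table above. The plain average over missing rates is an illustrative summary chosen here, not a statistic reported by the paper, and it says nothing about statistical significance.

```python
from statistics import fmean

# NRMSE values copied from the table; each list holds the 5%, 10%, 20% rates.
NRMSE = {
    "missForest": {"MNAR": [0.418, 0.440, 0.453],
                   "MCAR": [0.192, 0.218, 0.242],
                   "MAR": [0.233, 0.242, 0.283]},
    "MICE": {"MNAR": [0.426, 0.439, 0.475],
             "MCAR": [0.174, 0.212, 0.292],
             "MAR": [0.211, 0.227, 0.298]},
    "Gemini 3.0 Flash": {"MNAR": [0.333, 0.325, 0.308],
                         "MCAR": [0.150, 0.172, 0.185],
                         "MAR": [0.211, 0.234, 0.200]},
    "Claude 4.5 Sonnet": {"MNAR": [0.369, 0.361, 0.345],
                          "MCAR": [0.153, 0.175, 0.188],
                          "MAR": [0.168, 0.182, 0.196]},
}

def mean_by_mechanism(scores):
    """Average NRMSE over the three missing rates for each mechanism."""
    return {mech: fmean(vals) for mech, vals in scores.items()}

summary = {method: mean_by_mechanism(s) for method, s in NRMSE.items()}

# Lower NRMSE is better, so take the method with the smallest mean.
best = {mech: min(summary, key=lambda m: summary[m][mech])
        for mech in ("MNAR", "MCAR", "MAR")}

for mech, method in best.items():
    print(f"{mech}: {method} (mean NRMSE {summary[method][mech]:.3f})")
```

Among these four rows, the two LLMs have the lowest mean NRMSE under every mechanism, consistent with the first finding above.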

This explainer was generated by AI and reviewed by a human editor.