QUICK REVIEW

[Paper Review] Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms

Arthur Dantas Mangussi, Ricardo Cardoso Pereira | arXiv (Cornell University) | Mar 20, 2026

Machine Learning in Healthcare | Cited by: 0

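One-Line Summary

The paper compares five LLMs against six traditional imputation baselines on 29 datasets (real and synthetic), evaluated under MCAR, MAR, and MNAR. LLMs perform better on real data, but they are prone to hallucination and can be costly. Their performance is tied to prior domain knowledge.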

ABSTRACT

Data imputation is a cornerstone technique for handling missing values in real-world datasets, which are often plagued by missingness. Despite recent progress, prior studies on Large Language Models-based imputation remain limited by scalability challenges, restricted cross-model comparisons, and evaluations conducted on small or domain-specific datasets. Furthermore, heterogeneous experimental protocols and inconsistent treatment of missingness mechanisms (MCAR, MAR, and MNAR) hinder systematic benchmarking across methods. This work investigates the robustness of Large Language Models for missing data imputation in tabular datasets using a zero-shot prompt engineering approach. To this end, we present a comprehensive benchmarking study comparing five widely used LLMs against six state-of-the-art imputation baselines. The experimental design evaluates these methods across 29 datasets (including nine synthetic datasets) under MCAR, MAR, and MNAR mechanisms, with missing rates of up to 20%. The results demonstrate that leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently achieve superior performance on real-world open-source datasets compared to traditional methods. However, this advantage appears to be closely tied to the models' prior exposure to domain-specific patterns learned during pre-training on internet-scale corpora. In contrast, on synthetic datasets, traditional methods such as MICE outperform LLMs, suggesting that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. Furthermore, we identify a clear trade-off: while LLMs excel in imputation quality, they incur significantly higher computational time and monetary costs. Overall, this study provides a large-scale comparative analysis, positioning LLMs as promising semantics-driven imputers for complex tabular data.

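Motivation and Objectives

Evaluate the robustness of multiple LLMs for missing-data imputation using zero-shot prompt engineering.
Determine whether the LLMs' pre-training knowledge improves imputation performance over conventional baselines on open real-world datasets.
Investigate hallucination risk and the role of semantic context in LLM-based imputation.
Provide a scalable, reproducible evaluation framework built on standardized missingness mechanisms.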

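Proposed Method

Impute missing values in 29 datasets (9 synthetic, 20 open-source) using five LLMs and six traditional baselines.
Introduce a batch prompt-construction strategy that combines a system persona, constraints, and a strict output format to keep the imputation robust.
Apply stratified five-fold cross-validation under MCAR, MAR, and MNAR with missing rates of 5%, 10%, and 20%.
Evaluate with normalized root mean squared error (NRMSE) and analyze computational cost (tokens, time, and money).
Adopt a sliding-window batching scheme that feeds 40x10 subsets to the LLM, with retries and a mean-imputation fallback when needed.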
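The sliding-window batching with retries and mean fallback can be pictured with the minimal Python sketch below. It is not the authors' implementation. It assumes that "40x10" means 40 rows by 10 columns, that windows do not overlap, and that a reply is accepted only if it keeps the block's shape and leaves observed cells unchanged. `call_llm`, `iter_windows`, `is_valid_reply`, and `impute_table` are hypothetical names introduced for illustration; `call_llm` stands in for whatever function builds the prompt, queries the model, and parses its output.

```python
import math
from typing import Callable, Iterator, List, Optional, Tuple

Block = List[List[Optional[float]]]  # None marks a missing cell


def iter_windows(n_rows: int, n_cols: int, win_rows: int = 40,
                 win_cols: int = 10) -> Iterator[Tuple[slice, slice]]:
    """Yield non-overlapping (row slice, column slice) tiles covering the table."""
    for r in range(0, n_rows, win_rows):
        for c in range(0, n_cols, win_cols):
            yield (slice(r, min(r + win_rows, n_rows)),
                   slice(c, min(c + win_cols, n_cols)))


def column_means(table: Block) -> List[float]:
    """Mean of the observed values in each column (0.0 if a column is empty)."""
    means = []
    for j in range(len(table[0])):
        observed = [row[j] for row in table if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def is_valid_reply(block: Block, reply: Optional[Block]) -> bool:
    """Accept a reply only if it has the same shape, contains no gaps,
    and leaves every observed cell untouched (assumed acceptance rule)."""
    if reply is None or len(reply) != len(block):
        return False
    for row_in, row_out in zip(block, reply):
        if len(row_out) != len(row_in):
            return False
        for before, after in zip(row_in, row_out):
            if not isinstance(after, (int, float)) or math.isnan(after):
                return False
            if before is not None and before != after:
                return False
    return True


def impute_table(table: Block, call_llm: Callable[[Block], Block],
                 max_retries: int = 2) -> Block:
    """Impute window by window; fall back to column means when the model fails."""
    means = column_means(table)
    result = [row[:] for row in table]
    for rows, cols in iter_windows(len(table), len(table[0])):
        block = [row[cols] for row in table[rows]]
        reply = None
        for _ in range(max_retries + 1):
            try:
                candidate = call_llm(block)
            except Exception:
                continue  # treat API or parsing errors as a failed attempt
            if is_valid_reply(block, candidate):
                reply = candidate
                break
        for i, row in enumerate(block):
            for j, cell in enumerate(row):
                if cell is None:
                    gi, gj = rows.start + i, cols.start + j
                    result[gi][gj] = reply[i][j] if reply is not None else means[gj]
    return result
```

Passing the model call in as a plain callable keeps the batching logic testable without network access: a stub that always raises exercises the mean fallback, and a stub that returns a filled block exercises the normal path.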

Figure 1: Overview of methodology applied in this work.
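To make the evaluation protocol more concrete, the following Python sketch generates missingness for a single numeric column under the three mechanisms and scores an imputation with NRMSE. The review does not say how the paper generates missingness or how it normalizes the RMSE, so the choices here are assumptions: MCAR hides a random subset, MAR hides the rows where another fully observed column is largest, MNAR hides the largest values of the target column itself, and the RMSE is divided by the range of the true column. All function names are illustrative.

```python
import math
import random
from typing import List, Sequence


def mcar_mask(n: int, rate: float, rng: random.Random) -> List[bool]:
    """MCAR: hide a uniformly random subset, independent of any values."""
    hidden = set(rng.sample(range(n), round(n * rate)))
    return [i in hidden for i in range(n)]


def top_fraction_mask(values: Sequence[float], rate: float) -> List[bool]:
    """Mark the rows holding the largest `rate` fraction of `values`."""
    k = round(len(values) * rate)
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    hidden = set(order[:k])
    return [i in hidden for i in range(len(values))]


def mar_mask(driver: Sequence[float], rate: float) -> List[bool]:
    """MAR: missingness in the target depends on another, observed column."""
    return top_fraction_mask(driver, rate)


def mnar_mask(target: Sequence[float], rate: float) -> List[bool]:
    """MNAR: missingness depends on the unobserved target values themselves."""
    return top_fraction_mask(target, rate)


def nrmse(true: Sequence[float], imputed: Sequence[float],
          mask: Sequence[bool]) -> float:
    """RMSE over the hidden cells, divided by the range of the true column."""
    errors = [(t - p) ** 2 for t, p, m in zip(true, imputed, mask) if m]
    if not errors:
        return 0.0
    rmse = math.sqrt(sum(errors) / len(errors))
    spread = max(true) - min(true)
    return rmse / spread if spread else rmse


def mean_impute(true: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Baseline: fill hidden cells with the mean of the observed ones."""
    observed = [t for t, m in zip(true, mask) if not m]
    fill = sum(observed) / len(observed)
    return [fill if m else t for t, m in zip(true, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    target = [float(v) for v in range(1, 11)]
    driver = [float(v) for v in range(10, 0, -1)]
    for name, mask in [("MCAR", mcar_mask(len(target), 0.2, rng)),
                       ("MAR", mar_mask(driver, 0.2)),
                       ("MNAR", mnar_mask(target, 0.2))]:
        score = nrmse(target, mean_impute(target, mask), mask)
        print(f"{name}: NRMSE of mean imputation = {score:.3f}")
```

In this toy setup the MNAR mask removes exactly the values that are least like the observed ones, so a fill based only on observed data misses them by a wide margin. This illustrates why value-dependent missingness is hard for purely statistical imputers, and it says nothing about the paper's actual numbers.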

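Experimental Results

Research Questions

RQ1: Can LLMs robustly impute missing data through prompt engineering alone, or does this introduce bias?
RQ2: Does prior knowledge from internet-scale corpora improve imputation performance?
RQ3: Are LLMs prone to hallucination in unfamiliar imputation contexts?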

Key Findings

Method	5% MNAR	10% MNAR	20% MNAR	5% MCAR	10% MCAR	20% MCAR	5% MAR	10% MAR	20% MAR
SoftImpute	0.654	0.644	0.649	0.273	0.294	0.320	0.311	0.325	0.351
kNN	0.485	0.496	0.509	0.203	0.228	0.256	0.236	0.249	0.284
missForest	0.418	0.440	0.453	0.192	0.218	0.242	0.233	0.242	0.283
MICE	0.426	0.439	0.475	0.174	0.212	0.292	0.211	0.227	0.298
SAEI	0.518	0.482	0.418	0.295	0.313	0.320	0.330	0.333	0.335
TabPFN	0.621	0.683	0.710	0.219	0.276	0.437	0.317	0.354	0.411
Xiaomi: MiMo-V2-Flash	0.439	0.435	0.416	0.207	0.236	0.249	0.204	0.221	0.225
Mistral: Devstral 2 2512	0.435	0.424	0.389	0.210	0.229	0.236	0.207	0.218	0.235
Gemini 3.0 Flash	0.333	0.325	0.308	0.150	0.172	0.185	0.211	0.234	0.200
Claude 4.5 Sonnet	0.369	0.361	0.345	0.153	0.175	0.188	0.168	0.182	0.196
GPT-4.1-Nano	0.432	0.405	0.425	0.221	0.234	0.252	0.221	0.232	0.240
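The table is easier to read column by column. The short Python sketch below copies the values exactly as reproduced in this review (lower is better) and reports the best method overall and the best non-LLM baseline for each setting. The assignment of the first six rows to baselines follows the paper's count of six baselines and five LLMs; the variable and function names are illustrative.

```python
SETTINGS = ["5% MNAR", "10% MNAR", "20% MNAR",
            "5% MCAR", "10% MCAR", "20% MCAR",
            "5% MAR", "10% MAR", "20% MAR"]

# Values copied from the table above, one list per method, in SETTINGS order.
SCORES = {
    "SoftImpute": [0.654, 0.644, 0.649, 0.273, 0.294, 0.320, 0.311, 0.325, 0.351],
    "kNN": [0.485, 0.496, 0.509, 0.203, 0.228, 0.256, 0.236, 0.249, 0.284],
    "missForest": [0.418, 0.440, 0.453, 0.192, 0.218, 0.242, 0.233, 0.242, 0.283],
    "MICE": [0.426, 0.439, 0.475, 0.174, 0.212, 0.292, 0.211, 0.227, 0.298],
    "SAEI": [0.518, 0.482, 0.418, 0.295, 0.313, 0.320, 0.330, 0.333, 0.335],
    "TabPFN": [0.621, 0.683, 0.710, 0.219, 0.276, 0.437, 0.317, 0.354, 0.411],
    "Xiaomi: MiMo-V2-Flash": [0.439, 0.435, 0.416, 0.207, 0.236, 0.249, 0.204, 0.221, 0.225],
    "Mistral: Devstral 2 2512": [0.435, 0.424, 0.389, 0.210, 0.229, 0.236, 0.207, 0.218, 0.235],
    "Gemini 3.0 Flash": [0.333, 0.325, 0.308, 0.150, 0.172, 0.185, 0.211, 0.234, 0.200],
    "Claude 4.5 Sonnet": [0.369, 0.361, 0.345, 0.153, 0.175, 0.188, 0.168, 0.182, 0.196],
    "GPT-4.1-Nano": [0.432, 0.405, 0.425, 0.221, 0.234, 0.252, 0.221, 0.232, 0.240],
}

BASELINES = ["SoftImpute", "kNN", "missForest", "MICE", "SAEI", "TabPFN"]


def best_per_setting(methods):
    """Return {setting: method with the lowest score} among `methods`."""
    return {
        setting: min(methods, key=lambda m: SCORES[m][j])
        for j, setting in enumerate(SETTINGS)
    }


best_overall = best_per_setting(list(SCORES))
best_baseline = best_per_setting(BASELINES)

if __name__ == "__main__":
    for setting in SETTINGS:
        print(f"{setting:>9}: best overall = {best_overall[setting]}, "
              f"best baseline = {best_baseline[setting]}")
```

On these numbers, Gemini 3.0 Flash has the lowest value in every MNAR and MCAR column, and Claude 4.5 Sonnet has the lowest value in every MAR column.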

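Gemini 3.0 Flash and Claude 4.5 Sonnet outperform the classical baselines in imputation quality (NRMSE) on open real-world datasets.
On synthetic datasets, conventional methods (e.g., MICE, missForest) can outperform LLMs, which suggests that semantic context is what helps LLMs on real-data tasks.
LLMs achieve high imputation quality but incur higher computation time and monetary cost.
MNAR remains difficult for machine-learning-based methods, whereas LLMs benefit from semantic context.
Differences among the LLMs suggest that training date and pre-training data affect performance.
Post-hoc analysis found no significant difference in overall performance between Gemini 3.0 Flash and Claude 4.5 Sonnet.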

Figure 2: Illustration of the complete prompt structure used to perform data imputation via prompt engineering.


This review was generated by AI and checked by a human editor.