QUICK REVIEW

[Paper Review] Large Language Models are Algorithmically Blind

Sohan Venkatesh, Ashish Mahendran Kurapath|arXiv (Cornell University)|Feb 25, 2026

Artificial Intelligence in Healthcare and Education | Citations: 0

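One-line summary

This paper evaluates eight frontier LLMs and shows that most fail to provide calibrated predictions of algorithm performance in causal discovery, a failure the authors call algorithmic blindness; the models often perform no better than, or worse than, a random baseline.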

ABSTRACT

Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.

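Motivation and objectives

Evaluate whether frontier LLMs can predict algorithm performance on causal discovery tasks with calibrated uncertainty.
Quantify calibration by interval coverage, i.e., how often an LLM's predicted interval contains the measured ground truth.
Separate memorization effects from genuine reasoning by testing on both benchmark and synthetic datasets.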

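Proposed method

Run 5,200 causal discovery experiments (13 datasets × 4 algorithms × 100 runs) to compute true algorithm performance, estimating empirical means and bootstrap CIs.
Query eight frontier LLMs with three prompt formats per condition to elicit predicted performance intervals for four metrics.
Aggregate predictions across prompt formats and evaluate calibration coverage against the ground truth.
Compare the LLMs against random and heuristic baselines to assess their added value.
Analyze prompt sensitivity via the coefficient of variation, and examine the effect of dataset type (benchmark vs. synthetic).
Investigate interval-width (range) signals, inter-model agreement, and algorithm–metric interactions for signs of memorization.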
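
The ground-truth and coverage steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the review only says "bootstrap CI", so the percentile bootstrap is an assumed choice, the names `bootstrap_ci` and `covers` are hypothetical, and the F1 scores and predicted ranges are made-up demonstration values. Coverage is taken to mean that a predicted range contains the empirical mean, following the abstract's wording.

```python
import random
import statistics

def bootstrap_ci(samples, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant of the bootstrap)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def covers(pred_low, pred_high, true_mean):
    """True if the predicted range contains the empirical mean."""
    return pred_low <= true_mean <= pred_high

# Hypothetical F1 scores from 100 runs of one algorithm on one dataset.
rng = random.Random(42)
f1_runs = [min(1.0, max(0.0, rng.gauss(0.62, 0.05))) for _ in range(100)]

true_mean = statistics.fmean(f1_runs)
ci_low, ci_high = bootstrap_ci(f1_runs)

# Hypothetical predicted (low, high) ranges from three models.
predictions = [(0.10, 0.50), (0.30, 0.90), (0.70, 0.95)]
hits = sum(covers(lo, hi, true_mean) for lo, hi in predictions)
coverage = hits / len(predictions)

print(f"true mean = {true_mean:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"coverage = {hits}/{len(predictions)} = {coverage:.1%}")
```

In the study itself, the same kind of check would be repeated for every dataset, algorithm, and metric combination and then averaged per model.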

Figure 1: Comparison of LLM estimates and algorithmic ground truth revealing algorithmic blindness.

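Experimental results

Research questions

RQ1: Given the problem structure, can frontier LLMs provide calibrated interval estimates of causal discovery algorithm performance?
RQ2: When predicting algorithm performance, do LLMs rely on principled reasoning, or do they draw on memorized benchmark statistics?
RQ3: How do LLM predictions differ between benchmark and synthetic datasets, and across algorithms and metrics?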

Key findings

Model	Coverage (%)	Covered / total comparisons	Mean score
Claude	39.4	82/208	0.442
GPT-5	15.4	32/208	0.217
DeepSeek-Think	14.9	31/208	0.174
DeepSeek	14.4	30/208	0.198
Qwen-Think	13.9	29/208	0.191
Gemini 3	13.0	27/208	0.182
LLaMA	10.1	21/208	0.152
Qwen	5.8	12/208	0.068
Mean	—	—	—
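
The coverage column can be re-derived from the raw counts in the table. The short check below is a sketch: the counts and the 36.5% random baseline are taken from this review, while reading 208 as 13 datasets × 4 algorithms × 4 metrics is an inference from the method description rather than something the review states outright.

```python
# Covered comparisons per model, copied from the table above.
hits = {
    "Claude": 82,
    "GPT-5": 32,
    "DeepSeek-Think": 31,
    "DeepSeek": 30,
    "Qwen-Think": 29,
    "Gemini 3": 27,
    "LLaMA": 21,
    "Qwen": 12,
}
TOTAL_PER_MODEL = 208        # assumed to be 13 datasets x 4 algorithms x 4 metrics
RANDOM_BASELINE = 36.5       # percent, as quoted in the review

# Per-model coverage in percent, rounded as in the table.
coverage = {m: round(100 * h / TOTAL_PER_MODEL, 1) for m, h in hits.items()}

# Pooled coverage over all 8 x 208 = 1,664 comparisons.
total_comparisons = TOTAL_PER_MODEL * len(hits)
pooled = round(100 * sum(hits.values()) / total_comparisons, 1)

below_random = [m for m, c in coverage.items() if c < RANDOM_BASELINE]

print(coverage)
print(f"pooled coverage: {pooled}% over {total_comparisons} comparisons")
print(f"{len(below_random)} of {len(hits)} models fall below random")
```

The recomputed values match the table's percentages, and pooling the counts gives 264 covered comparisons out of 1,664.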

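Mean calibration coverage across the 1,664 comparisons (8 models × 208) is 15.9%, and seven of the eight models fall below the random baseline.
Claude performs best with 39.4% coverage, yet that is only marginally above random (36.5%).
The best model's slight edge is more consistent with benchmark memorization than with principled reasoning.
Predicted intervals are 8–27 times wider than the true confidence intervals, yet coverage remains low.
On synthetic data, coverage drops sharply and inter-model disagreement is pronounced, pointing to memorization rather than principled generalization.
Algorithm–metric interactions and compressed interval widths suggest retrieval of benchmark statistics rather than structure-conditioned understanding.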
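
Two of the diagnostics mentioned in this review, the width ratio between predicted ranges and true CIs and the coefficient of variation used for prompt sensitivity, can be illustrated with a small sketch. All numbers are hypothetical, the function names are invented for this example, and computing the coefficient of variation on interval midpoints across the three prompt formats is an assumption; the review does not say which quantity the authors used.

```python
import statistics

def width_ratio(pred, true_ci):
    """Width of a predicted (low, high) range divided by the true CI width."""
    return (pred[1] - pred[0]) / (true_ci[1] - true_ci[0])

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)

# Hypothetical true 95% CI for one dataset/algorithm/metric condition.
true_ci = (0.60, 0.64)

# Hypothetical predictions from one model under three prompt formats.
prompt_preds = [(0.40, 0.80), (0.35, 0.75), (0.50, 0.90)]

ratios = [width_ratio(p, true_ci) for p in prompt_preds]
midpoints = [(lo + hi) / 2 for lo, hi in prompt_preds]
cv = coefficient_of_variation(midpoints)

print("width ratios:", [round(r, 1) for r in ratios])
print(f"prompt-sensitivity CV of midpoints: {cv:.3f}")
```

In this toy case each predicted range is about ten times wider than the true CI, which falls inside the 8–27× range reported above; a wide range that still misses the true mean is the pattern the review describes.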

Figure 2: Methodology overview. LLMs are prompted with dataset characteristics and algorithmic assumptions to predict performance metric ranges (Precision, Recall, F1, SHD). Ground-truth metrics with bootstrap 95% confidence intervals are computed via large-scale executions and calibration is evalua


This review was generated by AI and checked by a human editor.