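[Paper Summary] Can Large Language Models Infer Causation from Correlation?
This paper introduces Corr2Cause, a large-scale benchmark that tests whether a model can infer causal relations from correlational statements alone. Off-the-shelf LLMs perform close to random on it; finetuning yields mixed gains that generalize poorly to out-of-distribution inputs.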
Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. This shortcoming is somewhat mitigated when we try to re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize -- they can only perform causal inference in in-distribution settings when variable names and textual expressions used in the queries are similar to those in the training set, but fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and would be helpful in guiding future research on improving LLMs' pure reasoning skills and generalizability. Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at https://github.com/causalNLP/corr2cause.
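Motivation and Objectives
- Assess whether current LLMs can infer causation from correlation without relying on empirical knowledge.
- Build a large dataset for evaluating the pure causal reasoning ability of NLP models.
- Analyze performance on Corr2Cause across model architectures, both before and after finetuning.
- Study robustness and generalization on out-of-distribution inputs.
Proposed Method
- Define the Corr2Cause task as a function f(s, h) -> v that maps a set of correlational statements s and a causal hypothesis h to a validity label v.
- Generate a dataset of more than 200K samples from structural causal modeling concepts (directed graphical causal models (DGCMs), d-separation, Markov equivalence classes (MECs)) and the principles of causal discovery.
- Use a data generation procedure inspired by the PC algorithm to decide whether the hypothesized relation holds in every graph of the Markov equivalence class.
- Express DS and the hypothesis as natural-language prompts so that LLMs can be evaluated on them.
- Evaluate 17 LLMs on Corr2Cause (BERT-based NLI models, RoBERTa, the GPT family, LLaMA, and others) in zero-shot and finetuned settings.
- Run robustness tests based on paraphrasing and variable refactoring to assess generalization.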
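The labeling rule described above (a hypothesis is valid only if it holds in every graph of the Markov equivalence class) can be sketched for three variables. This is a minimal illustration, not the paper's generation pipeline: it groups DAGs by the standard characterization of Markov equivalence (same skeleton and same v-structures) instead of running a PC-style procedure, and all function names (`all_dags`, `mec_key`, `label`, `is_parent`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, permutations, product

NODES = ("A", "B", "C")

def is_acyclic(nodes, edges):
    # A directed graph is acyclic iff some ordering puts every edge "forward".
    return any(
        all(order.index(u) < order.index(v) for u, v in edges)
        for order in permutations(nodes)
    )

def all_dags(nodes):
    """Enumerate every DAG over `nodes` as a frozenset of directed edges (u, v)."""
    pairs = list(combinations(nodes, 2))
    dags = set()
    # Each unordered pair is absent, oriented u -> v, or oriented v -> u.
    for choice in product((None, "fwd", "bwd"), repeat=len(pairs)):
        edges = set()
        for (u, v), c in zip(pairs, choice):
            if c == "fwd":
                edges.add((u, v))
            elif c == "bwd":
                edges.add((v, u))
        if is_acyclic(nodes, edges):
            dags.add(frozenset(edges))
    return dags

def mec_key(edges):
    """Skeleton plus v-structures; two DAGs are Markov equivalent iff these match."""
    skeleton = frozenset(frozenset(e) for e in edges)
    colliders = frozenset(
        (frozenset((a, b)), c)
        for (a, c) in edges
        for (b, c2) in edges
        if c == c2 and a != b and frozenset((a, b)) not in skeleton
    )
    return skeleton, colliders

def group_into_mecs(dags):
    classes = defaultdict(list)
    for g in dags:
        classes[mec_key(g)].append(g)
    return classes

def is_parent(x, y):
    # Hypothesis "x is a parent of y", checked on a single DAG.
    return lambda edges: (x, y) in edges

def label(mec, hypothesis):
    """Return 1 if the hypothesis holds in every DAG of the class, else 0."""
    return int(all(hypothesis(g) for g in mec))

dags = all_dags(NODES)
mecs = group_into_mecs(dags)

collider = frozenset({("A", "C"), ("B", "C")})  # A -> C <- B
chain = frozenset({("A", "B"), ("B", "C")})     # A -> B -> C

print(len(dags), len(mecs))                               # 25 DAGs, 11 classes
print(label(mecs[mec_key(collider)], is_parent("A", "C")))  # 1
print(label(mecs[mec_key(chain)], is_parent("A", "B")))     # 0
```

The collider A -> C <- B is alone in its class, so "A is a parent of C" is valid. The chain A -> B -> C shares its class with C -> B -> A and A <- B -> C, so "A is a parent of B" is not valid, even though it is true in one member graph.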

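Experimental Results
Research Questions
- RQ1: How do off-the-shelf LLMs perform on the pure causal inference task of Corr2Cause?
- RQ2: Does finetuning improve LLMs' causal inference ability, and are the gains robust to distribution shift?
- RQ3: Do models rely on surface cues or on genuine reasoning, as probed by perturbations such as paraphrasing and variable renaming?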
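The variable-renaming perturbation behind RQ3 can be illustrated with a short sketch. The premise and hypothesis wording below is illustrative only and is not claimed to match the dataset's exact template; `rename_variables` and the chosen mapping are hypothetical. The point is that the causal structure, and therefore the label, is unchanged by the renaming, so a model that reasons over the structure should give the same answer.

```python
import re

def rename_variables(text, mapping):
    """Replace whole-word variable names in a single pass.

    A single pass matters: with a mapping such as A -> B, B -> A,
    sequential str.replace calls would overwrite each other.
    Caveat: a single-letter name like "A" also matches the English
    article "A", so real text would need a stricter pattern.
    """
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

# Illustrative wording (an assumption, not the dataset's template).
premise = (
    "Suppose there are three variables A, B and C. "
    "A correlates with C. B correlates with C. "
    "However, A is independent of B."
)
hypothesis = "A directly causes C."

mapping = {"A": "Z", "B": "Y", "C": "X"}
perturbed_premise = rename_variables(premise, mapping)
perturbed_hypothesis = rename_variables(hypothesis, mapping)

print(perturbed_premise)
print(perturbed_hypothesis)
```

Applying the same mapping to both premise and hypothesis keeps the pair consistent, so the original validity label can be reused for the perturbed example.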
Key Findings
| Model | F1 | Precision | Recall | Accuracy |
|---|---|---|---|---|
| BART MNLI | 33.38 | 31.59 | 35.38 | 78.50 |
| RoBERTa MNLI | 22.79 | 34.73 | 16.96 | 82.50 |
| DeBERTa MNLI | 14.52 | 14.71 | 14.33 | 74.31 |
| DistilBERT MNLI | 20.70 | 24.12 | 18.13 | 78.85 |
| GPT-3 Davinci | 27.82 | 16.57 | 86.55 | 31.61 |
| GPT-3 Instruct (text-davinci-001) | 17.99 | 11.84 | 37.43 | 48.04 |
| GPT-3 Instruct (text-davinci-002) | 21.87 | 13.46 | 58.19 | 36.69 |
| GPT-3 Instruct (text-davinci-003) | 15.72 | 13.40 | 19.01 | 68.97 |
| GPT-3.5 | 21.69 | 17.79 | 27.78 | 69.46 |
| GPT-4 | 29.08 | 20.92 | 47.66 | 64.60 |
| GPT-3 Ada (finetuned) | 79.85 | 70.47 | 92.11 | 92.92 |
| GPT-3 Babbage (finetuned) | 78.19 | 69.98 | 88.60 | 92.48 |
| GPT-3 Curie (finetuned) | 81.23 | 75.00 | 88.60 | 93.77 |
| GPT-3 Davinci (finetuned) | 85.52 | 80.26 | 91.52 | 95.28 |
| GPT2 (finetuned) | 89.18 | 88.03 | 90.35 | 96.66 |
| GPT2-Large (finetuned) | 94.29 | 92.18 | 96.49 | 98.22 |
| GPT2-XL (finetuned) | 94.30 | 91.94 | 96.78 | 98.22 |
| LLaMA-7B (finetuned) | 91.98 | 88.62 | 95.61 | 97.46 |
| LLaMa2-7B (finetuned) | 92.92 | 90.11 | 95.91 | 97.77 |
| RoBERTa-Large MNLI (finetuned) | 94.74 | 92.24 | 97.37 | 98.35 |
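- Most off-the-shelf LLMs perform poorly on Corr2Cause, close to the random baseline.
- Among the non-finetuned models, the best overall F1 is 33.38% (BART MNLI).
- Finetuning brings large improvements (for example, RoBERTa-Large MNLI reaches 94.74% F1 on the original test set), but the robustness tests show sharp drops under paraphrasing or variable refactoring.
- The robustness tests expose a substantial generalization gap: paraphrasing lowers F1 by up to 39.29%, and variable renaming lowers it by up to 62.3%.
- Finetuned models are strong on most relation types (for example, F1 exceeds 96% on Is-Parent, Is-Descendant, and Has-Confounder), but Has-Collider remains weaker.
- The dataset reveals how hard it is to generalize pure causal reasoning, and highlights the need for adversarial testing in future work.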
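As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from the other two columns. The sketch below does this for a few rows copied from the table; the helper name `f1_score` and the 0.05 tolerance (to absorb rounding in the reported precision and recall) are assumptions of this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (reported F1, precision, recall), copied from the results table.
rows = {
    "BART MNLI": (33.38, 31.59, 35.38),
    "RoBERTa MNLI": (22.79, 34.73, 16.96),
    "GPT-3 Davinci": (27.82, 16.57, 86.55),
    "GPT-4": (29.08, 20.92, 47.66),
    "RoBERTa-Large MNLI (finetuned)": (94.74, 92.24, 97.37),
}

for name, (reported, p, r) in rows.items():
    recomputed = f1_score(p, r)
    # Allow a small tolerance because the inputs are already rounded.
    status = "ok" if abs(recomputed - reported) < 0.05 else "MISMATCH"
    print(f"{name}: reported {reported:.2f}, recomputed {recomputed:.2f} [{status}]")
```

The GPT-3 Davinci row shows why F1 is the headline metric here: recall of 86.55 combined with precision of 16.57 still gives an F1 below 28, because the harmonic mean is dominated by the smaller of the two values.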

This summary was generated by AI and reviewed by a human editor.