QUICK REVIEW

[Paper Review] Can Post-Training Transform LLMs into Causal Reasoners?

Junqi Chen, Sirui Chen | arXiv (Cornell University) | Feb 6, 2026
Bayesian Modeling and Causal Inference | Citations: 0
One-Line Summary

The paper systematically studies whether post-training can turn small LLMs into effective causal reasoners. Using the CauGym dataset, it finds that online RL methods, especially GRPO, perform best: the 14B GRPO model reaches 93.5% accuracy on CaLM, outperforming much larger models.

ABSTRACT

Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO-model are available at https://github.com/OpenCausaLab/CauGym.

Motivation and Objectives

  • Motivate the need for accessible causal inference and counterfactual reasoning for non-experts.
  • Evaluate how five post-training approaches affect LLMs’ causal inference abilities.
  • Introduce CauGym, which provides training data for seven core causal tasks and five test sets, and evaluate on nine test sets in total (five in-domain plus four existing benchmarks).
  • Compare SFT, DPO, KTO, PPO, and GRPO across in-domain and benchmark datasets.
  • Demonstrate whether targeted post-training yields robust, generalizable causal reasoners.

Proposed Method

  • Construct CauGym by generating synthetic SCM-based DAGs with seven causal tasks (ATE, CDE, ETT, NDE, NIE, PN, PS).
  • Apply two-stage training: a cold start with SFT, followed by one of five post-training methods (SFT, PPO, GRPO, DPO, KTO).
  • Evaluate on nine test sets to assess generalization, internalization, and robustness.
  • Report accuracy as the evaluation metric with five independent runs for reliability.
  • Provide two post-training adaptations per method (e.g., positive/negative samples for offline RL, chain-of-thought prompts for SFT).
  • Compare against baselines including several large LLMs.
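To make the first task in the list concrete, here is a minimal sketch of how an average treatment effect (ATE) can be read off a small structural causal model. The graph (Z -> X, Z -> Y, X -> Y) and every probability below are hypothetical values chosen for illustration; they are not taken from CauGym, and the paper's actual data generator may work differently.

```python
# Toy binary SCM with a confounder Z (all numbers are illustrative assumptions).
# Graph: Z -> X, Z -> Y, X -> Y
P_Z1 = 0.5                                   # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.2, (0, 1): 0.4,   # P(Y=1 | X=x, Z=z)
                 (1, 0): 0.6, (1, 1): 0.8}


def p_z(z):
    """Marginal probability P(Z=z)."""
    return P_Z1 if z == 1 else 1.0 - P_Z1


def p_y1_do_x(x):
    """Interventional P(Y=1 | do(X=x)) via backdoor adjustment over Z."""
    return sum(p_z(z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


def p_y1_given_x(x):
    """Observational P(Y=1 | X=x), which is confounded by Z."""
    def p_x_given_z(z):
        return P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]

    joint = {z: p_z(z) * p_x_given_z(z) for z in (0, 1)}  # P(Z=z, X=x)
    p_x = sum(joint.values())                             # P(X=x)
    return sum(joint[z] / p_x * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


# ATE = P(Y=1 | do(X=1)) - P(Y=1 | do(X=0))
ate = p_y1_do_x(1) - p_y1_do_x(0)
# Naive observational contrast, shown only to illustrate confounding bias.
naive = p_y1_given_x(1) - p_y1_given_x(0)

print(f"ATE (adjusted): {ate:.3f}")
print(f"Naive contrast: {naive:.3f}")
```

The gap between the adjusted and naive values in this toy model shows why such questions require causal rather than purely associational reasoning.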

Experimental Results

Research Questions

  • RQ1: Can LLMs become effective causal reasoners through post-training?
  • RQ2: Which post-training method best enhances causal inference, and by how much?
  • RQ3: Do post-trained LLMs generalize to rephrased questions, internalize causal theorems, and remain robust under noise or incomplete data?
  • RQ4: How do small LLMs after post-training compare to larger models on causal benchmarks?

Key Findings

LLM                           ATE    CDE    ETT    NDE    NIE    PN     PS     Avg.
Llama-3.3-70B                 0.572  0.372  0.288  0.430  0.200  0.010  0.010  0.269
Qwen3-235B                    0.004  0.000  0.180  0.230  0.000  0.000  0.000  0.059
DeepSeek-R1-0528-671B         0.740  0.540  0.220  0.460  0.450  0.780  0.800  0.570
Gemini 2.5 Pro                0.760  0.710  0.320  0.590  0.470  0.240  0.050  0.448
OpenAI o3                     0.840  0.590  0.300  0.430  0.720  0.450  0.550  0.554
DeepSeek-R1-Distill-Qwen-14B  0.594  0.364  0.210  0.442  0.212  0.014  0.066  0.272
Cold Start Base               0.634  0.550  0.156  0.294  0.434  0.788  0.714  0.510
SFT                           0.852  0.828  0.470  0.560  0.604  0.858  0.766  0.702
DPO                           0.656  0.514  0.198  0.282  0.510  0.806  0.708  0.524
KTO                           0.716  0.674  0.232  0.412  0.472  0.812  0.700  0.574
PPO                           0.972  0.982  0.806  0.926  0.924  0.940  0.902  0.921
GRPO                          0.990  0.994  0.900  0.940  0.930  0.928  0.866  0.935

  • GRPO achieves the best overall performance with 93.5% average on CaLM, surpassing DeepSeek-R1-0528-671B (57.0%) and OpenAI o3 (55.4%).
  • Online RL methods (PPO and GRPO) consistently outperform offline RL (DPO, KTO) and SFT across metrics.
  • Compared to the cold-start base model (51.0%), SFT, DPO, KTO, PPO, and GRPO improve average CaLM accuracy by 19.2, 1.4, 6.4, 41.1, and 42.5 percentage points, respectively.
  • Online RL methods show strong generalization to rephrased inputs and robustness to distribution shifts and noisy data, outperforming offline methods.
  • DeepSeek-R1-0528-671B is less robust when instructions are removed, highlighting the value of online RL for causal reasoning.
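The improvement figures in the findings above are absolute differences in average accuracy, which can be checked directly against the Avg. column of the results table. The short sketch below redoes that arithmetic; the dictionary and variable names are illustrative, and the values are copied from the table.

```python
# Average CaLM accuracy per method, copied from the Avg. column of the table.
avg_accuracy = {
    "Cold Start Base": 0.510,
    "SFT": 0.702,
    "DPO": 0.524,
    "KTO": 0.574,
    "PPO": 0.921,
    "GRPO": 0.935,
}

baseline = avg_accuracy["Cold Start Base"]

# Absolute gain over the cold-start model, in percentage points.
gains = {
    method: round((acc - baseline) * 100, 1)
    for method, acc in avg_accuracy.items()
    if method != "Cold Start Base"
}

for method, gain in gains.items():
    print(f"{method}: +{gain} percentage points")
```

Read this way, the two online RL methods add roughly 41 to 43 points over the cold-start model, while DPO and KTO add fewer than 7.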

This review was generated by AI and checked by a human editor.