QUICK REVIEW

[Paper Review] Can Post-Training Transform LLMs into Causal Reasoners?

Junqi Chen, Sirui Chen | arXiv (Cornell University) | Feb 6, 2026

Bayesian Modeling and Causal Inference | Citations: 0

One-Line Summary

Using the CauGym dataset, the paper systematically studies whether post-training can turn small LLMs into effective causal reasoners, and finds that online RL methods, especially GRPO, perform best (93.5% accuracy on CaLM), outperforming much larger models.

ABSTRACT

Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO-model are available at https://github.com/OpenCausaLab/CauGym.

Motivation and Objectives

Motivate the need for accessible causal inference and counterfactual reasoning for non-experts.
Evaluate how five post-training approaches affect LLMs’ causal inference abilities.
Introduce CauGym, a dataset with seven core causal tasks for training and five test sets, and evaluate on nine test sets in total (five in-domain plus four existing benchmarks).
Compare SFT, DPO, KTO, PPO, and GRPO across in-domain and benchmark datasets.
Demonstrate whether targeted post-training yields robust, generalizable causal reasoners.

Proposed Method

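Construct CauGym by generating synthetic DAGs backed by structural causal models (SCMs) and posing seven causal tasks on them: average treatment effect (ATE), controlled direct effect (CDE), effect of treatment on the treated (ETT), natural direct effect (NDE), natural indirect effect (NIE), probability of necessity (PN), and probability of sufficiency (PS).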
Apply two-stage training: cold-start with SFT, followed by five post-training methods (SFT, PPO, GRPO, DPO, KTO).
Evaluate on nine test sets to assess generalization, internalization, and robustness.
Report accuracy as the evaluation metric with five independent runs for reliability.
Provide two post-training adaptations per method (e.g., positive/negative samples for offline RL, chain-of-thought prompts for SFT).
Compare against baselines including several large LLMs.
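
To make the kind of quantity these tasks ask for concrete, here is a minimal sketch of an ATE computation on a toy SCM. The graph (Z -> X, Z -> Y, X -> Y) and every probability below are hypothetical values chosen for illustration; they are not taken from CauGym, and the paper's actual generator may differ.

```python
# Hypothetical binary SCM with a confounder: Z -> X, Z -> Y, X -> Y.
# All numbers are illustrative assumptions, not CauGym data.
P_Z1 = 0.5                                    # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}               # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,    # P(Y=1 | X=x, Z=z)
                 (1, 0): 0.4, (1, 1): 0.8}


def p_z(z):
    """P(Z=z)."""
    return P_Z1 if z == 1 else 1.0 - P_Z1


def p_x_given_z(x, z):
    """P(X=x | Z=z)."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def p_y1_do_x(x):
    """Interventional P(Y=1 | do(X=x)) via backdoor adjustment over Z."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * p_z(z) for z in (0, 1))


def p_y1_given_x(x):
    """Observational P(Y=1 | X=x), which is confounded by Z."""
    num = sum(P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * p_z(z) for z in (0, 1))
    den = sum(p_x_given_z(x, z) * p_z(z) for z in (0, 1))
    return num / den


# ATE = P(Y=1 | do(X=1)) - P(Y=1 | do(X=0))
ate = p_y1_do_x(1) - p_y1_do_x(0)
# Naive observational contrast, shown only to highlight the confounding bias.
naive_diff = p_y1_given_x(1) - p_y1_given_x(0)

print(f"ATE = {ate:.2f}, naive difference = {naive_diff:.2f}")
```

In this toy model the adjusted ATE is 0.30 while the naive observational difference is 0.54, which illustrates why a model has to reason about the graph rather than read off conditional probabilities.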

Experimental Results

Research Questions

RQ1: Can LLMs become effective causal reasoners through post-training?
RQ2: Which post-training method best enhances causal inference, and by how much?
RQ3: Do post-trained LLMs generalize to rephrased questions, internalize causal theorems, and remain robust under noise or incomplete data?
RQ4: How do small LLMs after post-training compare to larger models on causal benchmarks?

Key Findings

LLM	ATE	CDE	ETT	NDE	NIE	PN	PS	Avg.
Llama-3.3-70B	0.572	0.372	0.288	0.430	0.200	0.010	0.010	0.269
Qwen3-235B	0.004	0.000	0.180	0.230	0.000	0.000	0.000	0.059
DeepSeek-R1-0528-671B	0.740	0.540	0.220	0.460	0.450	0.780	0.800	0.570
Gemini 2.5 Pro	0.760	0.710	0.320	0.590	0.470	0.240	0.050	0.448
OpenAI o3	0.840	0.590	0.300	0.430	0.720	0.450	0.550	0.554
DeepSeek-R1-Distill-Qwen-14B	0.594	0.364	0.210	0.442	0.212	0.014	0.066	0.272
Cold Start Base	0.634	0.550	0.156	0.294	0.434	0.788	0.714	0.510
SFT	0.852	0.828	0.470	0.560	0.604	0.858	0.766	0.702
DPO	0.656	0.514	0.198	0.282	0.510	0.806	0.708	0.524
KTO	0.716	0.674	0.232	0.412	0.472	0.812	0.700	0.574
PPO	0.972	0.982	0.806	0.926	0.924	0.940	0.902	0.921
GRPO	0.990	0.994	0.900	0.940	0.930	0.928	0.866	0.935

GRPO achieves the best overall performance with 93.5% average on CaLM, surpassing DeepSeek-R1-0528-671B (57.0%) and OpenAI o3 (55.4%).
Online RL methods (PPO and GRPO) outperform offline RL (DPO, KTO) and SFT on all seven CaLM tasks.
Compared to the cold-start model, SFT, DPO, KTO, PPO, and GRPO raise average CaLM accuracy by 19.2, 1.4, 6.4, 41.1, and 42.5 percentage points, respectively.
Online RL methods show strong generalization to rephrased inputs and robustness to distribution shifts and noisy data, outperforming offline methods.
DeepSeek-R1-0528-671B is less robust when instructions are removed, highlighting the value of online RL for causal reasoning.
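
The gains over the cold-start model can be reproduced from the Avg. column of the table with a few lines of arithmetic. The sketch below uses only the averages reported above; the variable names are illustrative. It reports both absolute gains (percentage points) and relative gains (percent of the cold-start accuracy), since the two are easy to confuse.

```python
# Average CaLM accuracy per method, copied from the Avg. column of the table.
avg_accuracy = {
    "Cold Start Base": 0.510,
    "SFT": 0.702,
    "DPO": 0.524,
    "KTO": 0.574,
    "PPO": 0.921,
    "GRPO": 0.935,
}

baseline = avg_accuracy["Cold Start Base"]

# Absolute gain in percentage points: (accuracy - baseline) * 100.
gains_pp = {
    method: round((acc - baseline) * 100, 1)
    for method, acc in avg_accuracy.items()
    if method != "Cold Start Base"
}

# Relative gain in percent of the cold-start accuracy.
gains_rel = {
    method: round((acc / baseline - 1) * 100, 1)
    for method, acc in avg_accuracy.items()
    if method != "Cold Start Base"
}

for method in gains_pp:
    print(f"{method}: +{gains_pp[method]} points ({gains_rel[method]}% relative)")
```

Read this way, GRPO's 42.5-point gain corresponds to a relative improvement of roughly 83% over the cold-start model.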

This review was generated by AI and checked by a human editor.