[Paper Review] Can Post-Training Transform LLMs into Causal Reasoners?
The paper systematically studies whether post-training can turn small LLMs into effective causal reasoners. Using the CauGym dataset, it finds that online RL methods, especially GRPO, perform best: a 14B model reaches 93.5% accuracy on CaLM, outperforming much larger models.
Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO-model are available at https://github.com/OpenCausaLab/CauGym.
Research Motivation and Objectives
- Motivate the need for accessible causal inference and counterfactual reasoning for non-experts.
- Evaluate how five post-training approaches affect LLMs’ causal inference abilities.
- Introduce CauGym, which provides training data for seven core causal tasks and five test sets, and evaluate on nine test sets in total (five in-domain and four existing benchmarks).
- Compare SFT, DPO, KTO, PPO, and GRPO across in-domain and benchmark datasets.
- Demonstrate whether targeted post-training yields robust, generalizable causal reasoners.
Proposed Method
- Construct CauGym by generating synthetic SCM-based DAGs with seven causal tasks (ATE, CDE, ETT, NDE, NIE, PN, PS).
- Apply two-stage training: cold-start with SFT, followed by five post-training methods (SFT, PPO, GRPO, DPO, KTO).
- Evaluate on nine test sets to assess generalization, internalization, and robustness.
- Report accuracy as the evaluation metric with five independent runs for reliability.
- Provide two post-training adaptations per method (e.g., positive/negative samples for offline RL, chain-of-thought prompts for SFT).
- Compare against baselines including several large LLMs.
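To make the kind of quantity behind these tasks concrete, the following is a minimal sketch of one of them, the average treatment effect (ATE), computed by backdoor adjustment on a toy binary SCM. The graph (Z -> X, Z -> Y, X -> Y), the probability values, and the function names are all hypothetical and chosen for illustration; they are not taken from CauGym or from the paper's data generator.

```python
# Toy binary SCM with a confounder: Z -> X, Z -> Y, X -> Y.
# All numbers below are hypothetical and only illustrate the ATE task.

P_Z1 = 0.4  # P(Z = 1)

# P(Y = 1 | X = x, Z = z), keyed by (x, z)
P_Y1_GIVEN_XZ = {
    (0, 0): 0.2,
    (0, 1): 0.5,
    (1, 0): 0.6,
    (1, 1): 0.9,
}


def p_z(z: int) -> float:
    """Marginal probability of the confounder Z taking value z."""
    return P_Z1 if z == 1 else 1.0 - P_Z1


def p_y1_do_x(x: int) -> float:
    """P(Y = 1 | do(X = x)) via backdoor adjustment over Z."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * p_z(z) for z in (0, 1))


def average_treatment_effect() -> float:
    """ATE = P(Y = 1 | do(X = 1)) - P(Y = 1 | do(X = 0))."""
    return p_y1_do_x(1) - p_y1_do_x(0)


if __name__ == "__main__":
    print(f"ATE = {average_treatment_effect():.2f}")
```

In this toy setup the interventional probabilities are 0.72 and 0.32, so the ATE is 0.40. A benchmark question of this type would presumably describe such a graph and its probabilities in text and ask the model for the resulting value.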
Experimental Results
Research Questions
- RQ1: Can LLMs become effective causal reasoners through post-training?
- RQ2: Which post-training method best enhances causal inference, and by how much?
- RQ3: Do post-trained LLMs generalize to rephrased questions, internalize causal theorems, and remain robust under noise or incomplete data?
- RQ4: How do small LLMs after post-training compare to larger models on causal benchmarks?
Key Findings
| LLM | ATE | CDE | ETT | NDE | NIE | PN | PS | Avg. |
|---|---|---|---|---|---|---|---|---|
| Llama-3.3-70B | 0.572 | 0.372 | 0.288 | 0.430 | 0.200 | 0.010 | 0.010 | 0.269 |
| Qwen3-235B | 0.004 | 0.000 | 0.180 | 0.230 | 0.000 | 0.000 | 0.000 | 0.059 |
| DeepSeek-R1-0528-671B | 0.740 | 0.540 | 0.220 | 0.460 | 0.450 | 0.780 | 0.800 | 0.570 |
| Gemini 2.5 Pro | 0.760 | 0.710 | 0.320 | 0.590 | 0.470 | 0.240 | 0.050 | 0.448 |
| OpenAI o3 | 0.840 | 0.590 | 0.300 | 0.430 | 0.720 | 0.450 | 0.550 | 0.554 |
| DeepSeek-R1-Distill-Qwen-14B | 0.594 | 0.364 | 0.210 | 0.442 | 0.212 | 0.014 | 0.066 | 0.272 |
| Cold Start Base | 0.634 | 0.550 | 0.156 | 0.294 | 0.434 | 0.788 | 0.714 | 0.510 |
| SFT | 0.852 | 0.828 | 0.470 | 0.560 | 0.604 | 0.858 | 0.766 | 0.702 |
| DPO | 0.656 | 0.514 | 0.198 | 0.282 | 0.510 | 0.806 | 0.708 | 0.524 |
| KTO | 0.716 | 0.674 | 0.232 | 0.412 | 0.472 | 0.812 | 0.700 | 0.574 |
| PPO | 0.972 | 0.982 | 0.806 | 0.926 | 0.924 | 0.940 | 0.902 | 0.921 |
| GRPO | 0.990 | 0.994 | 0.900 | 0.940 | 0.930 | 0.928 | 0.866 | 0.935 |
- GRPO achieves the best overall performance with 93.5% average on CaLM, surpassing DeepSeek-R1-0528-671B (57.0%) and OpenAI o3 (55.4%).
- Online RL methods (PPO and GRPO) outperform offline RL (DPO, KTO) and SFT on every one of the seven tasks and on the average.
- Compared to cold-start, SFT, DPO, KTO, PPO, and GRPO improve average CaLM accuracy by 19.2, 1.4, 6.4, 41.1, and 42.5 percentage points, respectively.
- Online RL methods show strong generalization to rephrased inputs and robustness to distribution shifts and noisy data, outperforming offline methods.
- DeepSeek-R1-0528-671B is less robust when instructions are removed, highlighting the value of online RL for causal reasoning.
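The gains over cold-start quoted above are absolute differences in average accuracy, and they can be re-derived from the Avg. column of the table. The short sketch below does that arithmetic; the dictionary simply copies the reported averages, and the function name is a hypothetical helper introduced for this illustration.

```python
# Average CaLM accuracy per method, copied from the Avg. column of the table.
AVG_ACCURACY = {
    "Cold Start Base": 0.510,
    "SFT": 0.702,
    "DPO": 0.524,
    "KTO": 0.574,
    "PPO": 0.921,
    "GRPO": 0.935,
}


def gains_over_baseline(avg: dict, baseline: str = "Cold Start Base") -> dict:
    """Absolute gain over the baseline, in percentage points (one decimal)."""
    base = avg[baseline]
    return {
        name: round((acc - base) * 100, 1)
        for name, acc in avg.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, gain in gains_over_baseline(AVG_ACCURACY).items():
        print(f"{name}: +{gain} points")
```

Reading the gains as percentage points rather than relative percentages matters here: GRPO's 42.5-point gain corresponds to a relative improvement of roughly 83% over the 0.510 cold-start average.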
This review was generated by AI and checked by a human editor.