QUICK REVIEW

[Paper Review] InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Shaojie Shi, Zhengyu Shi|arXiv (Cornell University)|Mar 16, 2026

Bayesian Modeling and Causal Inference|Citations: 0

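One-Line Summary

InterveneBench evaluates end-to-end, intervention-centered causal research design in real-world policy contexts, and the STRIDES multi-agent framework substantially improves LLMs' end-to-end social-science reasoning in those contexts.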

ABSTRACT

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at https://github.com/Sii-yuning/STRIDES.

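Motivation and Objectives

Evaluate whether LLMs can carry out end-to-end causal research design under real-world policy interventions without predefined causal graphs.
Ground causal reasoning in empirical social science studies using a structure-agnostic framework.
Examine whether a multi-agent system (STRIDES) improves open-ended intervention reasoning and the choice of identification strategy.
Provide a benchmark dataset (InterveneBench) derived from peer-reviewed studies with expert validation.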

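Proposed Method

Build InterveneBench by semi-automatically extracting causal designs from 744 empirical studies through a human-in-the-loop (HITL) verification pipeline.
Introduce Paper Interpreter, Causal Designer, and Verifier agents to generate initial designs that are checked against the source texts.
Apply the three-stage STRIDES workflow: Strategic Research Design, Data Environment Instantiation, and Code-Based Analysis with adversarial verification.
Translate designs into executable code and synthetic data to verify the identification strategy and statistical feasibility.
Compare vanilla LLMs and STRIDES-enhanced LLMs on a 45-point rubric aligned with the quality of real-world causal research.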
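
The idea of checking an identification strategy against synthetic data can be illustrated with a minimal sketch. It is not the STRIDES implementation: difference-in-differences is chosen here only as a familiar example of an identification strategy, and the function name `simulate_did`, the data-generating coefficients, and the tolerance are all assumptions made for illustration.

```python
import random


def simulate_did(n_per_cell=2000, true_effect=2.0, seed=0):
    """Simulate a 2x2 (group x period) design and return the DiD estimate.

    Hypothetical data-generating process (an assumption, not from the paper):
        y = 1.0 + 0.5*treated + 0.8*post + true_effect*treated*post + noise
    Parallel trends holds by construction, so DiD should recover true_effect.
    """
    rng = random.Random(seed)
    cell_means = {}
    for treated in (0, 1):
        for post in (0, 1):
            total = 0.0
            for _ in range(n_per_cell):
                total += (
                    1.0
                    + 0.5 * treated
                    + 0.8 * post
                    + true_effect * treated * post
                    + rng.gauss(0.0, 1.0)
                )
            cell_means[(treated, post)] = total / n_per_cell

    # Change in the treated group minus change in the control group.
    treated_change = cell_means[(1, 1)] - cell_means[(1, 0)]
    control_change = cell_means[(0, 1)] - cell_means[(0, 0)]
    return treated_change - control_change


estimate = simulate_did(true_effect=2.0)
# The design "passes" this toy check if the estimate is close to the planted effect.
design_recovers_effect = abs(estimate - 2.0) < 0.2
print(f"DiD estimate: {estimate:.3f}, recovered: {design_recovers_effect}")
```

Planting a known effect in synthetic data and confirming the estimator recovers it is one simple way to test whether a proposed design is statistically feasible before it is applied to real data.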

Figure 1: Comparison between closed-form mathematical reasoning (Panel A) and open-ended social-science causal inference (Panel B).

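Experimental Results

Research Questions

RQ1: Can LLMs produce valid, intervention-centered, end-to-end causal research designs under open, structure-agnostic conditions?
RQ2: Does the STRIDES multi-agent workflow improve performance on identification strategy, variable specification, and validity checking relative to vanilla models?
RQ3: How robust are InterveneBench designs to data availability, data constraints, and real-world policy contexts?
RQ4: What are the limits of current LLMs in integrating theoretical grounding with data-driven identification strategies?
RQ5: How much does human-in-the-loop verification affect benchmark reliability and agreement with experts?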

Key Findings

Model	Final Score	Model Type	Core Indep. Var	Group Def	Controls	Dep Var	Reasoning	Explanation	Improvement
STRIDES(GPT-5.1)	0.665	0.515	0.691	0.544	0.842	0.974	0.695	0.650	+15.1%
w/o MAS	0.578	0.493	0.656	0.535	0.652	0.658	0.530	0.517
STRIDES(Claude-3.7-Sonnet)	0.653	0.650	0.728	0.601	0.570	0.892	0.620	0.347	+20.0%
w/o MAS	0.544	0.519	0.578	0.535	0.570	0.656	0.510	0.340
STRIDES(Claude-Sonnet-4)	0.652	0.634	0.734	0.652	0.550	0.816	0.615	0.367	+21.0%
w/o MAS	0.539	0.541	0.573	0.541	0.488	0.656	0.530	0.307
STRIDES(GLM-4.6)	0.642	0.606	0.709	0.646	0.464	0.882	0.600	0.453	+20.9%
w/o MAS	0.531	0.535	0.566	0.533	0.480	0.644	0.520	0.303
STRIDES(Gemini-2.5-Pro)	0.621	0.646	0.762	0.578	0.500	0.740	0.600	0.233	+24.9%
w/o MAS	0.497	0.477	0.509	0.477	0.508	0.594	0.475	0.427
STRIDES(Gemini-3-Flash)	0.583	0.525	0.638	0.586	0.460	0.750	0.530	0.547	+25.1%
w/o MAS	0.466	0.445	0.477	0.477	0.434	0.552	0.445	0.393
STRIDES(Grok-4)	0.580	0.550	0.657	0.591	0.436	0.770	0.550	0.327	+25.0%
w/o MAS	0.464	0.424	0.514	0.445	0.462	0.572	0.420	0.340
STRIDES(GPT-4.1)	0.570	0.541	0.646	0.580	0.430	0.756	0.540	0.320	+25.0%
w/o MAS	0.456	0.445	0.498	0.430	0.458	0.572	0.445	0.243
STRIDES(Qwen3-235B-A22B)	0.569	0.538	0.638	0.544	0.534	0.712	0.525	0.373	+18.0%
w/o MAS	0.482	0.456	0.541	0.461	0.452	0.604	0.445	0.317
STRIDES(GPT-OSS-120B)	0.553	0.553	0.611	0.512	0.436	0.726	0.550	0.413	+24.8%
w/o MAS	0.443	0.435	0.440	0.403	0.466	0.572	0.415	0.377
STRIDES(Kimi-K2)	0.519	0.523	0.555	0.484	0.424	0.674	0.480	0.420	+25.1%
w/o MAS	0.415	0.361	0.445	0.414	0.438	0.508	0.360	0.340
STRIDES(DeepSeek-v3.2)	0.489	0.475	0.519	0.497	0.436	0.648	0.435	0.270	+15.1%
w/o MAS	0.425	0.373	0.458	0.392	0.492	0.530	0.375	0.353
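
The last column of the table is consistent with the relative gain in Final Score of each STRIDES row over its "w/o MAS" row. The sketch below reproduces three of the reported values under that assumption; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(with_mas, without_mas):
    """Relative gain in percent of the STRIDES score over the vanilla score."""
    return (with_mas - without_mas) / without_mas * 100.0


# (Final Score with STRIDES, Final Score without MAS), copied from the table.
final_scores = {
    "GPT-5.1": (0.665, 0.578),
    "Gemini-2.5-Pro": (0.621, 0.497),
    "Kimi-K2": (0.519, 0.415),
}

for model, (with_mas, without_mas) in final_scores.items():
    gain = relative_improvement(with_mas, without_mas)
    print(f"{model}: {gain:+.1f}%")
# GPT-5.1: +15.1%
# Gemini-2.5-Pro: +24.9%
# Kimi-K2: +25.1%
```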

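STRIDES-enhanced LLMs score higher than their vanilla counterparts on most of the 84 sub-metrics, and the final-score gains are consistent across models.
GPT-5.1 has the highest final score among the vanilla models (0.578), and STRIDES improves on every vanilla baseline, reaching the top overall score of 0.665 with GPT-5.1.
Relative final-score gains from STRIDES range from +15.1% to +25.1%, with about 25% for several models (for example Gemini-3-Flash, Grok-4, GPT-4.1, and Kimi-K2).
Cross-domain extraction shows high agreement and statistically significant Cohen's kappa on key fields such as Model Type and Core Independent Variable.
Case studies indicate that HITL adjudication is important for recovering complete experimental results from long papers, which strengthens benchmark reliability.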
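
For readers unfamiliar with the agreement statistic mentioned above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for a single categorical field. The labels are invented for illustration and are not data from the paper, which this review does not quote kappa values for.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label lists must be non-empty and equally long")
    n = len(labels_a)

    # p_o: share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: chance agreement implied by each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical "Model Type" labels from an extractor and a human expert.
extractor = ["DiD", "DiD", "RDD", "IV", "DiD", "RDD"]
expert = ["DiD", "DiD", "RDD", "DiD", "DiD", "IV"]
print(f"kappa = {cohens_kappa(extractor, expert):.3f}")  # kappa = 0.429
```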

Figure 2: Overview of our proposed system. The system has three stages. (1) Benchmark construction uses a Human-in-the-Loop MAS: a coordinator schedules a Paper Interpreter and Causal Designer to produce a draft causal design, a Verifier checks and routes low-quality designs for human review, and a

This review was generated by AI and checked by a human editor.