[Paper Review] How Much Should We Trust Instrumental Variable Estimates in Political Science? Practical Advice Based on Over 60 Replicated Studies
The paper replicates 67 IV-based studies from APSR, AJPS, and JOP (2010–2022) to assess instrument strength, inference validity, and biases, offering a practical checklist and software to improve IV practice.
Instrumental variable (IV) strategies are widely used in political science to establish causal relationships. However, the identifying assumptions required by an IV design are demanding, and it remains challenging for researchers to assess their validity. In this paper, we replicate 67 papers published in three top journals in political science during 2010-2022 and identify several troubling patterns. First, researchers often overestimate the strength of their IVs due to non-i.i.d. errors, such as a clustering structure. Second, the most commonly used t-test for the two-stage-least-squares (2SLS) estimates often severely underestimates uncertainty. Using more robust inferential methods, we find that around 19-30% of the 2SLS estimates in our sample are underpowered. Third, in the majority of the replicated studies, the 2SLS estimates are much larger than the ordinary-least-squares estimates, and their ratio is negatively correlated with the strength of the IVs in studies where the IVs are not experimentally generated, suggesting potential violations of unconfoundedness or the exclusion restriction. To help researchers avoid these pitfalls, we provide a checklist for better practice.
Motivation and Objectives
- Assess how IV designs are implemented in major political science journals from 2010–2022.
- Quantify patterns of instrument strength, inference validity, and bias in replicated studies.
- Provide a practical checklist and software to improve IV research practices.
- Highlight how weak instruments and assumption violations interact and propose remedies to bolster credible causal inference.
Proposed Method
- Systematically replicate 70 IV designs from 67 papers published in APSR, AJPS, and JOP during 2010–2022.
- Compute first-stage F-statistics under multiple SE specifications (analytic, robust, cluster-robust, bootstrap).
- Assess inference validity using AR tests, $tF$ tests, bootstrap methods, and effective F statistics for weak instruments.
- Compare 2SLS estimates with OLS estimates and relate the gap between them to first-stage strength, as a gauge of bias amplification.
- Categorize instruments into types (Experiment, Rules/Policy, Theory-based, Weather/Geography, Econometric) and analyze their prevalence and implications.
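The first-stage comparison in the second bullet can be sketched as follows. This is a minimal illustration on simulated data, not the authors' software: the function name `first_stage_f`, the data-generating process, and the small-sample adjustment are assumptions made for the example. With a single instrument, the first-stage F-statistic is the squared t-statistic on the instrument, so only the variance estimate changes across specifications.

```python
import numpy as np

def first_stage_f(d, z, clusters):
    """Return (classical F, cluster-robust F) for a single instrument z.

    Hypothetical helper for illustration: regresses d on a constant and z.
    """
    n = len(d)
    X = np.column_stack([np.ones(n), z])
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ d
    resid = d - X @ beta

    # Classical (i.i.d.) variance: sigma^2 (X'X)^{-1}
    sigma2 = resid @ resid / (n - k)
    v_classical = sigma2 * XtX_inv

    # Cluster-robust sandwich: sum the scores within each cluster first
    groups = np.unique(clusters)
    meat = np.zeros((k, k))
    for g in groups:
        idx = clusters == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    G = len(groups)
    adj = (G / (G - 1)) * ((n - 1) / (n - k))  # assumed small-sample adjustment
    v_cluster = adj * XtX_inv @ meat @ XtX_inv

    f_classical = beta[1] ** 2 / v_classical[1, 1]
    f_cluster = beta[1] ** 2 / v_cluster[1, 1]
    return f_classical, f_cluster

# Simulated data (assumption): the instrument varies only across clusters
# and the first-stage error has a shared cluster component.
rng = np.random.default_rng(0)
G, m = 50, 40
clusters = np.repeat(np.arange(G), m)
z = np.repeat(rng.normal(size=G), m)
v = np.repeat(rng.normal(size=G), m) + rng.normal(size=G * m)
d = 0.3 * z + v

f_classical, f_cluster = first_stage_f(d, z, clusters)
print(f"classical F = {f_classical:.1f}, cluster-robust F = {f_cluster:.1f}")
```

In this setup the classical F should come out much larger than the cluster-robust F, which is the overstatement of instrument strength the review describes for non-i.i.d. errors.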

Experimental Results
Research Questions
- RQ1: How often do published IV studies in top political science journals rely on strong versus weak instruments when evaluated with robust inference procedures?
- RQ2: Do 2SLS estimates systematically differ in magnitude from OLS estimates, and how does this relate to instrument strength?
- RQ3: What are the common types of instruments used, and how might these choices affect the validity of causal claims?
- RQ4: What practical steps (inference methods, diagnostics) improve the reliability of IV estimates in political science?
- RQ5: How replicable are IV findings given data/code availability and documentation?
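For the weak-instrument-robust inference that RQ1 and RQ4 ask about, the Anderson-Rubin (AR) test is one of the procedures the review names. Below is a minimal sketch for one endogenous regressor and one instrument, assuming a heteroskedasticity-robust (HC0) variance and simulated data. `ar_stat` and `ar_confidence_set` are hypothetical helper names, and a clustered design would need a cluster-robust variance instead. The idea: under the null that the effect equals `beta0`, the instrument should not predict `y - beta0 * d`.

```python
import numpy as np

CHI2_1_95 = 3.841458820694124  # 95% quantile of the chi-squared(1) distribution

def ar_stat(y, d, z, beta0):
    """AR statistic for H0: beta = beta0 (single instrument, HC0 variance)."""
    n = len(y)
    e = y - beta0 * d                       # outcome with the null effect removed
    Z = np.column_stack([np.ones(n), z])
    coef, *_ = np.linalg.lstsq(Z, e, rcond=None)
    resid = e - Z @ coef
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    meat = (Z * resid[:, None] ** 2).T @ Z  # HC0 "meat" of the sandwich
    V = ZtZ_inv @ meat @ ZtZ_inv
    return coef[1] ** 2 / V[1, 1]           # squared robust t-stat on z

def ar_confidence_set(y, d, z, grid):
    """Grid values of beta0 that the AR test does not reject at the 5% level."""
    return np.array([b for b in grid if ar_stat(y, d, z, b) <= CHI2_1_95])

# Simulated data (assumption): u confounds d and y, z is a valid instrument.
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 1.0 * z + u + rng.normal(size=n)
y = 1.0 * d + u + rng.normal(size=n)

# Just-identified IV (Wald) estimate: reduced form over first stage
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
grid = np.linspace(beta_iv - 1.0, beta_iv + 1.0, 201)
ar_set = ar_confidence_set(y, d, z, grid)
print(f"IV estimate = {beta_iv:.3f}, AR set = [{ar_set.min():.3f}, {ar_set.max():.3f}]")
```

Because the statistic never divides by the first-stage coefficient, the test keeps its nominal size when the instrument is weak. In that case the confidence set simply becomes wide, and it may be unbounded, which a finite grid like this one cannot show.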
Key Findings
- Among 70 IV designs, 11% rely on weak instruments when using effective F-statistics.
- Analytic SEs and conventional t-tests often understate the uncertainty in IV estimates.
- In 17–35% of designs, the null of no effect cannot be rejected at the 5% level under AR, bootstrap, or $tF$-based tests, compared with 10% under the originally reported SEs or p-values.
- In 68 of 70 designs (97%), the 2SLS estimates are larger in magnitude than the naïve OLS estimates, with 24 designs (34%) at least five times larger.
- The ratio of 2SLS to OLS magnitudes is strongly negatively correlated with first-stage strength in non-experimental designs, suggesting that weak instruments amplify bias from possible violations of unconfoundedness or the exclusion restriction.
- Most designs use theory-based instruments (≈60%), followed by weather/geography and rules/policy instruments; experimentally generated instruments account for 17.1%.
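The negative relationship between the 2SLS/OLS ratio and first-stage strength can be illustrated with a small simulation. This is a sketch under assumed parameters, not a reproduction of the replicated studies: the instrument `z` has a small direct effect `gamma` on the outcome (an exclusion-restriction violation), so the just-identified 2SLS estimate converges to `beta + gamma / pi`, which grows as the first-stage coefficient `pi` shrinks. The function name `ratio_2sls_to_ols` and all parameter values are hypothetical.

```python
import numpy as np

def ratio_2sls_to_ols(pi, gamma=0.2, beta=1.0, n=200_000, seed=0):
    """Simulate one design and return (2SLS estimate, OLS estimate, |2SLS/OLS|)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)                          # unobserved confounder
    d = pi * z + u + rng.normal(size=n)             # first stage
    y = beta * d + gamma * z + u + rng.normal(size=n)  # z enters y directly

    # Just-identified 2SLS (Wald) estimate: cov(z, y) / cov(z, d)
    b_2sls = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
    # OLS slope of y on d
    b_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
    return b_2sls, b_ols, abs(b_2sls / b_ols)

# Weaker first stage (smaller pi) with the same direct effect gamma
first_stage_coefs = [0.05, 0.2, 1.0]
ratios = []
for pi in first_stage_coefs:
    b_2sls, b_ols, ratio = ratio_2sls_to_ols(pi)
    ratios.append(ratio)
    print(f"pi = {pi:4.2f}: 2SLS = {b_2sls:.2f}, OLS = {b_ols:.2f}, ratio = {ratio:.2f}")
```

The same small violation produces a much larger 2SLS estimate when the first stage is weak, which is one mechanism consistent with the pattern reported above. The simulation does not establish that this mechanism drives any particular replicated study.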

This review was generated by AI and checked by a human editor.