QUICK REVIEW

[Paper Review] Integration of Individual Participant and Aggregate Data Under Dataset Shift: Summary Statistic Comparison and Scalable Computation

Ming‐Yueh Huang, Jing Qin|arXiv (Cornell University)|Mar 2, 2026

Advanced Causal Inference Techniques | Cited by 0

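One-Sentence Summary

The paper compares how different aggregate-data summaries affect the efficiency of IPD–AD integration under dataset shift, and introduces a fast, non-iterative CMLE algorithm for scalable analysis.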

ABSTRACT

Integrated IPD-AD analysis, which combines individual participant data (IPD) with aggregate data (AD), is increasingly recognized as an effective strategy for generating more reliable and generalizable inferences from heterogeneous studies. While most existing work has focused on algorithmic approaches, this paper investigates a complementary yet underexplored question: how different forms of AD influence the efficiency of data integration. Working within a constrained maximum likelihood estimation framework, we compare commonly reported summary statistics and show that subgroup-specific summaries can substantially improve estimation efficiency. In particular, we find that AD derived from outcome-stratified subgroups (e.g., cases and controls) consistently yield greater efficiency gains than those based on covariate-stratified subgroups (e.g., age or exposure categories), especially when the outcome is continuous. Although outcome-stratified summaries are commonly reported for discrete outcomes, they are rarely provided when the outcome is continuous. Our findings therefore support the routine inclusion of outcome-stratified summaries for continuous endpoints in trial reports and public data repositories to facilitate more efficient evidence synthesis. We further extend the constrained maximum likelihood framework to accommodate dataset shift and develop a fast, non-iterative estimation procedure to improve numerical stability and scalability. We illustrate the proposed methodology with two applications: an analysis of income data under covariate shift and an analysis of housing data under prior probability shift.

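Motivation and Goals

Promote integrated IPD–AD analysis that makes full use of the IPD while also exploiting whatever AD are available.
Assess how different forms of AD affect estimation efficiency within a constrained maximum likelihood framework.
Extend CMLE to accommodate dataset shift (covariate shift and prior probability shift) and quantify the resulting efficiency gains.
Develop a fast, non-iterative algorithm that improves numerical stability and scalability for high-dimensional data integration tasks.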

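Proposed Method

Combine IPD and AD through constrained maximum likelihood estimation (CMLE) under unbiased population estimating equations.
Represent the AD as parameter estimates obtained from estimating equations, and impose the corresponding constraints in the CMLE objective.
Where necessary, augment the IPD likelihood with a normal-approximation term for the AD so that uncertainty in the AD is accounted for.
Model dataset shift with a biased-sampling density ratio framework that links IPD and AD through shifts in the covariates and the outcome.
Derive and use a fast, non-iterative algorithm that obtains the CMLE in a single step, improving stability and scalability.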
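
For orientation, a constrained likelihood of this kind is often written as below. This is a generic sketch for the no-shift case rather than the paper's own notation: $f(y \mid x;\beta)$ (the IPD outcome model), $p_i$ (the probability mass placed on the $i$-th IPD covariate value), $h$ (the estimating function that defines the AD), and $\widetilde{\phi}$ (the reported AD value) are all symbols assumed here for illustration.

```latex
\max_{\beta,\,p_1,\dots,p_n}\;
  \sum_{i=1}^{n} \log f(y_i \mid x_i;\beta) + \sum_{i=1}^{n} \log p_i
\quad \text{subject to} \quad
  p_i \ge 0,\qquad \sum_{i=1}^{n} p_i = 1,\qquad
  \sum_{i=1}^{n} p_i\, g(x_i;\beta,\widetilde{\phi}) = 0,
\qquad \text{where}\quad
  g(x;\beta,\phi) = \int h(y, x;\phi)\, f(y \mid x;\beta)\, dy .
```

The constraint says that the IPD model, averaged over the covariate distribution, must reproduce the reported summary. Under dataset shift, one would expect the constraint to be reweighted by a density ratio between the two populations.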

Figure 1: The biases (top panel) and relative efficiencies (bottom panel) of the constrained maximum likelihood estimator for $\beta_{00}$ (left), $\beta_{01}$ (center), and $\beta_{02}$ (right), with various AD: $\widetilde{\boldsymbol{\phi}}^{Y}$ (solid line with $\circ$ ), $\widetilde{\boldsymbol

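Experimental Results

Research Questions

RQ1: How do different forms of aggregate data (marginal means, covariate-stratified summaries, outcome-stratified summaries) affect the efficiency of IPD–AD integration?
RQ2: Do outcome-stratified summaries provide systematic efficiency gains over other forms of AD, particularly for continuous outcomes?
RQ3: How can CMLE be extended to handle covariate shift and prior probability shift in IPD–AD integration?
RQ4: Under dataset shift, can a fast, non-iterative estimation procedure deliver a stable and scalable CMLE?
RQ5: What are the practical implications of reporting outcome-stratified summaries in trials and data repositories?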
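
To make the three forms of AD concrete, the sketch below computes each of them from a simulated dataset. The function names, the single-threshold stratification, and the simulated linear model are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def marginal_summary(x, y):
    """Marginal means of the outcome and of each covariate."""
    return {"mean_y": float(np.mean(y)), "mean_x": np.mean(x, axis=0)}

def covariate_stratified_summary(x, y, col, cut):
    """Mean outcome within subgroups defined by one covariate
    (x[:, col] <= cut versus x[:, col] > cut)."""
    low = x[:, col] <= cut
    return {"mean_y_low": float(np.mean(y[low])),
            "mean_y_high": float(np.mean(y[~low]))}

def outcome_stratified_summary(x, y, cut):
    """Covariate means within subgroups defined by the outcome
    (y <= cut versus y > cut). For a continuous outcome the threshold
    plays the role that case/control status plays for a binary one."""
    low = y <= cut
    return {"mean_x_low": np.mean(x[low], axis=0),
            "mean_x_high": np.mean(x[~low], axis=0)}

# Hypothetical data: two covariates and a continuous outcome.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = 1.0 + 0.5 * x[:, 0] - 0.8 * x[:, 1] + rng.normal(size=500)

print(marginal_summary(x, y))
print(covariate_stratified_summary(x, y, col=0, cut=0.0))
print(outcome_stratified_summary(x, y, cut=float(np.median(y))))
```

Each function returns only a handful of numbers, which is the kind of summary a trial report or data repository could publish without releasing individual records.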

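Key Findings

Outcome-stratified covariate summaries substantially improve estimation efficiency compared with marginal means or covariate-stratified summaries.
The efficiency gains are more pronounced for continuous outcomes and when the AD carry outcome-related information.
CMLE handles dataset shift through a density-ratio-based link between IPD and AD, accommodating both covariate shift and prior probability shift.
A fast, non-iterative algorithm obtains the CMLE in a single step, improving numerical stability and scalability in high-dimensional settings.
Uncertainty in the AD is appropriately quantified; the asymptotic theory characterizes the estimator as N/n → κ ∈ (0, ∞).
Applications to income data under covariate shift and to housing data under prior probability shift demonstrate the framework's practical feasibility.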
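
One simple way to see what a density-ratio link does under covariate shift is exponential tilting: reweight the IPD so that its weighted covariate mean matches a covariate mean reported for the AD population. The sketch below is not the paper's estimator. The log-linear form of the density ratio, the function name `exponential_tilt_weights`, and the target means are assumptions made for illustration.

```python
import numpy as np

def exponential_tilt_weights(x, target_mean, n_iter=50, tol=1e-8):
    """Find gamma so that weights w_i proportional to exp(gamma' x_i)
    give a weighted mean of x equal to target_mean.

    x must have shape (n, d). Uses damped Newton steps on the convex
    function F(gamma) = log mean_i exp(gamma' (x_i - target_mean)),
    whose gradient is the weighted mean of the centered covariates.
    """
    z = np.asarray(x, dtype=float) - np.asarray(target_mean, dtype=float)
    gamma = np.zeros(z.shape[1])

    def objective(g):
        s = z @ g
        m = s.max()                      # subtract the max for stability
        return m + np.log(np.mean(np.exp(s - m)))

    for _ in range(n_iter):
        s = z @ gamma
        w = np.exp(s - s.max())
        w /= w.sum()
        grad = w @ z                     # weighted mean of centered x
        if np.linalg.norm(grad) < tol:
            break
        zc = z - grad
        hess = (zc * w[:, None]).T @ zc  # weighted covariance of x
        step = np.linalg.solve(hess, grad)
        # Backtracking line search (Armijo condition).
        t, f0 = 1.0, objective(gamma)
        while (objective(gamma - t * step) > f0 - 1e-4 * t * (grad @ step)
               and t > 1e-8):
            t *= 0.5
        gamma = gamma - t * step

    s = z @ gamma
    w = np.exp(s - s.max())
    return gamma, w / w.sum()

# Hypothetical IPD and a hypothetical covariate mean from the AD population.
rng = np.random.default_rng(1)
x = rng.normal(size=(400, 2))
y = 1.0 + 0.5 * x[:, 0] - 0.8 * x[:, 1] + rng.normal(size=400)
target = np.array([0.3, -0.2])

gamma, w = exponential_tilt_weights(x, target)
print("tilt parameter:", gamma)
print("weighted covariate mean:", w @ x)
print("outcome mean reweighted to the AD population:", w @ y)
```

Under this assumed tilt, `w @ y` estimates the outcome mean in the shifted population only if the conditional distribution of the outcome given the covariates is the same in both populations, which is what covariate shift means. Prior probability shift would instead call for weights that depend on the outcome.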

Figure 2: The relative efficiencies of the constrained maximum likelihood estimator for $\beta_{00}$ (top row), $\beta_{01}$ (center row), and $\beta_{02}$ (bottom row) under IPD sample sizes $n=100$ (left column), $n=200$ (center column), and $n=400$ (right column), with various AD: $\widetilde{\bo

This review was generated by AI and checked by a human editor.