QUICK REVIEW

[Paper Review] Close Enough? A Large-Scale Exploration of Non-Experimental Approaches to Advertising Measurement

Brett R. Gordon, Robert Moakler|arXiv (Cornell University)|Jan 18, 2022

Advanced Causal Inference Techniques | Cited by 25

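One-Sentence Summary

This paper evaluates two non-experimental causal methods (DML and SPSM) across 663 Facebook ad experiments to determine whether they can reliably recover ad-induced lift; neither method fully succeeds, and although DML performs better, it remains biased.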

ABSTRACT

Despite their popularity, randomized controlled trials (RCTs) are not always available for the purposes of advertising measurement. Non-experimental data is thus required. However, Facebook and other ad platforms use complex and evolving processes to select ads for users. Therefore, successful non-experimental approaches need to "undo" this selection. We analyze 663 large-scale experiments at Facebook to investigate whether this is possible with the data typically logged at large ad platforms. With access to over 5,000 user-level features, these data are richer than what most advertisers or their measurement partners can access. We investigate how accurately two non-experimental methods -- double/debiased machine learning (DML) and stratified propensity score matching (SPSM) -- can recover the experimental effects. Although DML performs better than SPSM, neither method performs well, even using flexible deep learning models to implement the propensity and outcome models. The median RCT lifts are 29%, 18%, and 5% for the upper, middle, and lower funnel outcomes, respectively. Using DML (SPSM), the median lift by funnel is 83% (173%), 58% (176%), and 24% (64%), respectively, indicating significant relative measurement errors. We further characterize the circumstances under which each method performs comparatively better. Overall, despite having access to large-scale experiments and rich user-level data, we are unable to reliably estimate an ad campaign's causal effect.

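Research Motivation and Goals

Assess whether non-experimental data from a large ad platform can recover causal ad effects without running a randomized controlled trial (RCT).
Compare double/debiased machine learning (DML) and stratified propensity score matching (SPSM) in this setting.
Characterize the conditions under which each method performs comparatively better or worse.
Discuss the data and platform limitations that hinder reliable causal estimation for online advertising.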

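Proposed Method

Apply double/debiased machine learning (DML) to estimate causal effects, using a rich feature set and orthogonalized estimation with cross-fitting to reduce the bias introduced by regularized machine-learning models.
Evaluate stratified propensity score matching (SPSM) with propensity scores from a deep-learning model.
Use an extensive set of campaign-level and user-level features in an attempt to satisfy the unconfoundedness assumption.
Benchmark the non-experimental estimates against RCT results from 663 Facebook ad experiments with large-scale user exposure data.
Report median lifts by funnel stage and the comparative bias of DML and SPSM.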
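
To make the two estimators concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data-generating process, the single observed confounder `x`, the function names, and the use of per-cell means in place of the deep learning models described in the abstract. The DML part uses the partially linear (residual-on-residual) variant with two-fold cross-fitting, which may differ from the exact estimator in the paper.

```python
import random
from collections import defaultdict
from statistics import mean

def simulate(n=20000, tau=0.5, seed=0):
    """Toy data (hypothetical): x drives both ad exposure d and outcome y."""
    rng = random.Random(seed)
    propensity = [0.1, 0.3, 0.5, 0.7]          # P(d = 1 | x), assumed
    rows = []
    for _ in range(n):
        x = rng.randrange(4)                    # observed confounder
        d = 1 if rng.random() < propensity[x] else 0
        y = 1.0 * x + tau * d + rng.gauss(0.0, 1.0)
        rows.append((x, d, y))
    return rows

def cell_means(rows):
    """Nuisance estimates per value of x: (E[d | x], E[y | x])."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for x, d, y in rows:
        s = sums[x]
        s[0] += d
        s[1] += y
        s[2] += 1
    return {x: (s[0] / s[2], s[1] / s[2]) for x, s in sums.items()}

def naive_diff(rows):
    """Exposed minus unexposed mean outcome, ignoring selection."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return mean(treated) - mean(control)

def dml_plr(rows, k=2):
    """Partially linear DML: residualize d and y with cross-fitting."""
    folds = [rows[i::k] for i in range(k)]
    num = den = 0.0
    for i, held_out in enumerate(folds):
        # Fit nuisance functions only on the other folds.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        nuisance = cell_means(train)
        for x, d, y in held_out:
            m_hat, l_hat = nuisance[x]
            res_d, res_y = d - m_hat, y - l_hat
            num += res_d * res_y
            den += res_d * res_d
    return num / den

def spsm_att(rows, n_strata=5):
    """Stratify on the estimated propensity score, then average the
    within-stratum differences weighted by the number of exposed users."""
    e_hat = {x: m for x, (m, _) in cell_means(rows).items()}
    strata = defaultdict(lambda: {0: [], 1: []})
    for x, d, y in rows:
        s = min(int(e_hat[x] * n_strata), n_strata - 1)
        strata[s][d].append(y)
    total_treated = sum(len(g[1]) for g in strata.values())
    att = 0.0
    for g in strata.values():
        if g[0] and g[1]:                       # skip strata without overlap
            att += (mean(g[1]) - mean(g[0])) * len(g[1]) / total_treated
    return att

rows = simulate()
naive_estimate = naive_diff(rows)
dml_estimate = dml_plr(rows)
spsm_estimate = spsm_att(rows)
print(f"true effect 0.5 | naive {naive_estimate:.2f} | "
      f"DML {dml_estimate:.2f} | SPSM {spsm_estimate:.2f}")
```

In this toy setting both methods land close to the true effect because the only confounder is observed. The paper's finding is that, on real platform data, the selection performed by ad delivery is not fully captured even by thousands of logged features, so estimates of this kind remain biased.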

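Experimental Results

Research Questions

RQ1: On data logged by the platform, can non-experimental methods sufficiently undo ad-delivery selection to recover causal effects?
RQ2: Across large-scale Facebook experiments, how do DML and SPSM perform relative to randomized controlled trials?
RQ3: Under which experimental conditions (funnel stage, campaign type) do these methods perform better or worse?
RQ4: What improvements in data and logging are needed for reliable non-experimental ad measurement?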

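Key Findings

Despite rich features and flexible modeling, SPSM performs poorly relative to the RCT benchmark.
On average, DML has a smaller upward bias than SPSM, but the remaining bias is still substantial.
Median RCT lift by funnel stage: 29% (upper), 18% (middle), 5% (lower).
Median lift by funnel stage using DML (SPSM): 83% (173%) upper, 58% (176%) middle, 24% (64%) lower, indicating large relative measurement errors.
Prospecting campaigns and smaller baseline conversion rates tend to yield comparatively better non-experimental estimates.
Larger sample sizes, a higher exposure share in the test group, and better propensity model performance improve the non-experimental estimates, but a gap remains.
Overall, non-experimental methods based on the available data cannot reliably estimate causal ad effects; exogenous variation from RCTs is needed.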
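
As a rough gauge of the size of these errors, the reported medians can be compared directly. The sketch below divides each method's median lift by the RCT median lift for the same funnel stage. The variable and function names are illustrative, and the ratio of medians is only an approximation: it is not the same as the median of per-experiment errors, which this summary does not report.

```python
# Median lifts in percent, copied from the figures reported above.
rct = {"upper": 29, "middle": 18, "lower": 5}
dml = {"upper": 83, "middle": 58, "lower": 24}
spsm = {"upper": 173, "middle": 176, "lower": 64}

def ratio_to_rct(method):
    """How many times larger the method's median lift is than the RCT's."""
    return {stage: round(method[stage] / rct[stage], 2) for stage in rct}

dml_ratio = ratio_to_rct(dml)
spsm_ratio = ratio_to_rct(spsm)

for stage in rct:
    print(f"{stage:>6}: DML {dml_ratio[stage]}x, SPSM {spsm_ratio[stage]}x")
```

By this measure the overstatement grows toward the lower funnel, where the RCT lift is smallest: DML goes from roughly 2.9 times the RCT median (upper) to 4.8 times (lower), and SPSM from roughly 6 times to 12.8 times.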

This review was generated by AI and reviewed by a human editor.