[论文解读] Doubly Robust Policy Evaluation and Learning
本文提出了一种在上下文Bandit设置中用于策略评估与学习的双重稳健(DR)方法,结合奖励建模与逆倾向得分,当任一模型准确时均可实现无偏估计。该方法降低了方差并提高了现有技术的准确性,实证结果表明在价值估计中平均RMSE降低了13.6%,且在策略优化方面表现更优。
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
研究动机与目标
- 解决历史数据无法反映新策略动作分布时,上下文Bandit中准确策略评估的挑战。
- 克服直接方法(若奖励模型不佳则偏差高)与逆倾向得分方法(若行为策略模型不佳则方差高)的局限性。
- 构建统一框架,确保当奖励模型或行为策略模型任一准确时,均可实现无偏估计。
- 证明双重稳健方法在估计准确性与策略优化方面均优于现有方法。
提出的方法
- 将双重稳健估计技术应用于上下文Bandit策略评估,结合奖励模型与行为策略模型。
- 通过逆倾向得分与奖励模型预测的加权组合,形成当任一组分正确时均无偏的估计器。
- 将双重稳健估计器表述为:$\hat{V}_{\text{DR}} = \sum_i \frac{\mathbf{1}(a_i = a) \cdot r_i}{\hat{e}(a|x_i)} + \hat{\varrho}(x_i) \cdot \left(1 - \frac{\mathbf{1}(a_i = a)}{\hat{e}(a|x_i)} \right)$,其中$\hat{e}$为估计的行为策略,$\hat{\varrho}$为奖励模型。
- 通过在策略权重上进行梯度更新,采用直接损失最小化来优化策略,结合DR估计指导学习。
- 使用岭回归训练奖励模型$\hat{\varrho}(x)$,并通过逻辑回归或类似方法估计行为策略概率$\hat{e}(a|x)$。
- 在合成基准与Yahoo! News的大规模真实世界数据集上评估性能,与IPS和直接方法进行比较。
实验结果
研究问题
- RQ1当奖励模型或行为策略模型任一不准确时,双重稳健估计器是否能改善上下文Bandit中的策略价值估计?
- RQ2与逆倾向得分法和直接方法相比,双重稳健方法在偏差、方差与估计准确性方面表现如何?
- RQ3在实践中,使用双重稳健估计器是否能带来更好的策略优化?
- RQ4在非渐近设置下,模型质量对双重稳健估计器性能的影响如何?
主要发现
- 与逆倾向得分法相比,双重稳健估计器始终降低估计误差,在实验中平均RMSE降低13.6%。
- DR估计器的方差低于IPS,尤其在小样本数据集下更为显著,加速收敛至真实策略价值。
- 即使在奖励模型或行为策略模型之一被错误指定时,该方法仍保持低偏差,展现出强鲁棒性。
- 实证结果表明,基于DR的策略学习在策略优化方面优于IPS与直接方法,生成的策略性能更优。
- 在Yahoo! News大规模真实世界数据集上的实验中,DR在价值估计准确性方面取得显著提升,尤其在低数据场景下表现突出。
- 理论分析表明,DR估计器的偏差与方差取决于两个模型对真实值的偏离程度,为理解其性能提供了严谨依据。
更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。