QUICK REVIEW

[Paper Review] Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

Lars Buesing, Théophane Weber|arXiv (Cornell University)|Nov 15, 2018

Advanced Bandit Algorithms Research|References: 23|Cited by: 41

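One-Sentence Summary

CF-GPS learns policies from off-policy data by counterfactually evaluating alternative outcomes in a structural causal model, which reduces model bias and improves policy evaluation and search.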

ABSTRACT

Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradient can be interpreted as counterfactual methods.

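Research Motivation and Objectives

Motivate the use of counterfactual reasoning in reinforcement learning to mitigate the model bias that arises from purely synthetic data.
Formalize model-based reinforcement learning in partially observable Markov decision processes (POMDPs) using structural causal models.
Introduce counterfactual policy evaluation and the CF-GPS algorithm for off-policy policy learning.
Show the connections between CF-GPS and existing reinforcement learning methods such as Guided Policy Search (GPS) and Stochastic Value Gradient (SVG).
Demonstrate empirical gains on a partially observable, Sokoban-like task.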

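Proposed Method

Represent the POMDP environment as a structural causal model (SCM) with independent scenarios (noise variables) and deterministic causal mechanisms.
Define counterfactual inference in the SCM by inferring the noise variables from observed data and then performing an intervention to answer a do-query.
Propose CF-PE: counterfactual off-policy evaluation, which evaluates a policy on off-policy data using scenarios inferred from the posterior and yields unbiased estimates when there is no model mismatch.
Propose CF-GPS: counterfactually-guided policy search, which improves the policy by anchoring model rollouts to the counterfactual distribution derived from off-policy data.
Show that CF-GPS generalizes Guided Policy Search (GPS) and, as a counterfactual method, is related to Stochastic Value Gradient (SVG).
Provide an experimental setup on PO-SOKOBAN to compare CF-GPS against MB-PS and a GPS-like baseline.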
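
The counterfactual query described above (infer the scenario from logged data, intervene on the action, predict the outcome) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not taken from the paper: the three-door toy environment, the uniform prior, and the helper names `abduct` and `policy_value`. Exact fractions are used so that no sampling noise is involved.

```python
from fractions import Fraction

# Hypothetical toy SCM (not the paper's PO-SOKOBAN task): a hidden scenario u,
# drawn once per episode, says which of three doors hides the reward.
SCENARIOS = (0, 1, 2)
PRIOR = {u: Fraction(1, 3) for u in SCENARIOS}

def reward(u, action):
    """Deterministic causal mechanism: reward is 1 iff the action matches u."""
    return 1 if action == u else 0

def abduct(action, observed_reward):
    """Abduction: posterior over u given one logged (action, reward) pair."""
    weights = {
        u: PRIOR[u] if reward(u, action) == observed_reward else Fraction(0)
        for u in SCENARIOS
    }
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

def policy_value(scenario_dist, policy):
    """Intervention + prediction: replace the logged action by `policy`
    (a dict action -> probability) and average the reward over scenarios."""
    return sum(
        p_u * p_a * reward(u, a)
        for u, p_u in scenario_dist.items()
        for a, p_a in policy.items()
    )

# Logged episode: the behavior policy opened door 0 and found nothing.
posterior = abduct(action=0, observed_reward=0)

always_door_1 = {1: Fraction(1)}
prior_value = policy_value(PRIOR, always_door_1)              # de novo model rollout
counterfactual_value = policy_value(posterior, always_door_1) # anchored to the log
print(prior_value, counterfactual_value)  # prints: 1/3 1/2
```

In this toy case the logged failure at door 0 rules out one scenario, so the counterfactual estimate for "always open door 1" on this episode (1/2) differs from the estimate obtained by sampling the scenario from the prior (1/3).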

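Experimental Results

Research Questions

RQ1: Can counterfactual reasoning in structural causal models reduce bias when learning from off-policy data in reinforcement learning?
RQ2: How does counterfactual policy evaluation (CF-PE) compare with standard model-based off-policy evaluation in terms of bias and accuracy?
RQ3: On a non-trivial partially observable task, can CF-GPS outperform vanilla model-based policy search and GPS-like methods in policy search performance?
RQ4: What theoretical and empirical connections exist between CF-GPS, GPS, and SVG?
RQ5: Under what conditions does CF-GPS outperform traditional model-based methods when real logged data is used?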

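Key Findings

CF-GPS improves policy evaluation and search by anchoring model-based predictions to scenarios inferred from off-policy data.
Counterfactual evaluation yields unbiased estimates of a policy's value under intervention, provided there is no model mismatch.
On PO-SOKOBAN, CF-GPS outperforms MB-PS and a GPS-like baseline in both policy evaluation and policy search.
Anchoring rollouts to the counterfactual distribution helps mitigate model mismatch and makes better use of logged data.
The established connections show that GPS corresponds to a counterfactual version of MB-PS in fully observed MDPs, and that SVG can be viewed as a counterfactual method.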
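
The findings on unbiasedness and on mitigating model mismatch can be checked by hand in a small example. The sketch below is a hypothetical three-door toy problem, not the paper's PO-SOKOBAN experiment; the priors, the uniform behavior policy, and the function names are all assumptions. It computes the exact expectation of the counterfactual estimate over logged episodes, once with a correct model prior and once with a mismatched one.

```python
from fractions import Fraction
from itertools import product

ACTIONS = SCENARIOS = (0, 1, 2)

def reward(u, a):
    # Deterministic mechanism, assumed to be known exactly by the model.
    return 1 if a == u else 0

def posterior(prior, a, r):
    # Abduction: condition the model's prior over u on the logged (action, reward).
    w = {u: prior[u] if reward(u, a) == r else Fraction(0) for u in SCENARIOS}
    z = sum(w.values())
    return {u: p / z for u, p in w.items()}

def value(dist_u, target_action):
    # Expected reward of the deterministic policy "always play target_action".
    return sum(p * reward(u, target_action) for u, p in dist_u.items())

def cf_pe(model_prior, true_prior, target_action):
    # Exact expectation of the counterfactual estimate over logged episodes
    # generated by the true environment and a uniform behavior policy.
    total = Fraction(0)
    for u, a in product(SCENARIOS, ACTIONS):
        p_episode = true_prior[u] * Fraction(1, len(ACTIONS))
        r = reward(u, a)
        total += p_episode * value(posterior(model_prior, a, r), target_action)
    return total

uniform = {u: Fraction(1, 3) for u in SCENARIOS}
skewed = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}

# No mismatch: the counterfactual estimate matches the true value in expectation.
assert cf_pe(uniform, uniform, 0) == value(uniform, 0)

# Mismatch: the model believes `uniform`, the environment follows `skewed`.
true_value = value(skewed, 0)         # 1/2
mb_value = value(uniform, 0)          # 1/3, model-only estimate
cf_value = cf_pe(uniform, skewed, 0)  # 5/12, anchored to logged data
print(true_value, mb_value, cf_value)  # prints: 1/2 1/3 5/12
```

In this toy setting the mismatched model alone is off by 1/6, while the counterfactual estimate is off by 1/12: the logged data corrects part of the error in the prior but not all of it, because a single (action, reward) pair only partially reveals the scenario.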

This review was generated by AI and reviewed by a human editor.