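[Paper Summary] Off-Policy Evaluation via Off-Policy Classification
This paper proposes a new off-policy evaluation (OPE) method for deep reinforcement learning in continuous control with sparse binary rewards. It frames OPE as a positive-unlabeled (PU) classification problem, which avoids any reliance on an environment model or on importance sampling. The resulting metric predicts the relative performance of policies well, most notably for sim-to-real transfer in an image-based robotic manipulation task.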
In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE of value-based methods, which are of particular interest in deep RL with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Furthermore, existing OPE metrics either rely on a model of the environment, or the use of importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real-world of policies trained in simulation for an image-based robotic manipulation task.
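Motivation and Goals
- Address the high cost and impracticality of on-policy evaluation in real-world deep RL applications.
- Develop a reliable off-policy evaluation metric for value-based methods in continuous action spaces with sparse binary rewards.
- Remove the dependence on environment models and importance sampling, which become unstable or infeasible with high-dimensional observations such as images.
- Enable model selection, hyperparameter tuning, and early stopping from previously collected off-policy data, reducing costly trial and error before real-world deployment.
- Improve the ability to predict how well policies generalize, in particular when policies trained in simulation are transferred to real-world robotic manipulation.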
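Proposed Method
- Frames off-policy evaluation as a positive-unlabeled (PU) classification problem: state-action pairs from successful trajectories are treated as positive examples, and all other state-action pairs are treated as unlabeled.
- Uses the learned Q-function itself as the classifier over state-action pairs, so a pair is predicted positive when its Q-value is high and no separate environment model has to be fit.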
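- Also considers a soft variant of the score that replaces hard classification decisions with the Q-values themselves, comparing the average Q-value on positive pairs with the average Q-value over all pairs.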
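- Ranks policies relative to one another using only the classification-based score, avoiding both importance sampling and environment model fitting.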
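- Computes the score on a fixed dataset of off-policy trajectories collected by behavior policies; the only supervision is the sparse binary success label of each trajectory.
- The final OPE score measures how well the Q-function separates positive pairs from the unlabeled pool and serves as a proxy for expected return when comparing policies.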
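The scoring idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `soft_pu_score`, the toy Q-values, and the positive-class prior `p_positive` (treated here as a user-supplied hyperparameter) are all hypothetical. The score used is a soft PU-style quantity, `p_positive * mean(Q on positives) - mean(Q on all pairs)`, which rewards Q-functions that assign high values to pairs from successful trajectories without inflating values everywhere.

```python
from statistics import fmean


def soft_pu_score(q_values, from_success, p_positive):
    """Soft PU-style score for one Q-function on a logged dataset.

    q_values:     Q(s, a) for every logged state-action pair.
    from_success: True if the pair came from a successful trajectory
                  (a positive example); False means unlabeled.
    p_positive:   assumed prior probability of the positive class
                  (hypothetical hyperparameter in this sketch).
    """
    if len(q_values) != len(from_success):
        raise ValueError("q_values and from_success must have equal length")
    positives = [q for q, ok in zip(q_values, from_success) if ok]
    if not positives:
        raise ValueError("need at least one positive state-action pair")
    # High Q on positives raises the score; high Q everywhere lowers it.
    return p_positive * fmean(positives) - fmean(q_values)


# Toy logged data: six state-action pairs, three from successful episodes.
success_flags = [True, True, False, False, False, True]

# A Q-function that separates positives from the rest ...
q_discriminative = [0.9, 0.8, 0.2, 0.1, 0.3, 0.7]
# ... and one that outputs the same value for every pair.
q_constant = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

score_discriminative = soft_pu_score(q_discriminative, success_flags, p_positive=0.5)
score_constant = soft_pu_score(q_constant, success_flags, p_positive=0.5)

print(score_discriminative > score_constant)
```

Only the ordering of the scores matters for model selection, so the absolute values (which depend on the assumed prior) should not be read as success probabilities.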
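Experimental Results
Research Questions
- RQ1: Can a PU-classification-based metric reliably estimate the relative performance of policies without using an environment model or importance sampling?
- RQ2: How well does the method generalize to sim-to-real transfer in an image-based robotic manipulation task?
- RQ3: In high-dimensional continuous control settings, does the metric correlate better with true on-policy performance than existing OPE baselines?
- RQ4: How robust is the method to distribution shift and sparse binary rewards in continuous action spaces?
- RQ5: Can the method effectively support hyperparameter tuning and early stopping in a real-world RL deployment pipeline?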
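Key Findings
- The proposed PU-based OPE metric correlates more strongly with true on-policy performance than the baseline metrics, including in high-dimensional observation settings such as image inputs.
- The metric successfully predicts relative policy performance across several generalization scenarios, including sim-to-real transfer in an image-based robotic manipulation task.
- For continuous action spaces and sparse binary rewards, it avoids the instability that the paper attributes to model-based and importance-sampling-based OPE.
- The metric is reported to remain reliable under distribution shift, including when the behavior policy that collected the data differs substantially from the target policy.
- It supports model selection and early stopping without on-policy rollouts, reducing the dependence on expensive real-world trajectory collection.
- Empirically, the PU classification metric correlates strongly with actual returns in both tabular environments and continuous control environments with image observations.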
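One way to check findings like these on your own checkpoints is to compare the ranking induced by an OPE score with the ranking induced by measured success rates. The sketch below uses Spearman rank correlation for that comparison; the choice of statistic, the helper names, and all numbers are assumptions made for illustration and are not taken from the paper's results.

```python
from math import sqrt
from statistics import fmean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical numbers for four checkpoints of the same agent.
ope_scores = [-0.25, -0.10, -0.18, -0.05]    # off-policy metric per checkpoint
success_rates = [0.20, 0.60, 0.45, 0.70]     # measured on-policy success rate

rho = spearman(ope_scores, success_rates)
best_checkpoint = max(range(len(ope_scores)), key=lambda i: ope_scores[i])

print(rho, best_checkpoint)
```

A rank statistic fits this use case because model selection and early stopping only need the ordering of checkpoints to be right, not the scale of the score.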
This summary was generated by AI and reviewed by a human editor.