QUICK REVIEW

[Paper Review] Bridging the Gap Between Value and Policy Based Reinforcement Learning

Ofir Nachum, Mohammad Norouzi|arXiv (Cornell University)|Feb 28, 2017

Reinforcement Learning in Robotics | Cited by 228

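One-sentence summary

The paper introduces Path Consistency Learning (PCL) and Unified PCL, which connect entropy-regularized policy optimization to softmax value consistency, enabling stable off-policy training and a unified, actor-critic-like model.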

ABSTRACT

We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values correspond to optimal entropy regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces. We examine the behavior of PCL in different scenarios and show that PCL can be interpreted as generalizing both actor-critic and Q-learning algorithms. We subsequently deepen the relationship by showing how a single model can be used to represent both a policy and the corresponding softmax state values, eliminating the need for a separate critic. The experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.

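Motivation and objectives

Connect value-based and policy-based RL through entropy-regularized softmax consistency.
Formulate a trajectory-level (multi-step) consistency objective that supports off-policy data.
Propose algorithms (PCL and Unified PCL) that learn the policy and the value function either jointly or within a single unified model.
Demonstrate empirical improvements over strong baselines on benchmark tasks.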

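Proposed method

Define the softmax (entropy-regularized) temporal consistency between the optimal policy and the state values.
Derive a multi-step path consistency error C(s_{i:i+d}, θ, φ), whose deviation from zero measures how far the policy and value estimates are from soft consistency.
Minimize the squared consistency error on sampled sub-trajectories with gradient updates to the policy parameters θ and the value parameters φ.
Draw off-policy data from a replay buffer and on-policy data from rollouts of the current policy.
Provide Unified PCL, which uses a single model ρ to parameterize both the policy and the value, yielding the corresponding V_ρ and π_ρ.
Relate PCL to standard actor-critic and Q-learning, showing that it generalizes both.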
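
The path consistency error described above can be illustrated with a minimal NumPy sketch. It assumes deterministic dynamics and the form C = -V(s_i) + γ^d V(s_{i+d}) + Σ_j γ^j [r_j - τ log π(a_j | s_j)]. The toy MDP (`NEXT_STATE`, `REWARD`) and all function names are hypothetical and chosen only for illustration; they are not the authors' code. The sketch computes soft-optimal values by iterating the softmax backup, then checks that the error vanishes along an arbitrary action sequence.

```python
import numpy as np

def path_consistency(values, rewards, log_probs, tau, gamma):
    """Soft consistency error C for one sub-trajectory of length d.

    values:    V(s_i), ..., V(s_{i+d})                (d + 1 entries)
    rewards:   r(s_{i+j}, a_{i+j}) for j < d          (d entries)
    log_probs: log pi(a_{i+j} | s_{i+j}) for j < d    (d entries)
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    return (-values[0] + gamma ** d * values[-1]
            + np.sum(discounts * (rewards - tau * log_probs)))

# Hypothetical deterministic toy MDP: 3 states, 2 actions.
NEXT_STATE = np.array([[1, 2], [2, 0], [0, 1]])
REWARD = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])
TAU, GAMMA = 0.5, 0.9

def soft_backup(v):
    """One softmax backup: V(s) = tau * log sum_a exp(q(s, a) / tau)."""
    q = REWARD + GAMMA * v[NEXT_STATE]           # shape (3, 2)
    m = q.max(axis=1)                            # subtract max for stability
    new_v = m + TAU * np.log(np.exp((q - m[:, None]) / TAU).sum(axis=1))
    return q, new_v

# Iterate the backup to (numerical) convergence; it contracts at rate GAMMA.
v_star = np.zeros(3)
for _ in range(2000):
    _, v_star = soft_backup(v_star)
q_star, _ = soft_backup(v_star)
log_pi_star = (q_star - v_star[:, None]) / TAU   # log of the optimal policy

def rollout(start, actions):
    """Follow a fixed action sequence; returns visited states and rewards."""
    states, rewards = [start], []
    for a in actions:
        s = states[-1]
        rewards.append(REWARD[s, a])
        states.append(int(NEXT_STATE[s, a]))
    return states, rewards

# An arbitrary action sequence, not sampled from the optimal policy.
actions = [1, 0, 1, 1]
states, rewards = rollout(0, actions)
log_probs = [log_pi_star[s, a] for s, a in zip(states, actions)]

c_opt = path_consistency(v_star[states], rewards, log_probs, TAU, GAMMA)

# Perturbing the value of the first state breaks consistency.
bad_values = v_star[states] + np.array([1.0, 0.0, 0.0, 0.0, 0.0])
c_bad = path_consistency(bad_values, rewards, log_probs, TAU, GAMMA)

loss_opt, loss_bad = 0.5 * c_opt ** 2, 0.5 * c_bad ** 2
print(c_opt, c_bad, loss_opt, loss_bad)
```

In this sketch `c_opt` should be zero up to floating-point error even though the actions were picked by hand, which is the property that lets the squared error be minimized on off-policy sub-trajectories. A learning implementation would replace the tabular `v_star` and `log_pi_star` with parameterized models and differentiate the squared error with respect to their parameters.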

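Experimental results

Research questions

RQ1: How does entropy-regularized softmax temporal consistency connect optimal policy probabilities to softmax state values?
RQ2: Can multi-step path consistency enable stable off-policy learning and unify actor-critic with Q-learning?
RQ3: Is a single model sufficient to represent both the policy and the value, and how does Unified PCL perform relative to PCL?
RQ4: What empirical gains do PCL and Unified PCL deliver across the benchmarks?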

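Key findings

Under entropy regularization (τ > 0), softmax temporal consistency links optimal policy probabilities to softmax state values.
PCL minimizes a path-level consistency error over multi-step trajectories, which enables stable off-policy learning.
Unified PCL learns the policy and the value with a single model, offering a new actor-critic paradigm.
On several benchmarks, PCL and Unified PCL outperform strong actor-critic and Q-learning baselines, and incorporating expert trajectories improves performance further.
A replay buffer containing off-policy data is compatible with the path consistency objective and yields competitive results.
On the harder tasks, PCL matches or exceeds A3C, and it consistently outperforms DQN in the reported experiments.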
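
The single-model finding above can be made concrete with a short sketch. It assumes the parameterization in which one score function Q_ρ(s, a) yields both quantities: V_ρ(s) = τ log Σ_a exp(Q_ρ(s, a) / τ) and π_ρ(a | s) = exp((Q_ρ(s, a) - V_ρ(s)) / τ). The function name and the example scores are hypothetical; in practice `q` would be the output of a learned model.

```python
import numpy as np

def unified_value_and_log_policy(q, tau):
    """Derive V_rho and log pi_rho from a single table of scores.

    q: array of shape (num_states, num_actions), standing in for the
       output of one model Q_rho(s, a).
    """
    q = np.asarray(q, dtype=float)
    scaled = q / tau
    m = scaled.max(axis=1, keepdims=True)        # subtract max for stability
    lse = m + np.log(np.exp(scaled - m).sum(axis=1, keepdims=True))
    v = tau * lse[:, 0]                          # V_rho(s)
    log_pi = scaled - lse                        # (Q_rho - V_rho) / tau
    return v, log_pi

# Hypothetical scores for two states and three actions.
q = np.array([[1.0, 2.0, 0.5],
              [0.0, 0.0, 0.0]])
tau = 0.1
v, log_pi = unified_value_and_log_policy(q, tau)
print(v)
print(np.exp(log_pi))
```

Because the policy is a softmax of the same scores that define the value, no separate critic is needed: the path consistency error can be evaluated with V_ρ and π_ρ substituted for the separately parameterized value and policy. As τ shrinks, V_ρ approaches the maximum score in each state and π_ρ concentrates on the highest-scoring action.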

This review was generated by AI and reviewed by a human editor.