QUICK REVIEW

[Paper Review] Temporal Difference Models: Model-Free Deep RL for Model-Based Control

Vitchyr H. Pong, Shixiang Gu|arXiv (Cornell University)|Feb 25, 2018

Reinforcement Learning in Robotics|References: 27|Cited by: 44

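One-Sentence Summary

Summary: Temporal difference models (TDMs) are a family of goal-conditioned value functions that are trained with model-free learning and act as implicit models for planning, combining model-based sample efficiency with model-free asymptotic performance.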

ABSTRACT

Model-free reinforcement learning (RL) is a powerful, general tool for learning complex behaviors. However, its sample complexity is often impractically large for solving challenging real-world problems, even with off-policy algorithms such as Q-learning. A limiting factor in classic model-free RL is that the learning signal consists only of scalar rewards, ignoring much of the rich information contained in state transition tuples. Model-based RL uses this information, by training a predictive model, but often does not achieve the same asymptotic performance as model-free RL due to model bias. We introduce temporal difference models (TDMs), a family of goal-conditioned value functions that can be trained with model-free learning and used for model-based control. TDMs combine the benefits of model-free and model-based RL: they leverage the rich information in state transitions to learn very efficiently, while still attaining asymptotic performance that exceeds that of direct model-based RL methods. Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in efficiency compared to state-of-the-art model-based and model-free methods.

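Motivation and Objectives

Motivate the need to combine the sample efficiency of model-based planning with the asymptotic performance of model-free learning.
Introduce temporal difference models (TDMs) as a bridge between model-free and model-based reinforcement learning.
Show how goal relabeling and multi-step horizons enable efficient off-policy learning of TDMs.
Demonstrate that TDMs achieve superior sample efficiency and final performance on continuous control tasks.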

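Proposed Method

Define TDMs as goal-conditioned Q-functions with an additional horizon parameter tau.
Use a distance-based reward D(s, s_g) and a horizon-aware Q-learning recursion for Q(s, a, s_g, tau).
Relabel experience with different goals s_g and horizons tau to maximize data efficiency.
Extract a policy either by planning with the learned Q-function in the style of model predictive control, or by selecting actions directly from Q.
Optionally use a vectorized (per-dimension) distance reward to provide richer supervision.
Provide an algorithm (Algorithm 1) for off-policy training with a replay buffer and target networks.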
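
The horizon-aware recursion and the relabeling step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the target equals the negative distance to the goal when tau reaches 0 and otherwise bootstraps from the value at horizon tau - 1, that D is an L1 distance, and that goals and horizons are sampled uniformly. The names `tdm_target`, `relabel`, `next_q_max`, and `candidate_goals` are hypothetical.

```python
import numpy as np

def tdm_target(next_state, goal, tau, next_q_max):
    """Scalar training target for one (relabeled) transition.

    Assumed recursion (illustrative, scalar L1 distance):
      tau == 0: target = -D(s', s_g)
      tau  > 0: target = max_a' Q(s', a', s_g, tau - 1)
    `next_q_max` is a hypothetical callable standing in for the
    target network's max over actions.
    """
    if tau == 0:
        return -float(np.abs(next_state - goal).sum())
    return float(next_q_max(next_state, goal, tau - 1))

def relabel(transition, candidate_goals, max_tau, rng):
    """Attach a freshly sampled goal and horizon to a stored transition."""
    state, action, next_state = transition
    goal = candidate_goals[rng.integers(len(candidate_goals))]
    tau = int(rng.integers(0, max_tau + 1))  # uniform over 0..max_tau
    return state, action, next_state, goal, tau
```

Because the target depends only on the stored next state, the sampled goal, and the sampled horizon, a single transition from the replay buffer can be reused with many different (s_g, tau) pairs, which is the source of the data efficiency described above.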

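Experimental Results

Research Questions

RQ1: Can a goal-conditioned value function with a planning horizon tau interpolate between model-based and model-free learning?
RQ2: On continuous control tasks, are temporal difference models more sample-efficient than purely model-based or purely model-free methods?
RQ3: Does relabeling with different goals and horizons improve data efficiency in off-policy learning?
RQ4: How can TDMs be used for practical policy extraction, through planning or direct control?
RQ5: How do vectorized distance rewards and the choice of the horizon parameter affect performance?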

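Key Findings

TDMs are substantially more sample-efficient than state-of-the-art model-free methods on multiple continuous control tasks.
On the harder tasks, TDMs reach better final performance than purely model-based methods, which is attributed to reduced model bias.
Relabeling with different goals and horizons yields significant gains in data efficiency, accelerating the learning of both short-horizon and long-horizon behavior.
Vectorized (per-dimension) distance rewards are clearly more sample-efficient than scalar rewards.
TDMs extend to a real-world robotic setting: on a 7-DoF Sawyer arm they learn more efficiently than DDPG.
Ablations indicate that the horizon tau controls the interpolation between the model-based and model-free regimes, and that vectorized rewards improve learning relative to scalar rewards.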
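
The difference between scalar and vectorized distance rewards can be made concrete with a small sketch. It assumes an L1-style distance, so the scalar reward is the sum of the per-dimension terms; the function names are hypothetical and the exact distance used in the paper may differ.

```python
import numpy as np

def scalar_distance_reward(state, goal):
    """One number per transition: negative L1 distance to the goal."""
    return -float(np.abs(state - goal).sum())

def vector_distance_reward(state, goal):
    """One number per state dimension: negative absolute error."""
    return -np.abs(state - goal)

state = np.array([1.0, -2.0, 0.5])
goal = np.array([0.0, 0.0, 0.5])
scalar_r = scalar_distance_reward(state, goal)
vector_r = vector_distance_reward(state, goal)
```

With the vectorized form, a Q-function with one output per state dimension receives a separate regression target for each dimension instead of a single summed value, which is one plausible reading of why it supervises learning more densely.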

This review was generated by AI and reviewed by human editors.