QUICK REVIEW

[Paper Review] AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Abhishek Gupta|arXiv (Cornell University)|Jun 16, 2020

Reinforcement Learning in Robotics | References: 60 | Cited by: 71

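One-Sentence Summary

AWAC is an off-policy actor-critic algorithm that learns from offline datasets and then refines the policy efficiently through online fine-tuning. It does so by placing an implicit constraint on the actor, which makes it possible to acquire skills quickly from demonstrations or suboptimal data.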

ABSTRACT

Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could either constitute expert demonstrations or sub-optimal prior data that illustrates potentially useful transitions. While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with offline data and actually continue to improve it further with online RL. In this paper we analyze why this problem is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.

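Research Motivation and Goals

Advance practical reinforcement learning by efficiently using large offline datasets to pre-train policies for real-world robotics.
Develop a simple, data-efficient algorithm that combines offline pre-training with online fine-tuning without explicit behavior policy modeling.
Show that incorporating prior data reduces online training time across a diverse set of robotic tasks.
Evaluate robustness to suboptimal offline data and demonstrate real-world applicability.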

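Proposed Method

Estimate Q^π(s, a) with an off-policy critic trained by TD bootstrapping.
Improve the policy by maximizing the advantage A^{π_k}(s, a) under an implicit KL-style constraint, without requiring an explicit behavior model.
Derive the closed-form nonparametric actor solution π*(a|s) ∝ π_β(a|s) exp(A^{π_k}(s, a)/λ) and project it onto the parametric policy by minimizing a forward KL divergence.
Parameterize the actor and critic with neural networks, and update the actor with a supervised-style weighted maximum likelihood objective whose weights come from advantages computed with the learned critic (Eq. 13).
Use a replay buffer β that holds both offline and online data; online data is added gradually after the offline training steps.
Compare against AWR and ABM/MPO-style methods to show the benefit of TD bootstrapping and of avoiding an explicit behavior model.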
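
The weighted maximum likelihood actor update described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy batch, and the choice of V(s) as the baseline are assumptions made for illustration. Because the (s, a) pairs are sampled from the buffer β, the π_β term in the closed-form solution does not need to be modeled; it enters only through the sampling.

```python
import numpy as np

def awac_actor_weights(q_values, v_values, lam=1.0):
    """Per-sample weights exp(A / lambda), with A = Q(s, a) - V(s).

    `lam` plays the role of the temperature lambda in the closed-form
    solution; the default value here is an arbitrary assumption.
    """
    advantages = q_values - v_values
    return np.exp(advantages / lam)

def weighted_log_likelihood_loss(log_probs, weights):
    """Negative advantage-weighted log-likelihood, averaged over the batch."""
    return -np.mean(weights * log_probs)

# Toy batch of three (state, action) pairs drawn from the replay buffer.
q = np.array([1.0, 0.5, 2.0])          # critic estimates Q(s, a)
v = np.array([1.0, 1.0, 1.0])          # baseline V(s) (hypothetical values)
log_pi = np.array([-1.2, -0.7, -2.0])  # log pi_theta(a | s) for the buffer actions

w = awac_actor_weights(q, v, lam=1.0)
loss = weighted_log_likelihood_loss(log_pi, w)
```

Actions with a positive advantage receive a weight above 1 and are imitated more strongly, while actions with a negative advantage are down-weighted rather than discarded. In a real training loop the weights would be treated as constants (no gradient flows through the critic) and `log_pi` would come from the policy network.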

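Experimental Results

Research Questions

RQ1: Can AWAC effectively combine offline pre-training with online fine-tuning to learn complex robotic control tasks?
RQ2: How does AWAC perform with suboptimal or random offline data compared with demonstrations?
RQ3: Does avoiding explicit behavior modeling improve the efficiency and stability of online fine-tuning?
RQ4: How does AWAC compare with prior offline and online RL methods on high-dimensional, sparse-reward robotic tasks?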

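Key Findings

AWAC learns quickly when moving from offline data to online fine-tuning across a diverse set of robotic tasks, including dexterous manipulation and real-world experiments.
AWAC fine-tunes more efficiently than purely offline or purely online baselines, solving challenging tasks with limited online data (e.g., 120K timesteps for the pen task).
The method can use demonstrations, suboptimal data, or random exploration data without any change to the algorithm, and still reduces the amount of online data required.
Avoiding explicit behavior policy modeling makes AWAC less conservative than prior offline RL methods and more effective at online refinement.
TD bootstrapping for the critic and the implicit constraint on the actor are the key design choices; variants that lack either one perform worse.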
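
To make the TD bootstrapping design choice concrete, the sketch below contrasts a one-step bootstrapped critic target with a Monte Carlo return-to-go, the kind of return-based estimate that AWR-style methods typically regress onto. This is an illustrative sketch under assumptions, not code from the paper: the function names, the discount values, and the toy trajectory are all hypothetical.

```python
import numpy as np

def td_target(reward, next_q, done, gamma=0.99):
    """One-step bootstrapped target: r + gamma * Q(s', a') unless terminal."""
    return reward + gamma * (1.0 - done) * next_q

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return-to-go for every step of one complete trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single stored transition is enough to form a TD target ...
target = td_target(reward=1.0, next_q=2.0, done=0.0, gamma=0.5)

# ... whereas a Monte Carlo return needs the whole trajectory that produced it.
mc = monte_carlo_returns(np.array([0.0, 0.0, 1.0]), gamma=0.5)
```

The TD target depends only on individual transitions and on the current critic, so it can reuse off-policy data from the replay buffer regardless of which policy collected it. The Monte Carlo return reflects the value of whichever policy generated the trajectory.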


This review was generated by AI and reviewed by a human editor.