QUICK REVIEW

[Paper Review] Generalizing Skills with Semi-Supervised Reinforcement Learning

Chelsea Finn, Tianhe Yu|arXiv (Cornell University)|Dec 1, 2016

Reinforcement Learning in Robotics|References: 34|Cited by: 34

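One-Sentence Summary

This paper proposes semi-supervised reinforcement learning (SSRL), in which an agent must generalize a policy learned in labeled environments (where reward is available) to unlabeled, real-world environments (where no reward signal exists). The proposed S3G method treats the agent's prior experience in the labeled MDPs as demonstrations for inverse reinforcement learning, infers a reward function for the unlabeled MDPs, and thereby improves policy generalization; on continuous control tasks with visual input it outperforms both standard reinforcement learning and supervised reward regression.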

ABSTRACT

Deep reinforcement learning (RL) can acquire complex behaviors from low-level inputs, such as images. However, real-world applications of such methods require generalizing to the vast variability of the real world. Deep networks are known to achieve remarkable generalization when provided with massive amounts of labeled data, but can we provide this breadth of experience to an RL agent, such as a robot? The robot might continuously learn as it explores the world around it, even while deployed. However, this learning requires access to a reward function, which is often hard to measure in real-world domains, where the reward could depend on, for example, unknown positions of objects or the emotional state of the user. Conversely, it is often quite practical to provide the agent with reward functions in a limited set of situations, such as when a human supervisor is present or in a controlled setting. Can we make use of this limited supervision, and still benefit from the breadth of experience an agent might collect on its own? In this paper, we formalize this problem as semisupervised reinforcement learning, where the reward function can only be evaluated in a set of "labeled" MDPs, and the agent must generalize its behavior to the wide range of states it might encounter in a set of "unlabeled" MDPs, by using experience from both settings. Our proposed method infers the task objective in the unlabeled MDPs through an algorithm that resembles inverse RL, using the agent's own prior experience in the labeled MDPs as a kind of demonstration of optimal behavior. We evaluate our method on challenging tasks that require control directly from images, and show that our approach can improve the generalization of a learned deep neural network policy by using experience for which no reward function is available. We also show that our method outperforms direct supervised learning of the reward.

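Motivation and Objectives

Address the challenge of generalizing policies to real-world environments when reward-labeled experience is available only in a limited set of environments.
Enable lifelong reinforcement learning in robotics and other domains, where real-world experience is collected continuously but reward signals are sparse or hard to obtain.
Formalize a new learning paradigm, semi-supervised reinforcement learning (SSRL), in which the agent learns from a mix of labeled (reward available) and unlabeled (reward unavailable) environments.
Improve policy generalization by using unlabeled experience not only for policy learning but also to shape the reward function through inverse reinforcement learning (IRL).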

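Proposed Method

The method frames SSRL as a setting in which the agent trains a policy on a small number of labeled MDPs (reward known) and must generalize to a larger set of unlabeled MDPs (no reward).
Inverse reinforcement learning (IRL) infers the reward function in the unlabeled MDPs, treating the agent's own behavior in the labeled MDPs as demonstrations.
The inferred reward function is then used to train the policy in the unlabeled environments, enabling generalization without direct reward supervision.
The method combines supervised imitation learning from the labeled MDPs with self-supervised reward inference from the unlabeled MDPs by jointly optimizing the policy and the reward function.
For vision-based tasks, visual features are pretrained with reinforcement learning in the labeled MDPs and then used to initialize the policy and reward networks in the unlabeled setting.
The method is evaluated both with end-to-end fine-tuning and with fixed visual features, showing robustness to how the features are adapted.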
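The reward-inference step can be illustrated with a minimal sketch. It is not the S3G implementation: it assumes a toy 1-D state, a linear reward over hand-picked features, and a maximum-entropy IRL objective whose partition function is approximated over a small, uniformly weighted pool of samples (no importance weights, no neural networks, no policy update). The names `features`, `fit_reward`, `demo_states`, and `unlabeled_states` are hypothetical and chosen only for illustration.

```python
import math

def features(x):
    # Hypothetical hand-picked features of a 1-D state x (not from the paper).
    return [x, x * x]

def reward(theta, x):
    # Linear reward r_theta(x) = theta . features(x).
    return sum(t * f for t, f in zip(theta, features(x)))

def mean_features(states, weights=None):
    # Weighted average of feature vectors; uniform weights by default.
    if weights is None:
        weights = [1.0 / len(states)] * len(states)
    dim = len(features(states[0]))
    return [sum(w * features(s)[i] for s, w in zip(states, weights))
            for i in range(dim)]

def softmax_weights(theta, states):
    # p(x) proportional to exp(r_theta(x)) over the finite sample pool.
    rewards = [reward(theta, s) for s in states]
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp(r - top) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def fit_reward(demo_states, unlabeled_states, steps=5000, lr=0.05):
    # Gradient ascent on the MaxEnt log-likelihood of the demonstrations.
    # Gradient = (mean demo features) - (mean features under the model).
    pool = demo_states + unlabeled_states
    demo_mean = mean_features(demo_states)
    theta = [0.0, 0.0]
    for _ in range(steps):
        model_mean = mean_features(pool, softmax_weights(theta, pool))
        theta = [t + lr * (d - m)
                 for t, d, m in zip(theta, demo_mean, model_mean)]
    return theta

# Toy stand-in for states visited by a policy already trained with reward
# in the labeled MDPs (assumed goal near x = 1.0).
demo_states = [0.9, 1.0, 1.1, 0.95, 1.05]
# Toy stand-in for states sampled in unlabeled MDPs (no reward observed).
unlabeled_states = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

theta = fit_reward(demo_states, unlabeled_states)
```

Under these assumptions, the fitted reward scores states near the demonstrated region higher than distant ones, which is the signal a policy optimizer would then use in the unlabeled environments. In the setting the review describes, the reward and policy are optimized jointly rather than in the single pass shown here.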

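Experimental Results

Research Questions

RQ1: Can an agent generalize a policy learned in a few labeled environments to the much wider range of unlabeled, real-world environments where reward is unavailable?
RQ2: Using prior experience in the labeled MDPs as demonstrations, can inverse reinforcement learning effectively infer the reward function in the unlabeled MDPs and thereby improve policy generalization?
RQ3: Does using unlabeled experience to shape the reward function generalize better than using it only for policy training, or than obtaining the reward through supervised regression?
RQ4: Without full reward supervision, and when sample efficiency and model capacity are limited, can the inferred reward function outperform the true reward function?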

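Key Findings

S3G outperforms standard reinforcement learning trained only on labeled data across all evaluated tasks, including obstacle navigation, the 2-link reacher, and the half-cheetah task, demonstrating improved generalization to unseen state variations.
On the vision-based 2-link reacher task, S3G reaches a 92% success rate, compared with 85% for supervised reward regression and 69% for standard reinforcement learning, indicating an advantage for IRL-based reward shaping.
On the obstacle navigation task, S3G reaches a 79% success rate, compared with 65% for standard reinforcement learning and 29% for supervised reward regression, indicating that inferring the reward from experience improves generalization.
Under certain conditions, S3G even exceeds the "ideal" performance obtained with the true reward on the 2-link reacher task (80%), suggesting that the inferred reward function may be preferable to the true reward when data and model capacity are limited.
The method still generalizes well when the visual features are frozen, indicating that representations learned with reinforcement learning in the labeled environments are robust and transfer to the unlabeled environments.
Overall, the results suggest that when data and compute are limited, reward shaping through inverse reinforcement learning on prior experience can be more effective than directly regressing the reward with supervision.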


This review was generated by AI and reviewed by a human editor.