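[Paper Walkthrough] COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning
COG uses offline reinforcement learning to combine task-specific data with a large, unlabeled prior dataset, so that the policy can compose previously learned behaviors to solve multi-step tasks from new initial conditions.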
Reinforcement learning has been applied to a wide variety of robotics problems, but most of such applications involve collecting data from scratch for each new task. Since the amount of robot data we can collect for any single task is limited by time and cost considerations, the learned behavior is typically narrow: the policy can only execute the task in a handful of scenarios that it was trained on. What if there was a way to incorporate a large amount of prior data, either from previously solved tasks or from unsupervised or undirected environment interaction, to extend and generalize learned behaviors? While most prior work on extending robotic skills using pre-collected data focuses on building explicit hierarchies or skill decompositions, we show in this paper that we can reuse prior data to extend new skills simply through dynamic programming. We show that even when the prior data does not actually succeed at solving the new task, it can still be utilized for learning a better policy, by providing the agent with a broader understanding of the mechanics of its environment. We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets for solving a new task, with our hardest experimental setting involving composing four robotic skills in a row: picking, placing, drawer opening, and grasping, where a +1/0 sparse reward is provided only on task completion. We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands, and present results in both simulated and real world domains. Additional materials and source code can be found on our project website: https://sites.google.com/view/cog-rl
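Motivation and Goals
- Motivation: show how prior, task-agnostic data can extend the generalization of robot policies.
- Propose a simple data-driven approach that stitches behaviors together through offline reinforcement learning, without an explicit hierarchy.
- Demonstrate that prior data helps in learning new, multi-stage tasks from initial conditions never seen in the task data.
- Show end-to-end learning from visual observations to low-level control using offline data and sparse rewards.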
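Proposed Method
- Extend conservative Q-learning (CQL) so that offline RL uses both prior data and task-specific data.
- Initialize the replay buffer with the prior data, labeled with zero reward, then train on the mixture of prior data and task data.
- Rely on the Q-learning backup to propagate value from the rewarded task trajectories into the regions of state space covered by the prior data.
- Optionally fine-tune the offline-trained policy with a limited amount of online interaction.
- Train an end-to-end convolutional network that maps 48x48 or 64x64 images and the robot state to continuous 6-DoF actions plus a discrete gripper command.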
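The value-propagation step above can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the five-state chain, the `stay`/`advance` actions, and the helper names `label_zero_reward`, `q_iteration`, and `greedy` are all hypothetical, and the CQL regularizer and image-based networks are deliberately omitted. It only shows how a Q-learning backup over the merged dataset carries the sparse task reward back through zero-reward prior transitions.

```python
# Toy illustration (assumed setup, not the paper's code): a chain of states 0..4.
# Actions: 0 = stay, 1 = advance. Reward +1 only on reaching state 4.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9


def label_zero_reward(transitions):
    """Turn unlabeled (s, a, s_next) prior transitions into zero-reward tuples."""
    return [(s, a, 0.0, s_next, False) for (s, a, s_next) in transitions]


def q_iteration(dataset, sweeps=50):
    """Repeated Q-learning backups over a fixed dataset (deterministic toy MDP)."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            # Bellman backup: bootstrap from the best action at the next state.
            q[s][a] = r if done else r + GAMMA * max(q[s_next])
    return q


def greedy(q, s):
    """Action with the highest Q-value in state s."""
    return max(range(N_ACTIONS), key=lambda a: q[s][a])


# Prior data: undirected interaction covering states 0..2, no reward labels.
prior = label_zero_reward([(0, 1, 1), (1, 1, 2), (0, 0, 0), (1, 0, 1)])
# Task data: always starts at state 2 and ends with the sparse +1 reward.
task = [(2, 1, 0.0, 3, False), (3, 1, 1.0, 4, True)]

q_task_only = q_iteration(task)
q_merged = q_iteration(prior + task)

print(q_task_only[0])            # no value at the unseen start state
print(round(q_merged[0][1], 3))  # value propagated back to state 0
print(greedy(q_merged, 0))       # greedy action at state 0
```

With task data alone, state 0 never appears in the dataset and its Q-values stay at zero. With the merged dataset, the reward flows back through the prior transitions and the greedy policy advances from state 0, even though no single trajectory covers the full path. In a real offline setting, actions outside the dataset can be overestimated, which is the issue the conservative term in CQL is meant to address.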
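Experimental Results
Research Questions
- RQ1: Can model-free offline RL use a task-agnostic prior dataset to learn new skills?
- RQ2: Can the policy solve new tasks from new initial conditions by stitching together behaviors seen in the prior data?
- RQ3: How does using prior data through offline RL compare with behavioral cloning baselines that incorporate the same prior data?
- RQ4: After offline learning with prior data, is online fine-tuning necessary or beneficial?
- RQ5: How well does the approach carry over from simulation to a real-world robotic system?
Main Findings
- COG solves multi-stage tasks by composing drawer opening, grasping, and obstacle removal, even though the full sequence never appears in the data.
- In simulation, COG outperforms the behavioral cloning baselines, SAC, and the ablations when starting from new initial conditions.
- Online fine-tuning further raises the success rate on the drawer task to over 90% with only a relatively modest amount of additional data.
- In the real-world experiments, when the drawer starts closed, the method achieves a 7/8 success rate, outperforming the BC-oracle baseline.
- BC-init fails to solve unseen initial conditions, which highlights the value of integrating the offline data into learning rather than using it only for pretraining.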
This walkthrough was generated by AI and reviewed by a human editor.