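[Paper Explained] Never Stop Learning: The Effectiveness of Fine-Tuning in Robotic Reinforcement Learning
The paper shows that fine-tuning pre-trained off-policy RL policies lets vision-based robotic grasping adapt to new backgrounds, objects, lighting, and morphology changes with less than 0.2% of the data needed to train from scratch, and that this outperforms ImageNet-based pre-training.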
One of the great promises of robot learning systems is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most of the robot learning systems today are deployed as a fixed policy and they are not being adapted after their deployment. Can we efficiently adapt previously learned behaviors to new environments, objects and percepts in the real world? In this paper, we present a method and empirical evidence towards a robot learning framework that facilitates continuous adaption. In particular, we demonstrate how to adapt vision-based robotic manipulation policies to new variations by fine-tuning via off-policy reinforcement learning, including changes in background, object shape and appearance, lighting conditions, and robot morphology. Further, this adaptation uses less than 0.2% of the data necessary to learn the task from scratch. We find that our approach of adapting pre-trained policies leads to substantial performance gains over the course of fine-tuning, and that pre-training via RL is essential: training from scratch or adapting from supervised ImageNet features are both unsuccessful with such small amounts of data. We also find that these positive results hold in a limited continual learning setting, in which we repeatedly fine-tune a single lineage of policies using data from a succession of new tasks. Our empirical conclusions are consistently supported by experiments on simulated manipulation tasks, and by 52 unique fine-tuning experiments on a real robotic grasping system pre-trained on 580,000 grasps.
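Motivation and Goals
- Demonstrate how vision-based robotic manipulation policies can be adapted to new variations by fine-tuning with off-policy RL.
- Quantify the data efficiency and performance gains of fine-tuning relative to training from scratch or starting from ImageNet features.
- Evaluate how robust the pre-trained policy is across diverse environment and morphology changes.
- Study continual learning by repeatedly fine-tuning a single policy on a succession of tasks.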
Proposed Method
- Pre-train a vision-based grasping policy (QT-Opt) on 580,000 real grasp attempts across diverse objects.
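- Evaluate the base policy under a set of challenging changes (background, lighting, gripper shape, robot morphology, and unseen transparent objects).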
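- Propose a simple offline fine-tuning procedure that initializes from the pre-trained policy and learns the target task from a combination of base-task and target-task data.
- Collect offline exploration data for the target task (at most 800 grasps) and update the policy with a small learning rate, using data from both the base task and the target task.
- Evaluate fine-tuned performance on the target task and compare it against the Scratch and ImageNet baselines.
- Run continual learning experiments by fine-tuning on several tasks in sequence, measuring transfer and stability.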
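The fine-tuning step described above (start from pre-trained parameters, mix base-task and target-task data, use a small learning rate) can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the function names, the 50/50 mixing ratio, the batch size, and the learning-rate value are all assumptions chosen for the example, and `update_fn` stands in for whatever off-policy RL update (for example a QT-Opt training step) is actually used.

```python
import random


def sample_mixed_batch(base_data, target_data, batch_size,
                       target_fraction=0.5, rng=None):
    """Draw one training batch that mixes base-task and target-task samples.

    target_fraction is a hypothetical knob; the 0.5 default is an assumption.
    """
    rng = rng or random.Random()
    n_target = round(batch_size * target_fraction)
    n_base = batch_size - n_target
    # Sample with replacement, since the target set may be very small.
    batch = rng.choices(target_data, k=n_target) + rng.choices(base_data, k=n_base)
    rng.shuffle(batch)
    return batch


def fine_tune(pretrained_params, base_data, target_data, update_fn, steps,
              batch_size=32, learning_rate=1e-4, target_fraction=0.5, seed=0):
    """Fine-tune starting from pre-trained parameters.

    update_fn(params, batch, learning_rate) is a placeholder for the real
    off-policy RL update; the learning-rate default is illustrative only.
    """
    rng = random.Random(seed)
    params = pretrained_params
    for _ in range(steps):
        batch = sample_mixed_batch(base_data, target_data, batch_size,
                                   target_fraction, rng)
        params = update_fn(params, batch, learning_rate)
    return params
```

Under the same assumptions, the continual learning setting would amount to calling `fine_tune` repeatedly, passing the parameters returned for one target task as `pretrained_params` for the next.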
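Experimental Results
Research Questions
- RQ1: How well do pre-trained off-policy RL policies adapt to a wide range of task and environment changes given limited new data?
- RQ2: Is RL-based pre-training necessary, or is supervised ImageNet pre-training sufficient for fast fine-tuning in robotics?
- RQ3: Can offline fine-tuning support continual learning over a sequence of tasks with minimal performance degradation?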
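Key Findings
- Fine-tuning with off-policy RL yields substantial performance gains on all challenge tasks from relatively small datasets (as few as 25 exploration grasps).
- On tasks such as Checkerboard Backing, Harsh Lighting, and Transparent Bottles, RL fine-tuning outperforms the Scratch and ImageNet pre-training baselines.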
- The method reaches near state-of-the-art performance with only about 0.2% of the data used to learn the base task.
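- In continual learning, sequential fine-tuning typically costs only 4–7 percentage points of performance compared with single-step fine-tuning.
- RL-based pre-training leads to larger parameter changes in the image-processing layers than ImageNet-based pre-training, suggesting effective adaptation to the new sensorimotor task.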
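The "less than 0.2%" figure can be sanity-checked from the numbers quoted above: at most 800 target-task grasps against 580,000 pre-training grasps. The short calculation below is only a check of that arithmetic; the function name is made up for the example.

```python
BASE_GRASPS = 580_000      # pre-training grasps, from the abstract
MAX_TARGET_GRASPS = 800    # largest fine-tuning dataset mentioned above
MIN_TARGET_GRASPS = 25     # smallest fine-tuning dataset mentioned above


def data_fraction_percent(target_grasps, base_grasps):
    """Return the target dataset size as a percentage of the base dataset."""
    return 100.0 * target_grasps / base_grasps


print(f"{data_fraction_percent(MAX_TARGET_GRASPS, BASE_GRASPS):.3f}%")  # 0.138%
print(f"{data_fraction_percent(MIN_TARGET_GRASPS, BASE_GRASPS):.3f}%")  # 0.004%
```

Even the largest fine-tuning set is therefore about 0.14% of the pre-training data, which is consistent with the stated bound.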
This explanation was generated by AI and reviewed by a human editor.