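[Paper Summary] Actor-Critic Pretraining for Proximal Policy Optimization
This paper uses expert demonstrations to pretrain both the actor and the critic for PPO, then fine-tunes them with PPO to improve sample efficiency on robotic tasks.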
Reinforcement learning (RL) actor-critic algorithms enable autonomous learning but often require a large number of environment interactions, which limits their applicability in robotics. Leveraging expert data can reduce the number of required environment interactions. A common approach is actor pretraining, where the actor network is initialized via behavioral cloning on expert demonstrations and subsequently fine-tuned with RL. In contrast, the initialization of the critic network has received little attention, despite its central role in policy optimization. This paper proposes a pretraining approach for actor-critic algorithms like Proximal Policy Optimization (PPO) that uses expert demonstrations to initialize both networks. The actor is pretrained via behavioral cloning, while the critic is pretrained using returns obtained from rollouts of the pretrained policy. The approach is evaluated on 15 simulated robotic manipulation and locomotion tasks. Experimental results show that actor-critic pretraining improves sample efficiency by 86.1% on average compared to no pretraining and by 30.9% to actor-only pretraining.
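Motivation and Goals
- Improve the sample efficiency of reinforcement learning in robotics by leveraging expert data.
- Propose a pretraining scheme that initializes both the actor and the critic networks of an actor-critic algorithm.
- Show how pretraining the critic on returns from the pretrained policy complements actor pretraining.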
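Proposed Method
- Pretrain the actor via behavioral cloning on expert demonstrations.
- Pretrain the critic using returns obtained from rollouts of the pretrained policy.
- Fine-tune the jointly pretrained actor-critic with Proximal Policy Optimization (PPO).
- Apply the method to PPO within a theoretically aligned pretraining framework.
- Evaluate sample efficiency and convergence across multiple benchmark tasks.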
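
The critic pretraining step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes Monte Carlo discounted returns as regression targets and swaps the critic network for a linear least-squares fit so the example stays self-contained. The function names (`discounted_returns`, `fit_linear_critic`, `critic_value`) and the discount factor `gamma` are hypothetical choices made for illustration.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for a single episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def _with_bias(states):
    """Append a constant feature so the linear critic can learn an offset."""
    states = np.asarray(states, dtype=np.float64)
    return np.hstack([states, np.ones((len(states), 1))])


def fit_linear_critic(states, returns):
    """Least-squares fit of V(s) ~ w . s + b.

    Assumption: a linear model stands in for the critic network here;
    a neural critic would minimize the same squared error by gradient descent.
    """
    weights, *_ = np.linalg.lstsq(_with_bias(states), returns, rcond=None)
    return weights


def critic_value(weights, states):
    """Evaluate the fitted linear critic on a batch of states."""
    return _with_bias(states) @ weights


# Hypothetical rollout of the behavior-cloned policy: three states, unit rewards.
rollout_states = np.array([[0.0], [1.0], [2.0]])
rollout_rewards = [1.0, 1.0, 1.0]

targets = discounted_returns(rollout_rewards, gamma=0.5)
critic_weights = fit_linear_critic(rollout_states, targets)
```

In this sketch the rollouts would come from the behavior-cloned actor, so the critic starts PPO fine-tuning with value estimates that already match the policy it is paired with.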

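Experimental Results
Research Questions
- RQ1: When expert data is available, does jointly pretraining the actor and critic improve sample efficiency compared to PPO without pretraining?
- RQ2: In PPO, does critic pretraining provide additional gains over actor-only pretraining?
- RQ3: How does actor-critic pretraining affect convergence across diverse robotic manipulation and locomotion tasks?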
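Key Findings
- Compared to no pretraining, actor-critic pretraining improves sample efficiency by 86.1% on average.
- Compared to actor-only pretraining, actor-critic pretraining improves sample efficiency by 30.9%.
- The method is evaluated on 15 simulated robotic manipulation and locomotion tasks and shows improved convergence.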
This summary was generated by AI and reviewed by a human editor.