QUICK REVIEW

[Paper Review] Maximum Entropy-Regularized Multi-Goal Reinforcement Learning

Rui Zhao, Xudong Sun | arXiv (Cornell University) | May 21, 2019

Reinforcement Learning in Robotics | Cited by 47

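One-Sentence Summary

Overview: The paper introduces a reward-weighted entropy objective for multi-goal reinforcement learning, together with Maximum Entropy-based Prioritization (MEP), so that the agent learns from diverse achieved goals; this improves performance and sample efficiency on multi-goal robotic tasks.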

ABSTRACT

In Multi-Goal Reinforcement Learning, an agent learns to achieve multiple goals with a goal-conditioned policy. During learning, the agent first collects the trajectories into a replay buffer, and later these trajectories are selected randomly for replay. However, the achieved goals in the replay buffer are often biased towards the behavior policies. From a Bayesian perspective, when there is no prior knowledge about the target goal distribution, the agent should learn uniformly from diverse achieved goals. Therefore, we first propose a novel multi-goal RL objective based on weighted entropy. This objective encourages the agent to maximize the expected return, as well as to achieve more diverse goals. Secondly, we developed a maximum entropy-based prioritization framework to optimize the proposed objective. For evaluation of this framework, we combine it with Deep Deterministic Policy Gradient, both with or without Hindsight Experience Replay. On a set of multi-goal robotic tasks of OpenAI Gym, we compare our method with other baselines and show promising improvements in both performance and sample-efficiency.

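Motivation and Objectives

Motivate learning from diverse achieved goals when the target goal distribution is unknown.
Combine maximum entropy with multi-goal reinforcement learning to reduce the replay bias towards the behavior policies.
Derive a safe surrogate objective that lower-bounds the original entropy-regularized objective.
Provide a practical optimization framework (MEP) that maximizes policy return while increasing goal diversity.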
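
The lower-bound relationship named in the objectives above can be sketched as follows. This is a reconstruction from the description in this review, not a quotation of the paper, and the notation is an assumption: $\tau^g$ is a goal trajectory, $p(\tau^g)$ its probability under the replay buffer, $R(\tau^g)$ its cumulative reward (assumed non-negative), $q$ a proposal distribution, and $Z$ its normalizing constant.

```latex
\begin{align}
\eta^{\mathcal{H}}(\theta) &= \mathbb{E}_{p}\!\left[\log\frac{1}{p(\tau^g)}\, R(\tau^g) \,\middle|\, \theta\right], \\
\eta^{\mathcal{L}}(\theta) &= Z\,\mathbb{E}_{q}\!\left[R(\tau^g) \,\middle|\, \theta\right],
\qquad q(\tau^g) = \frac{1}{Z}\, p(\tau^g)\bigl(1 - p(\tau^g)\bigr), \\
\eta^{\mathcal{L}}(\theta) &= \mathbb{E}_{p}\!\left[\bigl(1 - p(\tau^g)\bigr) R(\tau^g) \,\middle|\, \theta\right]
\le \eta^{\mathcal{H}}(\theta),
\quad\text{since } \log\frac{1}{x} \ge 1 - x \text{ for } 0 < x \le 1 .
\end{align}
```

Under these assumptions, maximizing the expected return under $q$ pushes up a lower bound of the weighted-entropy objective, which is why sampling replay trajectories from $q$ rather than uniformly is a reasonable optimization strategy.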

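Proposed Method

Define a reward-weighted entropy objective for multi-goal reinforcement learning, in which each trajectory is weighted by its cumulative reward.
Derive a safe surrogate objective that serves as a lower bound and stabilizes optimization.
Introduce the Maximum Entropy-based Prioritization (MEP) framework, built on a density model of goal trajectories.
Model p(tau^g) with a latent variable model (a Gaussian mixture model) and use the complementary density to construct the proposal distribution q(tau^g).
Combine off-policy methods (DDPG with or without HER) with MEP prioritization to replay diverse goals during optimization.
Provide an algorithm (MEP) and demonstrate improvements on the multi-goal robotic tasks of OpenAI Gym.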
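
The prioritization step described above can be illustrated with a minimal sketch. It assumes that a density estimate for each goal trajectory in the replay buffer is already available (for example from a fitted Gaussian mixture model) and turns those estimates into replay probabilities through the complementary density and a rank-based normalization. The function names `mep_priorities` and `sample_replay_indices`, the buffer-wide normalization, and the rank-based scheme are illustrative assumptions, not the authors' implementation.

```python
import random


def mep_priorities(densities):
    """Map per-trajectory density estimates to replay probabilities.

    Assumption: `densities` holds non-negative density estimates for the
    goal trajectories in the buffer, produced by some density model.
    """
    total = sum(densities)
    if total <= 0:
        raise ValueError("densities must contain at least one positive value")
    # Normalize over the buffer so every value lies in [0, 1].
    p = [d / total for d in densities]
    # Complementary density: rare goal trajectories get large values.
    complementary = [1.0 - pi for pi in p]
    # Rank 1 goes to the smallest complementary density (most common goals).
    # Ties are broken by buffer position because sorted() is stable.
    order = sorted(range(len(p)), key=lambda i: complementary[i])
    ranks = [0] * len(p)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    rank_sum = sum(ranks)
    return [r / rank_sum for r in ranks]


def sample_replay_indices(densities, batch_size, rng=None):
    """Draw buffer indices with probability given by the MEP-style priorities."""
    rng = rng or random.Random()
    probs = mep_priorities(densities)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


if __name__ == "__main__":
    # Three hypothetical trajectories; the last one has the rarest goals.
    example_densities = [0.5, 0.3, 0.2]
    print(mep_priorities(example_densities))
    print(sample_replay_indices(example_densities, batch_size=4, rng=random.Random(0)))
```

Using ranks instead of raw complementary densities keeps the replay probabilities insensitive to the scale of the density model, at the cost of discarding how large the gaps between trajectories are.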

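Experimental Results

Research Questions

RQ1: Does introducing a goal-entropy term through MEP improve off-policy multi-goal reinforcement learning methods (DDPG, DDPG+HER)?
RQ2: Does MEP improve sample efficiency and performance on robotic manipulation tasks?
RQ3: How does MEP affect the entropy of the achieved-goal distribution during training?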

Key Findings

Method	Push - Success	Push - Time	Pick & Place - Success	Pick & Place - Time	Slide - Success	Slide - Time	Egg - Success	Egg - Time	Block - Success	Block - Time	Pen - Success	Pen - Time
DDPG	99.90%	5.52h	39.34%	5.61h	75.67%	5.47h	-	-	-	-	-	-
DDPG+PER	99.94%	30.66h	67.19%	25.73h	66.33%	25.85h	-	-	-	-	-	-
DDPG+MEP	99.96%	6.76h	76.02%	6.92h	76.77%	6.66h	-	-	-	-	-	-
DDPG+HER	-	-	-	-	-	-	76.19%	7.33h	20.32%	8.47h	27.28%	7.55h
DDPG+HER+PER	-	-	-	-	-	-	75.46%	79.86h	18.95%	80.72h	27.74%	81.17h
DDPG+HER+MEP	-	-	-	-	-	-	81.30%	17.00h	25.00%	19.88h	31.88%	25.36h

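Compared with the baselines, MEP speeds up convergence and improves final performance on the six robotic tasks.
On some tasks, MEP learns faster and raises the success rate by up to 36.68 percentage points (Pick & Place: 39.34% with DDPG versus 76.02% with DDPG+MEP).
Compared with PER, MEP needs far less training time and reaches a higher success rate on every task in the table.
Sample efficiency improves by about a factor of two on average across all environments.
The entropy of the achieved-goal distribution increases during MEP training, which confirms the intended effect.
On the robotic arm tasks (Push, Pick & Place, Slide), DDPG+MEP takes about 1.2 times the training time of the DDPG baseline, whereas DDPG+PER takes about 5 times, which highlights MEP's computational advantage.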
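
The gains and training-time ratios quoted above can be rechecked directly from the table. The short script below is only a convenience for that arithmetic; the dictionary `RESULTS` and the helper `compare_to_baseline` are hypothetical names, and the numbers are copied from the DDPG, DDPG+PER, and DDPG+MEP rows.

```python
# Success rate (%) and training time (hours), copied from the results table.
RESULTS = {
    "Push": {
        "DDPG": (99.90, 5.52),
        "DDPG+PER": (99.94, 30.66),
        "DDPG+MEP": (99.96, 6.76),
    },
    "Pick & Place": {
        "DDPG": (39.34, 5.61),
        "DDPG+PER": (67.19, 25.73),
        "DDPG+MEP": (76.02, 6.92),
    },
    "Slide": {
        "DDPG": (75.67, 5.47),
        "DDPG+PER": (66.33, 25.85),
        "DDPG+MEP": (76.77, 6.66),
    },
}


def compare_to_baseline(task, method, baseline="DDPG"):
    """Return (success gain in percentage points, training-time ratio)."""
    success, hours = RESULTS[task][method]
    base_success, base_hours = RESULTS[task][baseline]
    return success - base_success, hours / base_hours


if __name__ == "__main__":
    for task in RESULTS:
        for method in ("DDPG+PER", "DDPG+MEP"):
            gain, ratio = compare_to_baseline(task, method)
            print(f"{task:13s} {method:9s} gain={gain:+6.2f} pp  time={ratio:.2f}x baseline")
```

Reporting the gain in percentage points and the time as a ratio keeps the two quantities comparable across tasks whose baseline success rates differ widely.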


This review was generated by AI and reviewed by a human editor.