QUICK REVIEW

[Paper Review] Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions

Rui Wang, Joel Lehman|arXiv (Cornell University)|Mar 19, 2020

Reinforcement Learning in Robotics | References: 61 | Cited by: 43

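One-Sentence Summary

Enhanced POET builds on POET by adding a domain-general novelty measure, a more efficient goal-switching heuristic, a CPPN-based environment encoding, and a new measure of open-ended progress, yielding a stronger demonstration of open-ended reinforcement learning.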

ABSTRACT

Creating open-ended algorithms, which generate their own never-ending stream of novel and appropriately challenging learning opportunities, could help to automate and accelerate progress in machine learning. A recent step in this direction is the Paired Open-Ended Trailblazer (POET), an algorithm that generates and solves its own challenges, and allows solutions to goal-switch between challenges to avoid local optima. However, the original POET was unable to demonstrate its full creative potential because of limitations of the algorithm itself and because of external issues including a limited problem space and lack of a universal progress measure. Importantly, both limitations pose impediments not only for POET, but for the pursuit of open-endedness in general. Here we introduce and empirically validate two new innovations to the original algorithm, as well as two external innovations designed to help elucidate its full potential. Together, these four advances enable the most open-ended algorithmic demonstration to date. The algorithmic innovations are (1) a domain-general measure of how meaningfully novel new challenges are, enabling the system to potentially create and solve interesting challenges endlessly, and (2) an efficient heuristic for determining when agents should goal-switch from one problem to another (helping open-ended search better scale). Outside the algorithm itself, to enable a more definitive demonstration of open-endedness, we introduce (3) a novel, more flexible way to encode environmental challenges, and (4) a generic measure of the extent to which a system continues to exhibit open-ended innovation. Enhanced POET produces a diverse range of sophisticated behaviors that solve a wide range of environmental challenges, many of which cannot be solved through other means.

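Research Motivation and Goals

Advance open-ended reinforcement learning by creating and solving an unbounded sequence of learning challenges.
Develop a domain-general novelty measure that can guide environment invention beyond hand-designed encodings.
Make transferring solutions between environments more efficient, so that open-ended search remains scalable.
Introduce a more expressive environment encoding that can generate diverse and complex environments.
Propose and validate a quantitative measure of open-ended progress (ANNECS) for tracking continued innovation.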

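Proposed Method

Introduces PATA-EC (Performance of All Transferred Agents Environment Characterization), which quantifies the novelty of an environment from how all agents perform in it.
Replaces the domain-specific novelty measure with a domain-general distance: agent scores in each environment are clipped and rank-normalized, and environments are compared by Euclidean distance.
Accepts a transfer only if the incoming agent beats the maximum of the incumbent's five most recent scores, which reduces noise-driven transfers and computation.
Adopts a CPPN-based environment encoding (evolved with NEAT) that generates more complex and diverse landscapes than hand-designed obstacles.
Uses ANNECS (the accumulated number of novel environments created and solved) to measure continued open-ended progress over a run.
Demonstrates Enhanced POET in a 2D bipedal-walker domain with CPPN-encoded obstacle courses, evaluating diversity, transfer efficiency, and open-endedness metrics.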
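
The PATA-EC steps listed above (score every agent in an environment, normalize the scores, compare environments by Euclidean distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the clipping bounds passed by the caller, the use of average ranks for ties, and the k-nearest-neighbour averaging in `novelty` are all assumptions made for illustration.

```python
import math


def rank_normalize(scores, low, high):
    """Clip raw scores, replace them with ranks, and scale ranks to [-0.5, 0.5].

    `low` and `high` are hypothetical clipping bounds chosen by the caller.
    Tied scores share the average of their ranks (an assumption of this sketch).
    """
    clipped = [min(max(s, low), high) for s in scores]
    n = len(clipped)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: clipped[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < n and clipped[order[j + 1]] == clipped[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) - 0.5 for r in ranks]


def ec_distance(ec_a, ec_b):
    """Euclidean distance between two environment characterizations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ec_a, ec_b)))


def novelty(candidate_ec, other_ecs, k=5):
    """Mean distance from the candidate to its k nearest characterizations.

    The value k=5 is a placeholder, not a setting taken from the paper.
    """
    dists = sorted(ec_distance(candidate_ec, ec) for ec in other_ecs)
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else float("inf")
```

Because only the ordering of agents enters the characterization, the comparison does not depend on how a particular domain scales its rewards, which is what makes the measure domain-general.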

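Experimental Results

Research Questions

RQ1: Can a domain-general novelty measure (PATA-EC) effectively guide the creation of meaningful and diverse environments across domains?
RQ2: Does the improved transfer mechanism maintain or improve solution discovery while reducing computation?
RQ3: Can a CPPN-based environment encoding produce richer and more diverse open-ended environments than a hand-designed encoding?
RQ4: Does the ANNECS measure reliably reflect Enhanced POET's continued open-ended progress?
RQ5: In a more expressive domain, to what extent can Enhanced POET demonstrate open-ended learning beyond the original POET?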

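Key Findings

PATA-EC provides domain-general environment novelty and, in the test domain, reaches diversity comparable to the hand-designed novelty measure, at roughly 82% more computation measured in evolution strategies (ES) steps.
The improved transfer strategy reduces computation to about 79.7% of the original POET's cost while preserving diversity and problem-solving ability.
The CPPN-based environment encoding produces diverse obstacle configurations, and the resulting environments are qualitatively richer than hand-coded ones.
Enhanced POET with the CPPN encoding shows deep, hierarchically nested environment phylogenies, indicating open-ended exploration.
Control experiments indicate that the environments POET solves require its self-generated curriculum: both direct optimization and a hand-designed curriculum fall short of POET's implicit curriculum.
ANNECS increases over time, indicating that new environments continue to be created and solved throughout the run.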
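
The transfer rule and the ANNECS counter referred to in these findings can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's code: the function names and the `(iteration, env_id, kind)` event format are hypothetical, and the sketch simply counts every created environment that is later solved, without modelling any novelty or admission checks applied when an environment is created.

```python
def transfer_threshold(incumbent_scores, window=5):
    """Maximum of the incumbent's most recent `window` scores."""
    recent = list(incumbent_scores)[-window:]
    # With no history there is nothing to beat (a choice made in this sketch).
    return max(recent) if recent else float("-inf")


def accept_transfer(candidate_score, incumbent_scores, window=5):
    """Accept a transfer only if the candidate strictly beats the threshold."""
    return candidate_score > transfer_threshold(incumbent_scores, window)


def annecs_curve(events):
    """Running count of environments that were created and later solved.

    `events` is a hypothetical log of (iteration, env_id, kind) tuples,
    where kind is either "created" or "solved".
    """
    created, solved, curve = set(), set(), []
    for iteration, env_id, kind in events:
        if kind == "created":
            created.add(env_id)
        elif kind == "solved" and env_id in created:
            solved.add(env_id)  # a set, so repeat solves are counted once
        curve.append((iteration, len(solved)))
    return curve
```

Comparing against the maximum of several recent scores, rather than only the latest one, makes a single unlucky incumbent evaluation less likely to let a weaker agent take over. A curve from `annecs_curve` that keeps rising corresponds to the continued progress described in the last finding.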


This review was generated by AI and checked by a human editor.