QUICK REVIEW

[Paper Review] Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Anirudh Goyal, Shagun Sodhani | arXiv (Cornell University) | Jun 25, 2019

Reinforcement Learning in Robotics | References: 37 | Cited by: 23

One-Sentence Summary

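The paper proposes a decentralized reinforcement learning framework in which low-level behavioral primitives compete autonomously for the right to act, based on how much information each one needs, and an information-theoretic mechanism selects the most relevant primitive for each state. By restricting each primitive's access to information and encouraging specialization through competition, the method achieves better generalization and transfer than hierarchical and flat policies, without a centralized meta-controller.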

ABSTRACT

Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.

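Motivation and Objectives

Address the generalization bottleneck in hierarchical reinforcement learning caused by a centralized meta-policy that must understand the entire state space.
Enable flexible, plug-and-play skill transfer by removing the dependence on a single high-level controller.
Promote natural specialization and competition among low-level primitives through information-theoretic regularization.
Improve transfer learning performance in unseen or complex environments through decentralized primitive selection.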

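Proposed Method

Each primitive policy is trained with a variational information bottleneck objective that limits its access to information about the current state.
Primitives compete based on the amount of state information they request: the more information a primitive requests, the more likely it is to be selected.
The system uses a differentiable, end-to-end training scheme, so primitives can learn efficiently to encode the relevant state features.
The architecture is factorized: primitives are trained independently and are selected dynamically at inference time through information-based competition.
The method relies on a GRU encoder to process sequential observations and produce the state representation used for decision-making.
The final policy is a competitive ensemble in which only the most informative primitive acts, with no explicit meta-policy.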
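
The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each primitive's encoder outputs a diagonal Gaussian (a mean and a log standard deviation per latent dimension), that the "information requested" is measured as the KL divergence from that Gaussian to a standard normal prior, and that selection weights come from a softmax over those KL values. The function names `kl_to_standard_normal`, `selection_weights`, and `select_primitive` are hypothetical.

```python
import math


def kl_to_standard_normal(mean, log_std):
    """KL( N(mean, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    Assumption: the encoder output is a diagonal Gaussian and the prior is
    a standard normal; this KL serves as the "information requested".
    """
    total = 0.0
    for m, ls in zip(mean, log_std):
        var = math.exp(2.0 * ls)
        # Per-dimension KL: 0.5 * (var + m^2 - 1) - log(sigma)
        total += 0.5 * (var + m * m - 1.0) - ls
    return total


def selection_weights(kls):
    """Softmax over per-primitive KL values (assumed selection rule)."""
    top = max(kls)  # subtract the max for numerical stability
    exps = [math.exp(k - top) for k in kls]
    norm = sum(exps)
    return [e / norm for e in exps]


def select_primitive(encodings):
    """Return (winner index, KL per primitive, softmax weights).

    `encodings` is a list of (mean, log_std) pairs, one per primitive.
    The primitive with the largest KL, i.e. the one requesting the most
    information about the current state, is the one that acts.
    """
    kls = [kl_to_standard_normal(mean, log_std) for mean, log_std in encodings]
    weights = selection_weights(kls)
    winner = max(range(len(kls)), key=kls.__getitem__)
    return winner, kls, weights


if __name__ == "__main__":
    # Three hypothetical primitives with 2-dimensional latent codes.
    encodings = [
        ([0.0, 0.0], [0.0, 0.0]),     # matches the prior: requests nothing
        ([2.0, -1.0], [-1.0, -1.0]),  # far from the prior: requests a lot
        ([0.5, 0.0], [0.0, 0.0]),     # slightly off the prior
    ]
    winner, kls, weights = select_primitive(encodings)
    print("winner:", winner)
    print("KL per primitive:", kls)
    print("selection weights:", weights)
```

Because the information bottleneck penalizes every primitive for the KL it incurs, a primitive only "bids" high in states where the extra information pays off in reward, which is what makes the competition lead to specialization rather than one primitive always winning.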

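Experimental Results

Research Questions

RQ1: Can a decentralized ensemble of information-constrained primitives generalize better than a hierarchical policy with a centralized meta-policy?
RQ2: How does information-theoretic competition among primitives lead to natural specialization and improved transfer learning performance?
RQ3: To what extent can primitives be recombined or transferred to new environments without retraining?
RQ4: Does removing the high-level controller improve robustness and adaptability in unseen environments?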

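Key Findings

The proposed method generalizes better than flat and hierarchical policies across diverse environments, including the four-rooms gridworld and Ant maze tasks.
The model achieves strong transfer performance, recombining primitives in a plug-and-play fashion to generalize to larger or previously unseen environments.
Primitives naturally specialize in different regions of the state space, as shown by their selective activation on specific environment features (such as boxes, fences, and balls).
Because there is no centralized meta-controller, primitives can be transferred and recombined seamlessly, which markedly improves modularity and adaptability.
The information-driven competition mechanism yields effective, dynamic selection of the active primitive without explicit supervision.
In the Ant maze environment, the method generalizes to 3 to 10 goal positions, demonstrating robustness and scalability.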


This review was generated by AI and reviewed by a human editor.