QUICK REVIEW

[Paper Review] A unified strategy for implementing curiosity and empowerment driven reinforcement learning

Ildefons Magrans de Abril, Ryota Kanai|arXiv (Cornell University)|Jun 18, 2018

Reinforcement Learning in Robotics | References: 7 | Cited by: 24

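One-Sentence Summary

This paper proposes a unified framework that integrates curiosity and empowerment as intrinsic motivations by modeling the information flow between the agent and its environment. Using a shared forward model, the framework derives a homeostatic drive from curiosity and computes empowerment efficiently, aiming at more efficient exploration and control, better sample efficiency, and a broader range of autonomous behavior.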

ABSTRACT

Although there are many approaches to implement intrinsically motivated artificial agents, the combined usage of multiple intrinsic drives remains a relatively unexplored research area. Specifically, we hypothesize that a mechanism capable of quantifying and controlling the evolution of the information flow between the agent and the environment could be the fundamental component for implementing a higher degree of autonomy into artificial intelligent agents. This paper proposes a unified strategy for implementing two semantically orthogonal intrinsic motivations: curiosity and empowerment. Curiosity reward informs the agent about the relevance of a recent agent action, whereas empowerment is implemented as the opposite information flow from the agent to the environment that quantifies the agent's potential of controlling its own future. We show that an additional homeostatic drive is derived from the curiosity reward, which generalizes and enhances the information gain of a classical curious/heterostatic reinforcement learning agent. We show how a shared internal model by curiosity and empowerment facilitates a more efficient training of the empowerment function. Finally, we discuss future directions for further leveraging the interplay between these two intrinsic rewards.

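Research Motivation and Objectives

Address the under-explored problem of combining multiple intrinsic motivations in reinforcement learning agents.
Develop a unified framework that models the information flow between agent and environment as the core mechanism of intrinsic motivation.
Show how curiosity and empowerment can both be derived from the same information-theoretic basis through a shared forward model.
Show that combining curiosity (information gained from the environment) with empowerment (potential control over the environment) improves learning efficiency and behavioral diversity.
Introduce a homeostatic drive, derived from curiosity, that strengthens exploration beyond traditional curiosity-based methods.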

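Proposed Method

Formalize curiosity as the information gain flowing from the environment to the agent, quantified by the prediction error of a forward model.
Define empowerment as the information flow from the agent to the environment, measured by the entropy of the future-state distribution under a deterministic policy.
Use a shared deep neural network as the forward model, predicting the next state observation from a state-action pair, to reduce computational cost.
Derive a homeostatic drive from curiosity by introducing a parameter α that balances exploration against stability, generalizing classical curiosity methods.
Optimize the policy with DDPG, training the agent on the combined intrinsic reward in a three-room navigation environment.
Apply variational inference and information-theoretic principles to approximate curiosity and empowerment efficiently.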
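The prediction-error view of curiosity listed above can be illustrated with a minimal sketch. This is not the paper's implementation: the summary describes a deep neural network forward model, while the sketch below swaps in a hypothetical linear model (`LinearForwardModel`) trained by a simple gradient step, and uses the squared prediction error as the reward. All class names, function names, and the learning rate are assumptions made for illustration.

```python
import numpy as np


class LinearForwardModel:
    """Hypothetical linear stand-in for the forward model: predicts s' from (s, a)."""

    def __init__(self, state_dim, action_dim, lr=0.1):
        # Weight matrix mapping the concatenated (state, action) vector to the next state.
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def predict(self, state, action):
        return self.W @ np.concatenate([state, action])

    def update(self, state, action, next_state):
        # One gradient step on the squared prediction error for this transition.
        x = np.concatenate([state, action])
        error = next_state - self.W @ x
        self.W += self.lr * np.outer(error, x)
        return error


def curiosity_reward(model, state, action, next_state):
    """Squared prediction error of the forward model on one transition (assumed reward form)."""
    error = next_state - model.predict(state, action)
    return float(error @ error)
```

Under these assumptions, a transition the model has not yet learned yields a large reward, and the reward shrinks as `update` is called on similar transitions, which is the behavior a curiosity signal of this kind is meant to have.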

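Experimental Results

Research Questions

RQ1: How can curiosity and empowerment be unified under a single information-theoretic framework in reinforcement learning?
RQ2: Does a shared forward model improve the sample efficiency of learning curiosity and empowerment?
RQ3: Does the derived homeostatic drive strengthen exploration compared with a purely curiosity-driven method?
RQ4: How does the trade-off between the number of future options (empowerment) and control precision affect policy learning?
RQ5: Does the interplay between curiosity and empowerment lead to richer and more diverse autonomous agent behavior?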

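Key Findings

In the three-room environment with random initial positions, the proposed method explores better than a purely curiosity-driven agent.
The homeostatic drive derived from curiosity generalizes and enhances the classical curiosity method, with the parameter α balancing exploration against stability.
The empowerment approximation identifies high-control states, such as positions near doors, where the agent has the most future options.
Sharing the forward model lowers computational cost and makes training of the curiosity and empowerment functions more efficient.
The joint intrinsic-reward framework lets the agent discover behaviors that balance information acquisition against control potential.
DDPG policy optimization with the combined reward produces stable and efficient control policies that maximize both information gain and future control.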
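To make "future options" concrete, the following sketch computes n-step empowerment exactly in a small deterministic grid, where it reduces to the base-2 logarithm of the number of distinct cells reachable within the horizon. This is a toy illustration only: the paper uses an approximation of empowerment, not this exact count, and the grid layout, action set, and function names here are hypothetical.

```python
import math

# Assumed action set: right, left, down, up, stay.
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

# Hypothetical two-room layout; '#' is a wall and the gap in row 2 acts as a door.
GRID = [
    "...#...",
    "...#...",
    ".......",
    "...#...",
    "...#...",
]


def step(grid, pos, action):
    """Deterministic transition: move if the target cell is free, otherwise stay put."""
    r, c = pos[0] + action[0], pos[1] + action[1]
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#":
        return (r, c)
    return pos


def reachable(grid, pos, horizon):
    """Set of cells the agent can occupy after exactly `horizon` actions."""
    frontier = {pos}
    for _ in range(horizon):
        frontier = {step(grid, p, a) for p in frontier for a in ACTIONS}
    return frontier


def empowerment(grid, pos, horizon):
    """Exact n-step empowerment in bits for deterministic dynamics."""
    return math.log2(len(reachable(grid, pos, horizon)))
```

For example, `empowerment(GRID, (0, 0), 2)` can be compared against `empowerment(GRID, (2, 1), 2)` to see that a corner cell offers fewer reachable futures than an open cell. Which cells score highest depends on the layout and the horizon chosen.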


This review was generated by AI and checked by human editors.