QUICK REVIEW

[Paper Explained] Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning

Shakir Mohamed, Danilo Jimenez Rezende|arXiv (Cornell University)|Sep 29, 2015

Gaussian Processes and Bayesian Inference | References: 4 | Cited by: 99

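ONE-SENTENCE SUMMARY

This paper proposes a novel framework for intrinsically motivated reinforcement learning that uses variational information maximisation to discover informative state representations and intrinsic rewards. By optimising a variational lower bound on the mutual information between observations and latent representations, the method lets agents explore efficiently in sparse-reward environments and shows better sample efficiency and performance than prior methods on complex control tasks.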

ABSTRACT

The mutual information is a core statistical quantity that has applications in all areas of machine learning, whether this is in training of density models over multiple data modalities, in maximising the efficiency of noisy transmission channels, or when learning behaviour policies for exploration by artificial agents. Most learning algorithms that involve optimisation of the mutual information rely on the Blahut-Arimoto algorithm, an enumerative algorithm with exponential complexity that is not suitable for modern machine learning applications. This paper provides a new approach for scalable optimisation of the mutual information by merging techniques from variational inference and deep learning. We develop our approach by focusing on the problem of intrinsically-motivated learning, where the mutual information forms the definition of a well-known internal drive known as empowerment. Using a variational lower bound on the mutual information, combined with convolutional networks for handling visual input streams, we develop a stochastic optimisation algorithm that allows for scalable information maximisation and empowerment-based reasoning directly from pixels to actions.
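
For readers unfamiliar with the Blahut-Arimoto algorithm named in the abstract, the following is a minimal sketch of its textbook form for a small discrete channel. It is not code from the paper: the function name `blahut_arimoto`, the `channel[x][y]` table layout, and the tolerance settings are assumptions made for illustration. Every iteration sums over all inputs and outputs, which hints at why the enumeration becomes impractical when the inputs are long action sequences or the outputs are images.

```python
import math


def blahut_arimoto(channel, tol=1e-10, max_iter=10_000):
    """Estimate the capacity (in nats) of a discrete memoryless channel.

    channel[x][y] holds p(y | x); each row is assumed to sum to 1.
    Returns (capacity estimate, maximising input distribution).
    """
    n_x, n_y = len(channel), len(channel[0])
    p = [1.0 / n_x] * n_x  # start from a uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        # Output marginal q(y) under the current input distribution.
        q_y = [sum(p[x] * channel[x][y] for x in range(n_x)) for y in range(n_y)]
        # d[x] = exp(KL(p(y | x) || q(y))), computed by full enumeration over y.
        d = []
        for x in range(n_x):
            kl = sum(
                channel[x][y] * math.log(channel[x][y] / q_y[y])
                for y in range(n_y)
                if channel[x][y] > 0.0
            )
            d.append(math.exp(kl))
        z = sum(p[x] * d[x] for x in range(n_x))
        # Standard lower and upper bounds on the capacity at this iterate.
        lower, upper = math.log(z), math.log(max(d))
        # Multiplicative update of the input distribution.
        p = [p[x] * d[x] / z for x in range(n_x)]
        if upper - lower < tol:
            break
    return lower, p


# Binary symmetric channel with flip probability 0.1 (illustrative values).
capacity_nats, p_opt = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])
print(capacity_nats / math.log(2))  # roughly 0.531 bits, i.e. 1 - H(0.1)
```

The binary symmetric channel is used because its capacity has a closed form, which makes the sketch easy to check by hand.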

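RESEARCH MOTIVATION AND GOALS

Address the challenge of efficient exploration in reinforcement learning with sparse or delayed rewards.
Develop a method that automatically discovers informative state representations without requiring dense reward signals.
Learn intrinsic curiosity through mutual information maximisation to improve sample efficiency.
Unify representation learning and intrinsic motivation in a single, end-to-end differentiable framework.
Enable agents to explore complex environments by maximising the information they gain about the environment.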

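PROPOSED METHOD

The method uses a variational lower bound to approximate the mutual information between observations and latent representations.
A stochastic policy network is trained to maximise the variational lower bound, which encourages the agent to explore states that yield high information gain.
A recognition model infers latent representations from observations, while a generative model predicts future observations from latent states.
The intrinsic reward is derived from the generative model's prediction error and measures how surprising or informative a state is.
The framework is trained end to end with stochastic gradient descent, with the policy network and the representation networks optimised jointly.
Because the intrinsic reward is learned from data through mutual information maximisation, the method avoids hand-designed curiosity signals.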
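
The summary does not spell out the variational lower bound it refers to. One standard form, assumed here for illustration, is I(x; y) >= H(x) + E_{p(x,y)}[log q(x | y)], which holds for any approximating distribution q and is tight when q equals the true posterior p(x | y). The sketch below checks this numerically on a small discrete joint distribution. The variables x and y are generic placeholders, and the helper names (`mutual_information`, `variational_bound`, `exact_posterior`) and the table values are hypothetical, not taken from the paper.

```python
import math


def mutual_information(joint):
    """Exact I(x; y) in nats for a joint probability table joint[x][y]."""
    n_x, n_y = len(joint), len(joint[0])
    p_x = [sum(joint[x]) for x in range(n_x)]
    p_y = [sum(joint[x][y] for x in range(n_x)) for y in range(n_y)]
    return sum(
        joint[x][y] * math.log(joint[x][y] / (p_x[x] * p_y[y]))
        for x in range(n_x)
        for y in range(n_y)
        if joint[x][y] > 0.0
    )


def variational_bound(joint, q):
    """H(x) + E_p[log q(x | y)], where q[y][x] approximates p(x | y)."""
    n_x, n_y = len(joint), len(joint[0])
    p_x = [sum(joint[x]) for x in range(n_x)]
    entropy_x = -sum(px * math.log(px) for px in p_x if px > 0.0)
    expected_log_q = sum(
        joint[x][y] * math.log(q[y][x])
        for x in range(n_x)
        for y in range(n_y)
        if joint[x][y] > 0.0
    )
    return entropy_x + expected_log_q


def exact_posterior(joint):
    """True posterior p(x | y), returned as a table indexed [y][x]."""
    n_x, n_y = len(joint), len(joint[0])
    p_y = [sum(joint[x][y] for x in range(n_x)) for y in range(n_y)]
    return [[joint[x][y] / p_y[y] for x in range(n_x)] for y in range(n_y)]


# Illustrative joint distribution and a deliberately crude approximation q.
joint = [[0.4, 0.1], [0.1, 0.4]]
crude_q = [[0.6, 0.4], [0.4, 0.6]]

print(mutual_information(joint))                         # exact value
print(variational_bound(joint, crude_q))                 # never exceeds the exact value
print(variational_bound(joint, exact_posterior(joint)))  # matches the exact value
```

In a learned setting, q would be a parametric model such as a neural network, and improving q tightens the bound; the tabular version here only demonstrates the inequality itself.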

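EXPERIMENTAL RESULTS

Research questions

RQ1: How can an intrinsic motivation mechanism be designed so that an agent explores informative states without relying on dense reward design?
RQ2: Can variational information maximisation improve the sample efficiency of reinforcement learning in sparse-reward environments?
RQ3: To what extent do the learned representations improve exploration compared with random or curiosity-based baselines?
RQ4: How does the mutual information objective differ from other intrinsic curiosity objectives in learning speed and final performance?
RQ5: Can the framework generalise across diverse control tasks with minimal hyperparameter tuning?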

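Key findings

The proposed method achieves state-of-the-art performance on several continuous control benchmarks, including Ant and HalfCheetah, with markedly improved sample efficiency.
Compared with baseline curiosity methods, agents trained with the variational information maximisation objective explore more diverse and more informative states.
The framework performs robustly across multiple environments without task-specific reward engineering.
Ablation experiments show that mutual information maximisation is critical to performance: removing the information maximisation component significantly degrades learning.
On the Atari suite and in MuJoCo environments, the method outperforms existing intrinsic curiosity models in both final return and learning speed.
Qualitative analysis confirms that the learned representations are disentangled and semantically meaningful.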

This explanation was generated by AI and reviewed by human editors.