QUICK REVIEW

[Paper Review] VIME: Variational Information Maximizing Exploration

Rein Houthooft, Xi Chen | arXiv (Cornell University) | May 31, 2016

Reinforcement Learning in Robotics | References: 39 | Cited by: 376

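One-Sentence Summary

Proposes VIME, a curiosity-driven exploration strategy for continuous control that uses variational inference in Bayesian neural networks to maximize information gain about the environment dynamics, outperforming heuristic exploration methods.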

ABSTRACT

Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.

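Motivation and Objectives

Address the exploration problem in high-dimensional, continuous reinforcement learning environments.
Guide exploration by maximizing information gain about the environment dynamics.
Compute the intrinsic reward using variational inference in Bayesian neural networks.
Demonstrate effectiveness across multiple RL algorithms and tasks, including tasks with sparse rewards.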

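Proposed Method

Formalizes curiosity as the mutual information between the next state and the dynamics model parameters, conditioned on the history.
Uses a Bayesian neural network as the dynamics model, with the posterior updated approximately via variational Bayes.
Defines the intrinsic reward as the information gain term η D_KL[q(θ; φ_{t+1}) || q(θ; φ_t)].
Implements a practical SGVB (Bayes by Backprop) training procedure with a fully factorized Gaussian posterior over θ.
Periodically updates the posterior using a replay pool, which stabilizes learning and makes computing the intrinsic reward efficient.
Combines VIME with standard RL algorithms (e.g., TRPO, REINFORCE, ERWR) to improve exploration in continuous control tasks.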
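
As a rough illustration of the information gain term above: when the posterior over θ is a fully factorized Gaussian, the KL divergence between the updated and previous posteriors has a closed form that sums over parameter dimensions. The sketch below assumes both posteriors are already available as mean and standard deviation vectors. The function names `kl_diag_gaussians` and `intrinsic_reward` are hypothetical and chosen for illustration; this is not the authors' implementation, and it does not cover how the updated posterior itself is obtained.

```python
import numpy as np


def kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old):
    """KL[ N(mu_new, diag(sigma_new^2)) || N(mu_old, diag(sigma_old^2)) ].

    All arguments are 1-D arrays of equal length; sigmas must be positive.
    Per dimension, the closed form is
        log(sigma_old / sigma_new)
        + (sigma_new^2 + (mu_new - mu_old)^2) / (2 * sigma_old^2)
        - 1/2
    and the total KL is the sum over dimensions.
    """
    mu_new = np.asarray(mu_new, dtype=float)
    sigma_new = np.asarray(sigma_new, dtype=float)
    mu_old = np.asarray(mu_old, dtype=float)
    sigma_old = np.asarray(sigma_old, dtype=float)
    if np.any(sigma_new <= 0) or np.any(sigma_old <= 0):
        raise ValueError("standard deviations must be positive")

    per_dim = (
        np.log(sigma_old / sigma_new)
        + (sigma_new ** 2 + (mu_new - mu_old) ** 2) / (2.0 * sigma_old ** 2)
        - 0.5
    )
    return float(np.sum(per_dim))


def intrinsic_reward(mu_new, sigma_new, mu_old, sigma_old, eta):
    """Hypothetical helper: the eta-scaled KL between updated and previous posterior."""
    return eta * kl_diag_gaussians(mu_new, sigma_new, mu_old, sigma_old)


if __name__ == "__main__":
    # Illustrative (made-up) posterior parameters for a 3-parameter model.
    mu_old, sigma_old = np.zeros(3), np.ones(3)
    mu_new, sigma_new = np.array([0.2, -0.1, 0.0]), np.array([0.9, 1.0, 0.8])
    print(intrinsic_reward(mu_new, sigma_new, mu_old, sigma_old, eta=0.1))
```

A transition that barely changes the posterior yields a KL near zero and therefore almost no bonus, while a transition that shifts the means or shrinks the variances yields a larger one.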

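Experimental Results

Research Questions

RQ1: Does VIME improve exploration and final performance on continuous control tasks with sparse rewards?
RQ2: Is VIME also effective with underlying RL algorithms other than TRPO?
RQ3: How does the exploration parameter η affect the balance between exploration and exploitation?
RQ4: Can a variational Bayesian dynamics model scale to high-dimensional continuous control without discretizing the state-action space?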

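Key Findings

VIME significantly outperforms naive exploration strategies on several sparse-reward continuous control tasks (e.g., MountainCar, CartPoleSwingup, HalfCheetah).
Paired with TRPO, REINFORCE, or ERWR, VIME improves performance across multiple domains.
The method makes learning possible on challenging sparse-reward tasks, including the hierarchical SwimmerGather task.
State visitation patterns under VIME are more spread out than under Gaussian noise, indicating systematic exploration.
MountainCar is solved effectively across a broad range of η values under different algorithms, indicating that the exploration signal is robust.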
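
To make the role of η concrete, the sketch below adds an η-scaled information gain bonus to a sparse extrinsic reward sequence, following the paper's formulation of the modified reward as the extrinsic reward plus η times the KL term. The helper name `augment_rewards` and all numbers are made up for illustration; they are not results from the paper.

```python
import numpy as np


def augment_rewards(extrinsic, kl_values, eta):
    """Return extrinsic + eta * kl_values, element-wise per time step.

    Hypothetical helper: `kl_values` stands in for the per-step information
    gain estimates, which are assumed to be computed elsewhere.
    """
    extrinsic = np.asarray(extrinsic, dtype=float)
    kl_values = np.asarray(kl_values, dtype=float)
    if extrinsic.shape != kl_values.shape:
        raise ValueError("extrinsic and kl_values must have the same shape")
    return extrinsic + eta * kl_values


if __name__ == "__main__":
    # Sparse task: reward only on the final step (illustrative values).
    extrinsic = [0.0, 0.0, 0.0, 1.0]
    kl_values = [0.4, 0.2, 0.1, 0.05]
    for eta in (0.0, 0.01, 0.1, 1.0):
        print(eta, augment_rewards(extrinsic, kl_values, eta))
```

With η = 0 the agent sees only the sparse extrinsic signal. As η grows, the bonus supplies a learning signal on steps where the extrinsic reward is zero, and for large η it can dominate the task reward, which is the exploration versus exploitation trade-off raised in RQ3.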


This review was generated by AI and reviewed by human editors.