QUICK REVIEW

[Paper Review] Better Exploration with Optimistic Actor-Critic

Kamil Ciosek, Quan Vuong|White Rose Research Online (University of Leeds, The University of Sheffield, University of York)|Oct 28, 2019

Adversarial Robustness in Machine Learning|Cited by 35

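ONE-SENTENCE SUMMARY

Optimistic Actor-Critic (OAC) introduces an exploration policy that maximizes an upper confidence bound on Q while a KL constraint keeps it close to the target policy, achieving state-of-the-art sample efficiency on Humanoid and strong results across the MuJoCo benchmarks.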

ABSTRACT

Actor-critic methods, a type of model-free Reinforcement Learning, have been successfully applied to challenging tasks in continuous control, often achieving state-of-the-art performance. However, wide-scale adoption of these methods in real-world domains is made difficult by their poor sample efficiency. We address this problem both theoretically and empirically. On the theoretical side, we identify two phenomena preventing efficient exploration in existing state-of-the-art algorithms such as Soft Actor Critic. First, combining a greedy actor update with a pessimistic estimate of the critic leads to the avoidance of actions that the agent does not know about, a phenomenon we call pessimistic underexploration. Second, current algorithms are directionally uninformed, sampling actions with equal probability in opposite directions from the current mean. This is wasteful, since we typically need actions taken along certain directions much more than others. To address both of these phenomena, we introduce a new algorithm, Optimistic Actor Critic, which approximates a lower and upper confidence bound on the state-action value function. This allows us to apply the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation. We evaluate OAC in several challenging continuous control tasks, achieving state-of-the-art sample efficiency.

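MOTIVATION AND OBJECTIVES

Motivate the need for more sample-efficient exploration in actor-critic methods for continuous control.
Identify the mechanisms that hinder exploration, namely pessimistic underexploration and directional uninformedness.
Propose and derive OAC, which performs optimistic exploration under a KL constraint while preserving stability.
Evaluate OAC empirically on MuJoCo tasks and demonstrate improved sample efficiency, most notably on Humanoid.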

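PROPOSED METHOD

Derive an upper confidence bound on Q from bootstrapped critic estimates.
Define an exploration policy that maximizes the upper bound subject to a KL constraint with respect to the target policy.
Compute the exploration policy analytically as a Gaussian whose mean is shifted from the target mean along the gradient of the Q upper bound (a closed-form solution when the target policy is Gaussian).
Update the critic using the lower bound to avoid overestimation, and use target networks for stability.
Train off-policy from a replay buffer filled with actions drawn from the exploration policy; use the target policy at evaluation time.
Provide ablations that isolate the effect of bootstrapped uncertainty and hyperparameter sensitivity.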
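
The method steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes two bootstrapped critic estimates `q1` and `q2`, bounds of the form "mean plus a weighted spread", a diagonal target-policy covariance `var_t`, and a gradient of the upper bound at the target mean that the caller supplies (in practice it would come from automatic differentiation). The function names, the weights `beta_ub` and `beta_lb`, and their default values are hypothetical choices for this sketch. The mean shift is the closed form that maximizes a linearized upper bound under a KL budget `delta` when both policies are Gaussian with the same covariance.

```python
import numpy as np

def confidence_bounds(q1, q2, beta_ub=1.0, beta_lb=-1.0):
    """Lower and upper bounds on Q from two critic estimates.

    Assumed form: mean of the two critics plus beta times their spread.
    beta_ub and beta_lb are illustrative defaults, not tuned values.
    """
    mu_q = 0.5 * (q1 + q2)
    sigma_q = 0.5 * np.abs(q1 - q2)
    q_lb = mu_q + beta_lb * sigma_q
    q_ub = mu_q + beta_ub * sigma_q
    return q_lb, q_ub

def exploration_mean(mu_t, var_t, grad_q_ub, delta):
    """Mean of a Gaussian exploration policy that shares the target covariance.

    mu_t       target-policy mean, shape (action_dim,)
    var_t      diagonal of the target-policy covariance, shape (action_dim,)
    grad_q_ub  gradient of the Q upper bound w.r.t. the action at mu_t
    delta      KL budget between exploration and target policy
    """
    scaled = var_t * grad_q_ub                   # Sigma_T @ g for diagonal Sigma_T
    norm = np.sqrt(np.dot(grad_q_ub, scaled))    # sqrt(g^T Sigma_T g)
    if norm == 0.0:
        return mu_t.copy()                       # no preferred direction: no shift
    return mu_t + np.sqrt(2.0 * delta) / norm * scaled

# Usage with made-up numbers.
q_lb, q_ub = confidence_bounds(np.array([1.0]), np.array([3.0]))
mu_e = exploration_mean(np.zeros(2), np.ones(2), np.array([3.0, 4.0]), delta=0.5)
```

With `beta_lb = -1.0` the lower bound reduces to the element-wise minimum of the two critics, which is the familiar pessimistic target used to limit overestimation. The exploration mean moves only along the covariance-scaled gradient, so exploration is directed rather than symmetric around the target mean.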

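EXPERIMENTAL RESULTS

Research Questions

RQ1: Does optimistic exploration based on an upper confidence bound on the Q-function improve the sample efficiency of actor-critic methods?
RQ2: Does enforcing a KL constraint between the exploration policy and the target policy stabilize off-policy learning while still enabling directed exploration?
RQ3: How does the bootstrapped uncertainty estimate affect performance on continuous control tasks?
RQ4: On standard MuJoCo benchmarks, how does OAC compare with SAC, TD3, and DDPG in sample efficiency and stability?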

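Key Findings

OAC achieves state-of-the-art sample efficiency on the Humanoid task, outperforming SAC.
Using bootstrapped uncertainty to form the upper bound improves performance in challenging domains, Humanoid in particular, and offers an advantage on high-variance tasks.
The exploration policy derived from Equations (6) and (9) of the paper always stays close to the target policy because of the KL constraint, which supports stability.
Exploration based on the upper bound is computationally cheap and yields performance comparable to or better than the baselines across MuJoCo environments.
Despite using a separate exploration policy, OAC is as stable in practice as SAC.
Ablations show that both bootstrapped uncertainty and the optimistic upper bound contribute to performance, and a sensitivity analysis over the KL parameter δ indicates robust behavior.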
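
The role of δ in the findings above can be checked numerically. The sketch below is an illustration under assumptions, not code from the paper: it assumes Gaussian exploration and target policies that share one diagonal covariance, and an exploration mean shifted along the covariance-scaled gradient with step size sqrt(2δ) divided by the covariance-weighted gradient norm. All names and numbers are hypothetical. Under these assumptions the KL divergence between the two policies equals δ, so δ directly sets how far exploration may drift from the target policy.

```python
import numpy as np

def kl_shared_diag_cov(mu_e, mu_t, var):
    """KL(N(mu_e, diag(var)) || N(mu_t, diag(var))).

    With identical covariances the trace and log-determinant terms cancel,
    leaving half the squared Mahalanobis distance between the means.
    """
    diff = mu_e - mu_t
    return 0.5 * float(np.sum(diff * diff / var))

def shifted_mean(mu_t, var, grad, delta):
    """Shift the target mean along var * grad so that the KL equals delta."""
    scaled = var * grad
    return mu_t + np.sqrt(2.0 * delta) * scaled / np.sqrt(np.dot(grad, scaled))

mu_t = np.array([0.1, -0.3, 0.5])    # hypothetical target mean
var = np.array([0.2, 0.5, 1.0])      # hypothetical diagonal covariance
grad = np.array([1.0, -2.0, 0.5])    # hypothetical gradient of the upper bound

for delta in (0.01, 0.1, 1.0):
    mu_e = shifted_mean(mu_t, var, grad, delta)
    print(delta, kl_shared_diag_cov(mu_e, mu_t, var))
```

Each printed KL value matches its δ up to floating-point rounding, which is why a single scalar is enough to trade off directed exploration against staying near the target policy.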

This review was generated by AI and reviewed by human editors.