QUICK REVIEW

[Paper Review] Fast active learning for pure exploration in reinforcement learning

Pierre Ménard, Omar Darwiche Domingues|Repositori digital de la UPF (Universitat Pompeu Fabra)|Jul 27, 2020

Reinforcement Learning in Robotics | References: 28 | Cited by: 29

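One-sentence summary

This paper proposes BPI-UCBVI, a new algorithm for best-policy identification in episodic Markov decision processes with sparse rewards. By using 1/n exploration bonuses and a refined analysis of the stopping time, the algorithm attains an optimal sample complexity of Õ(SAH³ log(1/δ)/ε²), removing the suboptimal dependence on the horizon H and the state-space size S found in earlier methods.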

ABSTRACT

Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback, in the beginning, can be completely absent, and the agents may first choose to devote all their effort to exploring efficiently. The exploration remains a challenge while it has been addressed with many hand-tuned heuristics with different levels of generality on one side, and a few theoretically-backed exploration strategies on the other. Many of them are incarnated by intrinsic motivation and in particular exploration bonuses. A common rule of thumb for exploration bonuses is to use a $1/\sqrt{n}$ bonus that is added to the empirical estimates of the reward, where $n$ is the number of times this particular state (or a state-action pair) was visited. We show that, surprisingly, for a pure-exploration objective of reward-free exploration, bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon $H$. Furthermore, we show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting, which is another pure-exploration objective, where the environment provides rewards but the agent is not penalized for its behavior during the exploration phase.

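Research motivation and objectives

Address the challenge of efficient pure exploration in reinforcement learning when reward feedback is sparse or absent.
Improve the sample complexity of best-policy identification (BPI) by removing suboptimal dependence on the horizon H and the state-space size S.
Show that 1/n exploration bonuses outperform the standard 1/√n rule in both the reward-free and best-policy identification settings.
Provide a theoretically grounded, data-dependent policy selection mechanism with optimal dependence on δ, S, A, and ε.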

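Proposed method

Proposes BPI-UCBVI, an algorithm based on episodic UCBVI that uses a data-dependent policy selection rule.
Introduces 1/n exploration bonuses in place of the standard 1/√n, showing that faster learning rates are achievable in pure-exploration settings.
Presents a new upper bound on the simple regret of UCBVI-type algorithms, enabling a tighter analysis of the stopping time.
Uses a novel analysis based on KL divergence and variance bounds to control the estimation error in the empirical MDP.
Applies the variational formula for the KL divergence to derive concentration inequalities for differences in policy values.
Derives a new auxiliary inequality (Lemma 13) that bounds the growth of τ in terms of logarithmic and polynomial terms, yielding a tighter sample complexity bound.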
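To make the contrast between the two bonus shapes concrete, the following minimal sketch tabulates how quickly each one decays with the visit count n. It is an illustration only, not the bonus used by BPI-UCBVI: the function names, the scaling constant `c`, and the clamping of n to at least 1 are all assumptions made for this example, and the bonuses in the paper carry additional horizon-dependent and confidence-dependent factors that are omitted here.

```python
import math


def sqrt_bonus(n: int, c: float = 1.0) -> float:
    """Conventional bonus c / sqrt(n). `c` is a hypothetical scaling constant."""
    # Clamp n to at least 1 so that unvisited pairs do not divide by zero.
    return c / math.sqrt(max(n, 1))


def inverse_bonus(n: int, c: float = 1.0) -> float:
    """Faster-decaying bonus c / n, the shape highlighted in the summary."""
    return c / max(n, 1)


def bonus_table(counts):
    """Return (n, c/sqrt(n), c/n) triples for each visit count in `counts`."""
    return [(n, sqrt_bonus(n), inverse_bonus(n)) for n in counts]


if __name__ == "__main__":
    for n, b_sqrt, b_inv in bonus_table([1, 10, 100, 1000]):
        print(f"n={n:5d}  1/sqrt(n)={b_sqrt:.4f}  1/n={b_inv:.4f}")
```

For any n of at least 1, the 1/n bonus is no larger than the 1/√n bonus, so the optimism added to well-visited state-action pairs shrinks much sooner.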

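Experimental results

Research questions

RQ1: Do 1/n exploration bonuses outperform 1/√n bonuses in pure-exploration reinforcement learning settings?
RQ2: Does a tighter analysis of the stopping time improve the sample complexity of best-policy identification?
RQ3: With access to a forward model, can the dependence on the horizon H in BPI be reduced from H⁴ to H³?
RQ4: With access only to a forward model and no oracle access, is it possible to achieve optimal dependence on δ, S, A, and ε in BPI?
RQ5: In BPI, does a data-dependent policy selection rule outperform uniform random selection?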

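Main findings

BPI-UCBVI achieves a sample complexity of Õ(SAH³ log(1/δ)/ε²), which, by the lower bound of Dann and Brunskill (2015), is optimal in S, A, ε, and δ.
Compared with prior methods, the algorithm reduces the horizon dependence from H⁴ to H³, a factor-H improvement in sample complexity.
Using 1/n bonuses instead of 1/√n yields faster learning rates and tighter regret bounds in both reward-free exploration and best-policy identification.
The proposed upper bound on the simple regret of UCBVI-type algorithms removes the extra factor of S present in RF-UCRL, giving optimal dependence on the state-space size.
The analysis shows that finer KL-divergence inequalities give a tighter bound on the stopping time, which improves the dependence on δ.
The auxiliary inequality (Lemma 13) gives tighter control of τ in terms of logarithmic and polynomial terms, which is essential for deriving the final sample complexity bound.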
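The scaling stated in the findings above can be checked with a small calculation. The sketch below evaluates only the leading term S·A·H^p·log(1/δ)/ε²; the Õ notation hides constants and polylogarithmic factors, so the returned number is a scaling indicator and not an episode count. The function name `leading_term` and the `horizon_power` parameter are hypothetical, introduced here purely for illustration.

```python
import math


def leading_term(S: int, A: int, H: int, eps: float, delta: float,
                 horizon_power: int = 3) -> float:
    """Leading term S*A*H^p*log(1/delta)/eps^2 of the sample complexity bound.

    Constants and polylog factors hidden by the O-tilde are deliberately ignored.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return S * A * H ** horizon_power * math.log(1.0 / delta) / eps ** 2


if __name__ == "__main__":
    S, A, H, eps, delta = 10, 4, 20, 0.1, 0.05
    h3 = leading_term(S, A, H, eps, delta, horizon_power=3)
    h4 = leading_term(S, A, H, eps, delta, horizon_power=4)
    # The ratio equals H: the factor-H improvement described above.
    print(f"H^4 term / H^3 term = {h4 / h3:.1f}")
```

Because the term scales as 1/ε², halving ε multiplies it by four, while moving from an H⁴ to an H³ dependence divides it by H.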


This review was generated by AI and reviewed by a human editor.