QUICK REVIEW

[Paper Review] Reward-Free Exploration for Reinforcement Learning

Chi Jin, Akshay Krishnamurthy|arXiv (Cornell University)|Feb 7, 2020

Topic: Reinforcement Learning in Robotics | References: 21 | Cited by: 26

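One-Sentence Summary

This paper proposes a reward-free reinforcement learning framework in which the agent first explores an MDP without any reward signal, collecting trajectories that enable efficient planning for arbitrary future reward functions. With a single exploration phase, the proposed algorithm finds $\epsilon$-suboptimal policies for all reward functions using $\tilde{O}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes, which is near-optimal.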

ABSTRACT

Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new "reward-free RL" framework. In the exploration phase, the agent first collects trajectories from an MDP $\mathcal{M}$ without a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies for $\mathcal{M}$ under a collection of given reward functions. This framework is particularly suitable when there are many reward functions of interest, or when the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration and returns $\epsilon$-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that visit each "significant" state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.

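Research Motivation and Goals

Address the sample inefficiency of reinforcement learning when many reward functions must be optimized.
Decouple exploration from reward specification so that planning for arbitrary reward functions can happen after the fact.
Develop a provably efficient algorithm that collects a single dataset during exploration, sufficient for planning under any reward function.
Establish theoretical bounds on the sample complexity of reward-free exploration to characterize its fundamental limits.
Provide a framework that works with any black-box planning algorithm, improving flexibility and practicality.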

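Proposed Method

Introduces the reward-free RL paradigm: during the exploration phase, the agent collects trajectories from an MDP $\mathcal{M}$ without access to a reward function.
Designs an exploration algorithm that visits each significant state with probability proportional to the maximum probability of visiting it under any policy.
Generates the exploratory policies by calling a standard RL algorithm as a subroutine; black-box planners such as value iteration or natural policy gradient are reserved for the planning phase, as the abstract states.
Builds a dataset during exploration from which any subsequent planning algorithm can compute an $\epsilon$-suboptimal policy for an arbitrary reward function.
Uses a novel inner-product analysis to show that the exploration policies cover all relevant state-action pairs uniformly enough to support generalization.
Instantiates the planning phase with standard batch RL solvers, keeping the framework compatible with existing algorithms.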
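The split between a one-time dataset and repeated planning can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: it assumes a tabular, finite-horizon MDP, a dataset of hypothetical `(h, s, a, s_next)` tuples, and plain value iteration on the empirical model as the black-box planner. The function names `estimate_transitions` and `plan`, and the toy two-state dynamics, are made up for this example.

```python
from collections import defaultdict


def estimate_transitions(dataset, H):
    """Empirical model: p_hat[h][(s, a)] maps s_next to its observed frequency."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(H)]
    for h, s, a, s_next in dataset:
        counts[h][(s, a)][s_next] += 1
    p_hat = []
    for h in range(H):
        step = {}
        for sa, nxt in counts[h].items():
            total = sum(nxt.values())
            step[sa] = {s2: c / total for s2, c in nxt.items()}
        p_hat.append(step)
    return p_hat


def plan(p_hat, reward, S, A, H):
    """Finite-horizon value iteration on the empirical model.

    reward(h, s, a) returns a float. Assumption of this sketch: pairs (s, a)
    never seen in the dataset get zero continuation value.
    """
    V = [0.0] * S  # value after the final step
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        new_V = [0.0] * S
        for s in range(S):
            best_q, best_a = float("-inf"), 0
            for a in range(A):
                nxt = p_hat[h].get((s, a), {})
                q = reward(h, s, a) + sum(p * V[s2] for s2, p in nxt.items())
                if q > best_q:
                    best_q, best_a = q, a
            new_V[s] = best_q
            policy[h][s] = best_a
        V = new_V
    return policy, V


# Toy reward-free dataset: action 0 stays put, action 1 switches state.
dataset = [(h, s, a, s if a == 0 else 1 - s)
           for h in range(2) for s in range(2) for a in range(2)]
p_hat = estimate_transitions(dataset, H=2)

# The same dataset is reused for two different reward functions.
policy_a, values_a = plan(p_hat, lambda h, s, a: float(s == 1), S=2, A=2, H=2)
policy_b, values_b = plan(p_hat, lambda h, s, a: float(s == 0), S=2, A=2, H=2)
```

Note that `plan` never touches the environment: the two reward functions lead to opposite first-step actions, yet both plans come from the same `p_hat`. Swapping value iteration for another planner would leave `estimate_transitions` unchanged.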

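Experimental Results

Research Questions

RQ1: Can a single exploration phase be designed so that an arbitrary number of future reward functions can be planned for efficiently, with no additional data collection?
RQ2: What is the fundamental sample complexity of achieving sufficient coverage for reward-free RL in tabular MDPs?
RQ3: How does the sample complexity of reward-free exploration compare with that of standard RL with a pre-specified reward?
RQ4: Can near-optimal sample complexity be achieved while decoupling exploration from planning?
RQ5: What are the theoretical limits on coverage quality in the reward-free exploration setting?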

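Main Findings

The proposed algorithm explores for $\tilde{O}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes, which is near-optimal.
Using only the pre-collected dataset, with no further interaction with the environment, the algorithm is guaranteed to compute an $\epsilon$-suboptimal policy for any reward function.
A nearly matching lower bound of $\Omega(S^2AH^2/\epsilon^2)$ is established, showing that the sample complexity is near-optimal in the reward-free setting.
The exploration phase is conceptually simple, and the planning phase is compatible with any black-box planner, such as value iteration or natural policy gradient.
The framework exposes an inherent cost of coverage: because coverage must hold for every reward function, the reward-free sample complexity is larger than that of standard RL with a pre-specified reward by a factor of $S$.
The analysis shows that the exploration policies visit every significant state with probability proportional to that state's maximum visitation probability, so significant states are adequately covered even in environments with hard-to-reach states.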
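The notion of a "significant" state rests on one quantity: the largest probability with which any policy can reach a given state at a given step. The sketch below computes that quantity by backward dynamic programming and then thresholds it. It is only an illustration under strong assumptions: the transition table `P` is taken as known (the actual agent must learn it from interaction), the start state is fixed, and the threshold `delta`, the function names, and the three-state example are all hypothetical.

```python
def max_visitation(P, S, A, H, start=0):
    """Return m[h][s] = max over policies of Pr(state at step h is s | start).

    P[h][s][a] is a list of next-state probabilities, assumed known here.
    """
    m = [[0.0] * S for _ in range(H)]
    for h_target in range(H):
        for s_target in range(S):
            # Terminal "reward": 1 if we sit in s_target at step h_target.
            V = [1.0 if s == s_target else 0.0 for s in range(S)]
            # Backward DP: best achievable probability of hitting the target.
            for h in reversed(range(h_target)):
                V = [
                    max(sum(P[h][s][a][s2] * V[s2] for s2 in range(S))
                        for a in range(A))
                    for s in range(S)
                ]
            m[h_target][s_target] = V[start]
    return m


def significant_states(m, delta):
    """Pairs (h, s) that some policy reaches with probability at least delta."""
    return [(h, s) for h, row in enumerate(m)
            for s, p in enumerate(row) if p >= delta]


# Toy MDP: from state 0, action 0 stays; action 1 reaches state 1 with
# probability 0.99 and state 2 with probability 0.01. States 1, 2 are absorbing.
stay = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P_step = [
    [stay[0], [0.0, 0.99, 0.01]],
    [stay[1], stay[1]],
    [stay[2], stay[2]],
]
P = [P_step, P_step]

m = max_visitation(P, S=3, A=2, H=2)
sig = significant_states(m, delta=0.05)
```

In this toy example state 2 is nearly unreachable at step 1 under every policy, so it falls below the threshold and is excluded from `sig`. That matches the intuition in the findings: states no policy can reach with meaningful probability contribute little to the value of any policy, so exploration effort is spent on the rest.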


This review was generated by AI and reviewed by a human editor.