QUICK REVIEW

[Paper Review] Exploration-Enhanced POLITEX

Yasin Abbasi-Yadkori, Nevena Lazic|arXiv (Cornell University)|Aug 27, 2019

Advanced Bandit Algorithms Research | References: 30 | Cited by: 19

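ONE-SENTENCE SUMMARY

This paper proposes Exploration-Enhanced POLITEX (EE-Politex), a reinforcement learning algorithm that improves regret guarantees in average-cost MDPs with linear function approximation by incorporating a pre-trained, fast-mixing exploration policy. Whereas prior approaches require every policy to explore, EE-Politex uses the exploration policy to generate state coverage for value function estimation via least-squares Monte Carlo (LSMC), achieving sublinear regret without the uniform-exploration assumption.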

ABSTRACT

We study algorithms for average-cost reinforcement learning problems with value function approximation. Our starting point is the recently proposed POLITEX algorithm, a version of policy iteration where the policy produced in each iteration is near-optimal in hindsight for the sum of all past value function estimates. POLITEX has sublinear regret guarantees in uniformly-mixing MDPs when the value estimation error can be controlled, which can be satisfied if all policies sufficiently explore the environment. Unfortunately, this assumption is often unrealistic. Motivated by the rapid growth of interest in developing policies that learn to explore their environment in the lack of rewards (also known as no-reward learning), we replace the previous assumption that all policies explore the environment with that a single, sufficiently exploring policy is available beforehand. The main contribution of the paper is the modification of POLITEX to incorporate such an exploration policy in a way that allows us to obtain a regret guarantee similar to the previous one but without requiring that all policies explore environment. In addition to the novel theoretical guarantees, we demonstrate the benefits of our scheme on environments which are difficult to explore using simple schemes like dithering. While the solution we obtain may not achieve the best possible regret, it is the first result that shows how to control the regret in the presence of function approximation errors on problems where exploration is nontrivial. Our approach can also be seen as a way of reducing the problem of minimizing the regret to learning a good exploration policy. We believe that modular approaches like ours can be highly beneficial in tackling harder control problems.

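MOTIVATION AND OBJECTIVES

Address a limitation of existing Politex variants, which require every policy to explore the state space uniformly in order to control value function estimation error.
Achieve regret minimization in average-cost reinforcement learning with function approximation under a weaker exploration assumption.
Decouple learning the exploration policy from policy optimization, supporting modular design of reinforcement learning systems.
Demonstrate the empirical benefit of explicit exploration in sparse-reward environments (such as sparse-reward CartPole and grid-world MDPs).
Provide theoretical guarantees on value estimation error and regret given a single, pre-existing exploration policy.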

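PROPOSED METHOD

Propose a hybrid data collection scheme: trajectories are sampled with the target policy but initialized from the stationary distribution of a pre-trained exploration policy.
Apply least-squares Monte Carlo (LSMC) to estimate value functions from these mixed on-policy and off-policy trajectories.
Exploit the fast mixing of the exploration policy to ensure sufficient state coverage, so that value estimates remain reliable even when the target policy is greedy.
Modify the Politex algorithm to use value function estimates obtained by LSMC from data generated from the exploration policy's initial state distribution.
Analyze the LSMC estimation error under linear function approximation, showing that it depends on the exploration policy's mixing time and the degree of feature coverage.
Integrate the LSMC estimator into Politex and prove sublinear regret under weaker assumptions than prior work.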
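
The data collection, LSMC regression, and policy update described in this section can be illustrated with a minimal sketch. It is not the authors' implementation. The tabular toy MDP, the function names (rollout, lsmc_q, politex_policy, run_demo), the ridge term reg, the choice to draw the first action from the exploration policy, and the way the average cost is estimated are all assumptions made for illustration. The exponential-weights update over the sum of past value estimates follows the description of POLITEX in the abstract, written here for costs.

```python
import numpy as np

def rollout(P, cost, target, explore, start_dist, horizon, rng):
    """One trajectory: start state from start_dist (standing in for the
    exploration policy's stationary distribution), first action from the
    exploration policy (assumption), then follow the target policy."""
    n_states, n_actions = cost.shape
    s0 = rng.choice(n_states, p=start_dist)
    a0 = rng.choice(n_actions, p=explore[s0])
    s, a, costs = s0, a0, []
    for _ in range(horizon):
        costs.append(cost[s, a])
        s = rng.choice(n_states, p=P[s, a])
        a = rng.choice(n_actions, p=target[s])
    return s0, a0, np.array(costs)

def lsmc_q(P, cost, target, explore, start_dist, features,
           n_rollouts, horizon, rng, reg=1e-3):
    """Least-squares Monte Carlo: regress truncated, centered returns on
    the features of the first state-action pair of each rollout."""
    data = [rollout(P, cost, target, explore, start_dist, horizon, rng)
            for _ in range(n_rollouts)]
    # Average-cost estimate from the steps taken under the target policy
    # (a simplification chosen for this sketch; requires horizon >= 2).
    lam_hat = np.mean([c[1:].mean() for _, _, c in data])
    Phi = np.stack([features[s, a] for s, a, _ in data])
    y = np.array([np.sum(c - lam_hat) for _, _, c in data])
    d = Phi.shape[1]
    w = np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ y)
    return features @ w  # estimated Q, shape (n_states, n_actions)

def politex_policy(q_sum, eta):
    """Softmax policy over the negated sum of past Q estimates (costs)."""
    logits = -eta * q_sum
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def run_demo(seed=0, n_iters=5, eta=1.0):
    """Toy problem (hypothetical): 2 states, 2 actions, uniform
    transitions; action 0 costs 1 and action 1 costs 0 in every state."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = 2, 2
    P = np.full((n_states, n_actions, n_states), 0.5)
    cost = np.array([[1.0, 0.0], [1.0, 0.0]])
    features = np.eye(n_states * n_actions).reshape(n_states, n_actions, -1)
    explore = np.full((n_states, n_actions), 0.5)
    start_dist = np.array([0.5, 0.5])
    q_sum = np.zeros((n_states, n_actions))
    policy = politex_policy(q_sum, eta)
    for _ in range(n_iters):
        q_sum += lsmc_q(P, cost, policy, explore, start_dist, features,
                        n_rollouts=1000, horizon=5, rng=rng)
        policy = politex_policy(q_sum, eta)
    return policy
```

Calling run_demo() returns a 2-by-2 policy matrix whose rows are action distributions. In this toy problem the zero-cost action should end up with most of the probability mass. Because every rollout restarts from start_dist and takes its first action from the exploration policy, each state-action pair keeps receiving regression data even after the target policy has become nearly greedy, which is the role the summary attributes to the exploration policy.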

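EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can sublinear regret be achieved in average-cost MDPs with linear function approximation without requiring every policy to explore?
RQ2: How can a single pre-trained exploration policy be used to improve value function estimation accuracy and reduce regret in model-free reinforcement learning?
RQ3: With linear function approximation, how does using mixed on-policy and off-policy data (from the target policy and the exploration policy) affect value function estimation error?
RQ4: Does explicit exploration significantly improve performance in sparse-reward environments (such as the CartPole swing-up task)?
RQ5: Do Politex's regret guarantees still hold when the uniform-exploration assumption is replaced by a fast-mixing exploration policy?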

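KEY FINDINGS

In uniformly mixing MDPs, EE-Politex achieves a regret bound of Õ(T^{3/4} + ε₀T), matching the earlier Politex guarantee under weaker assumptions.
Under a fast-mixing exploration policy, the error of LSMC value estimation on the mixed data scales as Õ(√(1/m)), enabling reliable estimation without requiring every policy to explore.
In a 2×2 grid world, all methods converge to the optimal policy; as the grid grows, however, Politex without exploration fails to learn, whereas EE-Politex learns successfully.
In the sparse-reward CartPole swing-up environment, standard Politex fails to learn the optimal policy (it remains inactive), whereas EE-Politex uses the exploration policy to learn to balance the pole.
First-visit LSMC estimation with the exploration policy performs poorly as the problem size grows because of insufficient samples, indicating that longer trajectories or multiple visits are needed for stable estimates.
On Atari Ms. Pac-Man, adding the exploration policy did not improve performance, indicating that EE-Politex's advantage is environment-dependent and best suited to high-dimensional, sparse-reward settings.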
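
To see why the Õ(T^{3/4} + ε₀T) bound is called sublinear, it helps to divide it by T. The display below is a sketch under assumed notation: R_T for the regret after T steps, c(x_t, a_t) for the cost incurred at step t, and λ* for the optimal average cost are conventions introduced here, and ε₀ is read as a fixed error term that the summary does not define further.

```latex
\[
R_T \;=\; \sum_{t=1}^{T} \bigl( c(x_t, a_t) - \lambda^{*} \bigr)
\;=\; \tilde{O}\!\left( T^{3/4} + \varepsilon_0 T \right)
\quad\Longrightarrow\quad
\frac{R_T}{T} \;=\; \tilde{O}\!\left( T^{-1/4} + \varepsilon_0 \right).
\]
```

Under this reading, the per-step regret shrinks at rate T^{-1/4} up to logarithmic factors until it reaches a floor of order ε₀, so the regret is truly sublinear only when ε₀ is zero.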

This review was generated by AI and reviewed by human editors.