QUICK REVIEW

[Paper Review] Drifting Reinforcement Learning: The Blessing of (More) Optimism in Face of Endogenous & Exogenous Dynamics

Wang Chi Cheung, David Simchi‐Levi|arXiv (Cornell University)|Jun 7, 2019

Advanced Bandit Algorithms Research | References: 33 | Cited by: 2

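One-Sentence Summary

This paper proposes the SWUCRL2-CW and BORL algorithms for reinforcement learning in non-stationary MDPs whose rewards and transition probabilities change over time, using a confidence widening technique to preserve optimism under endogenous and exogenous drift. BORL attains the same dynamic regret bound that SWUCRL2-CW achieves with known variation budgets, but without parameter tuning, which addresses the difficulty of optimistic exploration in drifting environments.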

ABSTRACT

We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under temporal drifts, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. This setting captures the endogeneity, exogeneity, uncertainty, and partial feedback in sequential decision-making scenarios, and finds applications in vehicle remarketing and real-time bidding. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Finally, we conduct numerical experiments to show that our proposed algorithms achieve superior empirical performance compared to existing algorithms. Notably, the interplay between endogeneity and exogeneity presents a unique challenge, absent in existing (stationary and non-stationary) stochastic online learning settings, when we apply the conventional Optimism in Face of Uncertainty principle to design algorithms with provably low dynamic regret for RL in drifting MDPs. We overcome the challenge by a novel confidence widening technique that incorporates additional optimism into our learning algorithms to ensure low dynamic regret bounds. To extend our theoretical findings, we apply our framework to inventory control problems, and demonstrate how one can alternatively leverage special structures on the state transition distributions to bypass the difficulty in exploring time-varying environments.
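
The abstract refers to variation budgets and dynamic regret without writing them out. One common way to formalize them is sketched below. The notation is an assumption introduced here for illustration and does not appear in the abstract: r_t and p_t denote the mean reward and transition kernel at time t, B_r and B_p the two budgets, and \rho^*_t the optimal long-run average reward of the stationary MDP defined by (r_t, p_t).

```latex
% Assumed notation: r_t, p_t, B_r, B_p, \rho^*_t are illustrative symbols, not taken from the abstract.
\[
B_r \;=\; \sum_{t=1}^{T-1} \max_{s,a} \bigl| r_{t+1}(s,a) - r_t(s,a) \bigr|,
\qquad
B_p \;=\; \sum_{t=1}^{T-1} \max_{s,a} \bigl\| p_{t+1}(\cdot \mid s,a) - p_t(\cdot \mid s,a) \bigr\|_1
\]
\[
\text{Dyn-Reg}_T \;=\; \sum_{t=1}^{T} \bigl( \rho^*_t - r_t(s_t, a_t) \bigr)
\]
```

Under this reading, a stationary MDP has B_r = B_p = 0, and the dynamic regret compares the learner against a benchmark that is re-optimized at every time step.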

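Motivation and Objectives

Address sequential decision-making in MDPs where endogenous (self-driven) and exogenous (external) temporal drifts occur together.
Design a provably efficient undiscounted reinforcement learning algorithm for settings where the reward and transition distributions evolve over time subject to bounded variation budgets.
Overcome the limitations of the standard Optimism in Face of Uncertainty approach in non-stationary environments by introducing confidence widening, so that dynamic regret stays low.
Design a parameter-free algorithm (BORL) that adapts to unknown variation budgets without prior knowledge of the drift level.
Demonstrate the framework's practical effectiveness through an application to inventory control and real-world scenarios such as real-time bidding and vehicle remarketing.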

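Proposed Method

Proposes SWUCRL2-CW, a sliding-window UCB method that maintains confidence regions for the estimated MDP parameters and widens them to preserve optimism under time-varying dynamics.
Introduces a novel confidence widening technique that explicitly accounts for endogenous and exogenous drift, providing robustness to changes in the distributions over time.
Designs BORL as a meta-algorithm that adaptively tunes the window size and confidence width of SWUCRL2-CW without knowledge of the variation budgets.
Quantifies drift in rewards and transitions with total-variation-style metrics, and defines variation budgets that bound the total amount of change in the environment.
Exploits special structural properties of the state transition distributions (for example, in inventory control) to ease the exploration burden and improve regret bounds.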
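Establishes dynamic regret bounds that are sublinear in the horizon whenever the variation budgets grow sublinearly, and shows that BORL matches the known-budget bound without prior knowledge of the budgets.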
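
The sliding-window and confidence-widening ideas listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the parameters `window`, `eta`, and `delta`, and the Hoeffding-style radius with its constants are all assumptions made for illustration. The only points it tries to show are that estimates use data from the most recent `window` steps, and that a constant `eta` is added on top of the usual statistical radius.

```python
import math
from collections import deque


class SlidingWindowEstimator:
    """Hypothetical sketch: windowed transition estimates with a widened radius."""

    def __init__(self, n_states, window, eta, delta=0.05):
        self.n_states = n_states
        self.eta = eta        # widening term: extra optimism added to the radius
        self.delta = delta    # confidence level used by the statistical radius
        # A bounded deque drops the oldest transition once `window` is reached.
        self.history = deque(maxlen=window)

    def observe(self, s, a, s_next):
        self.history.append((s, a, s_next))

    def estimate(self, s, a):
        """Return (p_hat, radius) for (s, a) using only in-window samples."""
        counts = [0] * self.n_states
        for si, ai, sn in self.history:
            if (si, ai) == (s, a):
                counts[sn] += 1
        n = sum(counts)
        if n == 0:
            # No data in the window: fall back to a uniform guess.
            p_hat = [1.0 / self.n_states] * self.n_states
        else:
            p_hat = [c / n for c in counts]
        # Illustrative L1-style radius; the constants are NOT the paper's.
        base = math.sqrt(2 * self.n_states * math.log(1 / self.delta) / max(n, 1))
        return p_hat, base + self.eta


# Usage: with window=3, only the last three transitions are counted.
est = SlidingWindowEstimator(n_states=2, window=3, eta=0.1)
for s_next in [0, 0, 1, 1, 1]:
    est.observe(0, 0, s_next)
p_hat, radius = est.estimate(0, 0)
```

In this sketch a smaller `window` forgets stale data faster at the cost of a larger statistical radius, while `eta` enlarges the set of plausible models regardless of sample size.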

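Experimental Results

Research Questions

RQ1: How can the Optimism in Face of Uncertainty principle be adapted to MDPs with both endogenous and exogenous temporal drifts so that dynamic regret remains low?
RQ2: Can a parameter-free algorithm achieve the same dynamic regret bound as a known-budget algorithm when the drift level is unknown?
RQ3: How does confidence widening affect regret in non-stationary reinforcement learning environments where rewards and transitions keep evolving?
RQ4: In which structured environments (such as inventory control) can special properties of the state transition dynamics reduce the need for aggressive exploration?
RQ5: How do endogenous and exogenous dynamics jointly shape the design of provably efficient exploration strategies in reinforcement learning?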

Key Findings

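When the variation budgets are known, SWUCRL2-CW achieves a dynamic regret bound of order Õ(D_max (B_r + B_p)^{1/4} T^{3/4}), where B_r and B_p are the reward and transition variation budgets, D_max is the maximum MDP diameter, and T is the horizon, so the bound grows with the fourth root of the total budget.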
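BORL achieves the same dynamic regret bound as SWUCRL2-CW without prior knowledge of the variation budgets, which makes it parameter-free.
Confidence widening improves performance in drifting environments by injecting additional optimism that offsets the effect of distribution drift.
Numerical experiments show that SWUCRL2-CW and BORL outperform existing algorithms in empirical regret and stability under time-varying dynamics.
In structured environments such as inventory control, exploiting special properties of the state transition distributions avoids excessive exploration and improves the regret bounds.
The interplay between endogenous and exogenous dynamics undermines the conventional optimism principle, so new algorithmic techniques such as confidence widening are needed to obtain provable guarantees.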
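
The parameter-free tuning described in the findings above can be sketched as an adversarial bandit choosing among candidate window sizes, one choice per block of time steps. The sketch below uses the standard EXP3 update. The block structure, the callback `run_block`, and the candidate list are hypothetical names introduced here, and `run_block` stands in for running a windowed learner for one block and returning its total reward normalized to [0, 1]. It is an illustration of the bandit-over-learner idea, not the paper's BORL specification.

```python
import math
import random


class Exp3:
    """Standard EXP3 for rewards in [0, 1]."""

    def __init__(self, n_arms, gamma, rng=None):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate in (0, 1]
        self.weights = [1.0] * n_arms
        self.rng = rng or random.Random(0)

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted reward estimate for the arm that was played.
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)


def tune_over_blocks(candidate_windows, n_blocks, run_block, gamma=0.1, seed=0):
    """Pick one window size per block; `run_block` is a hypothetical callback."""
    bandit = Exp3(len(candidate_windows), gamma, random.Random(seed))
    chosen = []
    for _ in range(n_blocks):
        arm, prob = bandit.select()
        reward = run_block(candidate_windows[arm])  # normalized to [0, 1]
        bandit.update(arm, prob, reward)
        chosen.append(candidate_windows[arm])
    return bandit, chosen


# Toy usage: pretend a window of 8 is the only setting that earns reward.
bandit, chosen = tune_over_blocks(
    [2, 8, 32], n_blocks=200, run_block=lambda w: 1.0 if w == 8 else 0.0
)
final_probs = bandit.probabilities()
```

In the toy run the bandit shifts probability mass toward the window size that earns reward, which is the mechanism that removes the need to know the drift level in advance.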

This review was generated by AI and reviewed by a human editor.