QUICK REVIEW

[Paper Review] Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability

Dibya Ghosh, Jad Rahme|arXiv (Cornell University)|Jul 13, 2021

Reinforcement Learning in Robotics | Cited by 26

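ONE-SENTENCE SUMMARY

The paper recasts generalization in RL as solving an epistemic POMDP induced by epistemic uncertainty about the underlying MDP, proposes the ensemble-based LEEP method, and shows improved test-time generalization on Procgen.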

ABSTRACT

Generalization is a central challenge for the deployment of reinforcement learning (RL) systems in the real world. In this paper, we show that the sequential structure of the RL problem necessitates new approaches to generalization beyond the well-studied techniques used in supervised learning. While supervised learning methods can generalize effectively without explicitly accounting for epistemic uncertainty, we show that, perhaps surprisingly, this is not the case in RL. We show that generalization to unseen test conditions from a limited number of training conditions induces implicit partial observability, effectively turning even fully-observed MDPs into POMDPs. Informed by this observation, we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP. We demonstrate the failure modes of algorithms that do not appropriately handle this partial observability, and suggest a simple ensemble-based technique for approximately solving the partially observed problem. Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves significant gains in generalization over current methods on the Procgen benchmark suite.

MOTIVATION AND OBJECTIVES

Explain why generalization in RL is harder than in supervised learning, given the sequential structure of the problem and epistemic uncertainty.
Formalize generalization from a limited set of training contexts to unseen test contexts as an epistemic POMDP induced by posterior uncertainty over the true MDP.
Propose a practical algorithm (LEEP) that trains an ensemble of policies and combines them to maximize test-time return.
Analyze failure modes of standard MDP-based RL methods that ignore implicit partial observability.
Demonstrate empirical gains on the Procgen benchmark suite using the proposed approach.

PROPOSED METHOD

Introduce the epistemic POMDP: at the start of each episode an MDP is sampled from the posterior over MDPs given the training data, and the agent acts in that single MDP for the whole episode without observing which one was sampled, which creates implicit partial observability.
Define the epistemic POMDP state as a pair (MDP, s) and show that the expected test-time return equals the return in this POMDP when the prior is well specified.
Derive theoretical bounds linking the epistemic POMDP return to the performance of a set of policies across posterior MDPs.
Propose an empirical epistemic POMDP built from a finite number of posterior samples and decompose it into per-MDP policies that are later merged.
Present the LEEP algorithm, which uses bootstrap samples to approximate the posterior and trains an ensemble of policies with a KL-divergence-based coupling term.
Show how the final policy is constructed by aggregating the ensemble policies into a single policy intended to maximize test-time performance.
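
The bootstrap, KL coupling, and aggregation steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not spell out the aggregation rule, so the sketch assumes an optimistic combination (elementwise maximum over the members' action probabilities, renormalized), and the function names, the `alpha` weight, and the sampling-with-replacement scheme are all hypothetical.

```python
import math
import random


def linked_policy(member_probs):
    """Combine ensemble members into one action distribution.

    member_probs: list of per-member action distributions for one state.
    Assumed rule: elementwise max over members, then renormalize.
    """
    n_actions = len(member_probs[0])
    optimistic = [max(p[a] for p in member_probs) for a in range(n_actions)]
    total = sum(optimistic)
    return [x / total for x in optimistic]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def coupling_penalties(member_probs, alpha=1.0):
    """Per-member penalty alpha * KL(member || linked policy).

    alpha is a hypothetical penalty weight; in training, a term like this
    would be added to each member's usual RL loss.
    """
    linked = linked_policy(member_probs)
    return [alpha * kl_divergence(p, linked) for p in member_probs]


def bootstrap_contexts(context_ids, n_members, seed=0):
    """Draw one resampled set of training contexts per ensemble member.

    Sampling with replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    k = len(context_ids)
    return [[rng.choice(context_ids) for _ in range(k)]
            for _ in range(n_members)]


# Two members that disagree strongly about the best of three actions.
members = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
combined = linked_policy(members)
penalties = coupling_penalties(members)
```

In this example the combined policy keeps substantial mass on both actions that some member favors, rather than averaging the disagreement away, and the KL term is larger the further a member drifts from the combined policy.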

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

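RQ1: How does epistemic uncertainty about the MDP affect generalization in RL to unseen contexts?
RQ2: Can generalization be understood and improved by solving an epistemic POMDP rather than a single MDP?
RQ3: With a limited number of training contexts, can ensemble-based methods such as LEEP yield better test-time return?
RQ4: What failure modes do standard MDP-centric RL methods exhibit in the face of implicit partial observability?
RQ5: How can practical posterior approximations (for example, bootstrapping) be used to achieve Bayes-optimal behavior in contextual RL?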

KEY FINDINGS

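Generalization in RL is hampered by implicit partial observability, which arises from epistemic uncertainty about the true MDP when only a limited set of training contexts is available.
The epistemic POMDP framework equates maximizing test-time performance with Bayes-optimal behavior under the posterior over MDPs.
Deterministic, MDP-based policies can perform poorly under test-time uncertainty; Bayes-optimal behavior is often stochastic or non-Markovian.
A simple ensemble-based method (LEEP) can approximate the Bayes-optimal policy that maximizes test-time return.
On Procgen tasks, LEEP achieves significant gains in test-time performance over standard RL baselines.
Theoretical results relate overall POMDP performance to the performance of each per-MDP policy and to how well those policies can be imitated by a single policy, which guides practical algorithm design.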
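
The finding that deterministic Markovian policies can lose to stochastic or non-Markovian ones is easy to check on a toy problem. The sketch below is a hypothetical setup chosen for illustration, not one of the paper's experiments: there is a single observed state, the unknown MDP is identified by which action is "correct", an episode ends with reward 1 as soon as the correct action is taken, and the agent has a fixed number of attempts. All function names are assumptions.

```python
def success_prob_deterministic(posterior, action):
    """Deterministic Markov policy: the same action at every step.

    It succeeds only if the sampled MDP's correct action is `action`,
    no matter how many attempts are allowed.
    """
    return posterior[action]


def success_prob_stochastic(posterior, policy, horizon):
    """Memoryless stochastic policy: sample from `policy` at every step.

    In MDP k each attempt succeeds with probability policy[k], so the
    chance of at least one success in `horizon` attempts is
    1 - (1 - policy[k]) ** horizon, averaged over the posterior.
    """
    return sum(p_k * (1.0 - (1.0 - policy[k]) ** horizon)
               for k, p_k in enumerate(posterior))


def success_prob_elimination(posterior, horizon):
    """Non-Markovian policy: try actions in decreasing posterior order
    and never repeat a failed action."""
    order = sorted(range(len(posterior)), key=lambda k: -posterior[k])
    return sum(posterior[k] for k in order[:horizon])


posterior = [0.5, 0.3, 0.2]   # belief over which action is correct
horizon = 3                   # attempts allowed per episode

det = success_prob_deterministic(posterior, action=0)
sto = success_prob_stochastic(posterior, policy=posterior, horizon=horizon)
elim = success_prob_elimination(posterior, horizon)
```

With these numbers the best deterministic policy succeeds half the time, the stochastic policy that samples from the posterior does noticeably better, and the memory-based elimination policy always succeeds within three attempts.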

This review was generated by AI and checked by human editors.