QUICK REVIEW

[Paper Review] Contextual Decision Processes with Low Bellman Rank are PAC-Learnable

Nan Jiang, Akshay Krishnamurthy|arXiv (Cornell University)|Oct 29, 2016

Neural Networks and Applications | Cited by 153

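One-Sentence Summary

The paper introduces Contextual Decision Processes (CDPs) and a low Bellman rank condition, then presents the Olive algorithm, which carries a PAC guarantee for learning a near-optimal policy with sample complexity independent of the size of the context space.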

ABSTRACT

This paper studies systematic exploration for reinforcement learning with rich observations and function approximation. We introduce a new model called contextual decision processes, that unifies and generalizes most prior settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in these processes and is naturally small for many well-studied reinforcement learning settings. Our second contribution is a new reinforcement learning algorithm that engages in systematic exploration to learn contextual decision processes with low Bellman rank. Our algorithm provably learns near-optimal behavior with a number of samples that is polynomial in all relevant parameters but independent of the number of unique observations. The approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for reinforcement learning with function approximation.

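Motivation and Objectives

Motivate systematic exploration in reinforcement learning with rich observations and function approximation, under the unified CDP framework.
Define the Bellman rank as a complexity measure that captures the structure in a CDP that makes exploration tractable.
Propose the Olive algorithm, which combines optimistic exploration with Bellman-error-based elimination.
Prove a PAC guarantee: the sample complexity of learning a near-optimal policy is polynomial in M, H, and K (the Bellman rank, the horizon, and the number of actions), up to logarithmic factors, and independent of the size of the context space.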

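Proposed Method

Formalize the CDP as a general RL model that, through its notion of context, subsumes both MDPs and POMDPs.
Introduce the Bellman factorization and the Bellman rank to quantify exploitable structure.
Define the average Bellman error and the Bellman equations for the CDP setting.
Develop Olive (Optimism Led Iterative Value-function Elimination), which iteratively eliminates invalid candidate value functions based on their Bellman error.
Give a PAC guarantee showing polynomial sample complexity, poly(M, H, K, 1/ε, log N, log(1/δ)), where N is the size of the value-function class, independent of the size of the context space.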
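The elimination loop described above can be illustrated with a minimal Python sketch. This is not the paper's pseudocode: the two-level toy environment (`step`), the three candidate Q-tables, the thresholds `stop_tol` and `phi`, and the per-round sample size `n` are all assumptions chosen for illustration, and the paper's actual constants and sample sizes are not reproduced. Because the toy problem has a single start context, each candidate's predicted value is read off directly instead of being estimated from samples.

```python
import random

H, K = 2, 2  # horizon and number of actions of the toy problem (assumed)

def step(x, a):
    """Hypothetical deterministic toy dynamics: returns (reward, next_context)."""
    if x == 0:                       # level 1: one start context, zero reward
        return 0.0, 1 + a            # the action selects context 1 or 2
    rewards = {(1, 0): 0.5, (1, 1): 0.0, (2, 0): 0.0, (2, 1): 1.0}
    return rewards[(x, a)], None     # level 2 ends the episode

def greedy(f, x):
    """Action chosen by the policy pi_f induced by candidate Q-function f."""
    return max(range(K), key=lambda a: f[(x, a)])

def value(f, x):
    """f's predicted value at context x (0 once the episode has ended)."""
    return 0.0 if x is None else f[(x, greedy(f, x))]

def collect(f_roll, h, n, rng, uniform_at_h):
    """Roll in with pi_{f_roll} for h-1 steps, then record n level-h transitions."""
    data = []
    for _ in range(n):
        x = 0
        for _level in range(h - 1):
            _, x = step(x, greedy(f_roll, x))
        a = rng.randrange(K) if uniform_at_h else greedy(f_roll, x)
        r, x_next = step(x, a)
        data.append((x, a, r, x_next))
    return data

def bellman_error(f, data, weighted):
    """Empirical average Bellman error of f on the recorded transitions."""
    total = 0.0
    for x, a, r, x_next in data:
        if a != greedy(f, x):
            continue                   # indicator 1[a == pi_f(x)] is zero
        weight = K if weighted else 1  # importance weight for uniform actions
        total += weight * (f[(x, a)] - r - value(f, x_next))
    return total / len(data)

def olive_sketch(F, eps=0.2, n=400, seed=0):
    """Optimistic selection plus Bellman-error elimination (illustrative)."""
    rng = random.Random(seed)
    stop_tol, phi = eps / 2, eps / (4 * H)  # illustrative thresholds, not the paper's
    survivors = list(F)
    for _ in range(len(F)):                 # each round removes at least f_t
        if not survivors:
            break
        # Optimism: pick the surviving candidate with the largest predicted value.
        f_t = max(survivors, key=lambda f: value(f, 0))
        # Estimate f_t's own Bellman error at every level under its own policy.
        errs = [bellman_error(f_t, collect(f_t, h, n, rng, False), False)
                for h in range(1, H + 1)]
        if sum(errs) <= stop_tol:
            return f_t                      # predictions are consistent: stop
        h_t = 1 + max(range(H), key=lambda i: errs[i])
        # One batch with a uniform action at level h_t is reused for every f.
        data = collect(f_t, h_t, n, rng, True)
        survivors = [f for f in survivors
                     if abs(bellman_error(f, data, True)) <= phi]
    return None

# Candidate class: the true Q-function plus two over-optimistic variants.
Q_STAR = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.5,
          (1, 1): 0.0, (2, 0): 0.0, (2, 1): 1.0}
F_BAD1 = {**Q_STAR, (0, 0): 2.0}               # wrong at level 1
F_BAD2 = {**Q_STAR, (0, 1): 1.5, (2, 1): 1.5}  # wrong at level 2
F = [F_BAD1, F_BAD2, Q_STAR]
```

Calling `olive_sketch(F)` first selects the most optimistic candidate, detects a large Bellman error at one level, and eliminates every candidate that violates the Bellman equation on the roll-in distribution of that candidate's policy. The true Q-function has zero residual on every transition of this deterministic toy problem, so it is never eliminated.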

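Results

Research Questions

RQ1: In CDPs with rich observations, can a near-optimal policy be learned efficiently using function approximation?
RQ2: How does the Bellman rank quantify the tractability of exploration across diverse RL settings?
RQ3: Can a single algorithm provide PAC guarantees for MDPs, POMDPs, and related models with low Bellman rank?
RQ4: What role does combining Bellman error minimization with optimistic exploration play in achieving sample efficiency?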

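Key Findings

CDPs with low Bellman rank admit tractable, sample-efficient learning.
Olive achieves a PAC guarantee: it finds an ε-suboptimal policy using Õ(poly(M, H, K, log(N/δ), 1/ε)) trajectories.
Olive's sample complexity is independent of the size of the context space.
The Bellman rank framework covers tabular MDPs, low-rank MDPs, reactive POMDPs, PSRs, and even LQR (with caveats about continuous actions).
The approach combines Bellman error minimization with optimistic exploration, offering new insights into exploration under function approximation.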
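To make the Bellman rank claim concrete, the block below writes out the average Bellman error, the factorization that defines the rank, and the short argument for the tabular case. The notation is an assumption of this review: $\pi_f$ is the greedy policy of a candidate value function $f$, $x_h, a_h, r_h$ are the context, action, and reward at level $h$, and $\mathcal{S}$ is the state space of a tabular MDP.

```latex
% Average Bellman error of f at level h, rolling in with policy \pi:
\mathcal{E}(f, \pi, h) =
  \mathbb{E}\bigl[\, f(x_h, a_h) - r_h - f(x_{h+1}, a_{h+1})
  \;\big|\; a_{1:h-1} \sim \pi,\ a_{h:h+1} \sim \pi_f \,\bigr]

% Bellman factorization: the rank M is the smallest dimension for which
\mathcal{E}(f, \pi_{f'}, h) = \bigl\langle \nu_h(f'),\, \xi_h(f) \bigr\rangle,
  \qquad \nu_h(f'),\ \xi_h(f) \in \mathbb{R}^{M}

% Tabular MDP: condition on the state reached at level h.
\mathcal{E}(f, \pi_{f'}, h) =
  \sum_{s \in \mathcal{S}}
  \underbrace{\Pr\bigl[x_h = s \mid a_{1:h-1} \sim \pi_{f'}\bigr]}_{[\nu_h(f')]_s}
  \cdot
  \underbrace{\mathbb{E}\bigl[\, f(s, a_h) - r_h - f(x_{h+1}, a_{h+1})
  \;\big|\; x_h = s,\ a_{h:h+1} \sim \pi_f \,\bigr]}_{[\xi_h(f)]_s}
```

The first factor depends only on the roll-in policy and the second only on the function being evaluated, so in the tabular case $M \le |\mathcal{S}|$.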

This review was generated by AI and reviewed by human editors.