QUICK REVIEW

[Paper Review] Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

Simon S. Du, Sham M. Kakade | arXiv (Cornell University) | Oct 7, 2019

Reinforcement Learning in Robotics | References: 47 | Cited by: 32

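One-Sentence Summary

This paper shows that a good representation alone is not sufficient for sample-efficient reinforcement learning: even when the representation is nearly optimal, it establishes exponential sample-complexity lower bounds for value-based, model-based, and policy-based methods. The key contribution is showing that a representation must pass strict, hard thresholds (stated in terms of its dimensionality) before efficient learning becomes possible, which exposes fundamental statistical limits that go beyond approximation error.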

ABSTRACT

Modern deep learning methods provide effective means to learn good representations. However, is a good representation itself sufficient for sample efficient reinforcement learning? This question has largely been studied only with respect to (worst-case) approximation error, in the more classical approximate dynamic programming literature. With regards to the statistical viewpoint, this question is largely unexplored, and the extant body of literature mainly focuses on conditions which permit sample efficient reinforcement learning with little understanding of what are necessary conditions for efficient reinforcement learning. This work shows that, from the statistical viewpoint, the situation is far subtler than suggested by the more traditional approximation viewpoint, where the requirements on the representation that suffice for sample efficient RL are even more stringent. Our main results provide sharp thresholds for reinforcement learning methods, showing that there are hard limitations on what constitutes good function approximation (in terms of the dimensionality of the representation), where we focus on natural representational conditions relevant to value-based, model-based, and policy-based learning. These lower bounds highlight that having a good (value-based, model-based, or policy-based) representation in and of itself is insufficient for efficient reinforcement learning, unless the quality of this approximation passes certain hard thresholds. Furthermore, our lower bounds also imply exponential separations on the sample complexity between 1) value-based learning with perfect representation and value-based learning with a good-but-not-perfect representation, 2) value-based learning and policy-based learning, 3) policy-based learning and supervised learning and 4) reinforcement learning and imitation learning.

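Research Motivation and Goals

Examine, from a statistical viewpoint, whether a good representation is sufficient for sample-efficient reinforcement learning.
Identify necessary conditions for sample efficiency in reinforcement learning, going beyond the sufficient conditions that earlier work focused on.
Establish sharp, exponential sample-complexity lower bounds for value-based, model-based, and policy-based RL algorithms that have access to a good representation.
Demonstrate exponential separations between different RL paradigms and related learning settings.
Show that even a near-optimal representation cannot yield efficient learning unless strict dimension and margin requirements are met.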

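Proposed Method

Construct a family of MDPs with a binary-tree structure and exponentially many states, which creates a hard generalization problem.
Use a high-dimensional feature space with a $\tilde{\Theta}(d)$-dimensional representation to exploit the curse of dimensionality.
Use a $\triangle$-separated net on the unit sphere to construct linearly separable optimal policies with margin $\triangle$.
Apply the standard lower bound on the size of a $\triangle$-net to show that there exists an exponentially large set of vectors that are pairwise $\triangle$-separated.
Under Assumption 4.5 (a linear policy with a margin), derive exponential sample-complexity lower bounds for value-based, model-based, and policy-based RL.
Compare sample complexity across settings: perfect vs. good-but-not-perfect representation, value-based vs. policy-based, reinforcement learning vs. imitation learning, and reinforcement learning vs. supervised learning.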
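
The role of the separated net can be illustrated numerically. The sketch below is only an illustration, not the paper's construction: it assumes random sampling as a stand-in for an explicit net, and the names `random_unit_vector`, `min_pairwise_distance`, `dim`, and `count` are hypothetical. It draws unit vectors in a moderately high dimension and measures how far apart the closest pair is, which gives a feel for why many well-separated directions fit on a high-dimensional sphere.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a point uniformly from the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def min_pairwise_distance(vectors):
    """Smallest Euclidean distance between any two distinct vectors."""
    best = float("inf")
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dist = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            )
            best = min(best, dist)
    return best


# Illustrative sizes only (assumptions, not values from the paper).
rng = random.Random(0)
dim, count = 128, 100
points = [random_unit_vector(dim, rng) for _ in range(count)]
separation = min_pairwise_distance(points)
print(f"{count} unit vectors in R^{dim}: min pairwise distance = {separation:.3f}")
```

Random unit vectors in high dimension are nearly orthogonal, so the closest pair stays well separated even without any careful placement. The lower-bound argument relies on the stronger fact that the number of such pairwise-separated vectors can be exponential in the dimension.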

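Experimental Results

Research Questions

RQ1: From a statistical viewpoint, is a good representation sufficient for sample-efficient reinforcement learning?
RQ2: Beyond approximation error, what necessary conditions must a representation satisfy for efficient reinforcement learning?
RQ3: How does sample complexity scale with the planning horizon $H$ in value-based, model-based, and policy-based RL with a good representation?
RQ4: What exponential separations exist between different RL paradigms and learning settings?
RQ5: Can a near-optimal representation still lead to exponential sample complexity in reinforcement learning?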

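Key Findings

Exponential sample-complexity lower bounds are established for value-based, model-based, and policy-based RL methods, even with a good representation.
Sample complexity grows exponentially with the planning horizon $H$, so a good representation alone does not guarantee efficient learning.
In value-based RL, there is an exponential separation between learning with a perfect representation and learning with a good-but-not-perfect representation.
Even when the optimal $Q$-function can be represented perfectly, policy-based learning still requires exponentially more samples than value-based learning.
When $H > 1$, the sample complexity of reinforcement learning is far higher than that of supervised learning, even though the latter is a special case of the former.
There is an exponential separation between reinforcement learning and imitation learning, which indicates that expert demonstrations can reduce sample complexity dramatically.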
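
To give a feel for the exponential dependence on $H$, here is a toy sketch. It is not the paper's hard instance: it assumes a complete binary tree of depth $H$ with a single rewarding leaf, and a learner that gets no usable signal from its features and simply tries leaves in random order. The functions `episodes_to_find_reward` and `average_episodes` are hypothetical names introduced for illustration.

```python
import random


def episodes_to_find_reward(horizon, rng):
    """Episodes used when leaves are tried in random order until the rewarding one is hit.

    Each episode takes `horizon` binary actions and therefore reaches exactly one leaf.
    """
    num_leaves = 2 ** horizon
    rewarding_leaf = rng.randrange(num_leaves)
    order = list(range(num_leaves))
    rng.shuffle(order)
    return order.index(rewarding_leaf) + 1


def average_episodes(horizon, trials, seed=0):
    """Monte Carlo estimate of the mean number of episodes needed."""
    rng = random.Random(seed)
    total = sum(episodes_to_find_reward(horizon, rng) for _ in range(trials))
    return total / trials


for horizon in (2, 4, 6, 8):
    estimate = average_episodes(horizon, trials=2000)
    exact = (2 ** horizon + 1) / 2  # mean of a uniform draw from 1..2^H
    print(f"H={horizon}: simulated mean = {estimate:.1f}, exact mean = {exact:.1f}")
```

In this toy setting the expected number of episodes is $(2^H + 1)/2$, which doubles with each extra step of horizon. The paper's lower bounds are stronger in that they hold even when the learner is given a representation that is good in the stated sense.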

This review was generated by AI and reviewed by human editors.