QUICK REVIEW

[Paper Digest] Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator

Stephen Tu, Benjamin Recht | arXiv (Cornell University) | Dec 22, 2017

Control Systems and Identification | Cited by 56

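One-Sentence Summary

This paper gives the first finite-time analysis of the LSTD estimator on the LQR problem: it derives how many samples are needed to estimate the value function of a fixed stabilizing policy to within ε-relative error, and it bounds the minimum eigenvalue of the empirical covariance matrix of a fast-mixing process.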

ABSTRACT

Reinforcement learning (RL) has been successfully used to solve many continuous control tasks. Despite its impressive results however, fundamental questions regarding the sample complexity of RL on continuous problems remain open. We study the performance of RL in this setting by considering the behavior of the Least-Squares Temporal Difference (LSTD) estimator on the classic Linear Quadratic Regulator (LQR) problem from optimal control. We give the first finite-time analysis of the number of samples needed to estimate the value function for a fixed static state-feedback policy to within $\varepsilon$-relative error. In the process of deriving our result, we give a general characterization for when the minimum eigenvalue of the empirical covariance matrix formed along the sample path of a fast-mixing stochastic process concentrates above zero, extending a result by Koltchinskii and Mendelson in the independent covariates setting. Finally, we provide experimental evidence indicating that our analysis correctly captures the qualitative behavior of LSTD on several LQR instances.

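Motivation and Objectives

Use LQR as a benchmark to motivate and quantify the sample complexity of model-free reinforcement learning in continuous control.
Analyze the performance of the Least-Squares Temporal Difference (LSTD) estimator for a fixed policy in LQR.
Establish an eigenvalue concentration result for the empirical covariance matrix along fast-mixing trajectories.
Compare model-free LSPI with model-based methods to assess practical data efficiency and robustness.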

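Proposed Method

Analyze LSTD using the linear structure of the LQR value function.
Derive a finite-time sample complexity bound showing that roughly n^2/ε^2 samples suffice to reach ε-relative error.
Give a general eigenvalue concentration bound for sample covariance matrices from fast-mixing processes, extending the result of Koltchinskii and Mendelson.
Specialize the results to the LQR setting with a linear feedback policy and Gaussian disturbances.
Provide an empirical comparison of LSPI with model-based methods to validate the theoretical insights.
Use Lyapunov-based analysis and H∞-norm techniques to characterize fast mixing and spectral properties.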
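
To make the LSTD setup concrete, here is a minimal NumPy sketch. It is not the authors' code. It assumes a discounted cost with factor `gamma`, dynamics x_{t+1} = A x_t + B u_t + w_t with Gaussian noise w_t, a linear policy u_t = K x_t, and features made of the quadratic monomials of the state plus a constant. The names `features`, `lstd_lqr`, and `true_P` are hypothetical.

```python
import numpy as np

def features(x):
    """Quadratic monomials x_i * x_j (i <= j) followed by a constant 1."""
    i, j = np.triu_indices(x.size)
    return np.append(x[i] * x[j], 1.0)

def lstd_lqr(A, B, K, Q, R, gamma=0.9, sigma=1.0, T=20000, seed=0):
    """LSTD estimate of V(x) = x^T P x + c from one closed-loop trajectory.

    Assumed setting (illustrative): u_t = K x_t, w_t ~ N(0, sigma^2 I),
    discounted cost with factor gamma.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = A + B @ K               # closed-loop matrix, assumed stable
    M = Q + K.T @ R @ K         # stage cost is x^T M x under u = K x
    d = n * (n + 1) // 2 + 1
    G = np.zeros((d, d))
    b = np.zeros(d)
    x = np.zeros(n)
    for _ in range(T):
        x_next = L @ x + sigma * rng.standard_normal(n)
        phi, phi_next = features(x), features(x_next)
        G += np.outer(phi, phi - gamma * phi_next)
        b += phi * (x @ M @ x)
        x = x_next
    theta = np.linalg.solve(G, b)
    # The coefficient of x_i * x_j (i < j) equals 2 * P_ij, so place the
    # coefficients in the upper triangle and symmetrize.
    U = np.zeros((n, n))
    U[np.triu_indices(n)] = theta[:-1]
    P_hat = (U + U.T) / 2
    return P_hat, theta[-1]

def true_P(A, B, K, Q, R, gamma=0.9):
    """Solve P = M + gamma * L^T P L exactly via vectorization."""
    L = A + B @ K
    M = Q + K.T @ R @ K
    n = A.shape[0]
    lhs = np.eye(n * n) - gamma * np.kron(L.T, L.T)
    return np.linalg.solve(lhs, M.reshape(-1)).reshape(n, n)
```

Comparing the output of `lstd_lqr` with `true_P` on a small stable instance, for increasing `T`, gives a direct view of how the relative error of the estimate shrinks with the trajectory length.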

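Experimental Results

Research Questions

RQ1: Under a fixed stabilizing policy, what is the finite-sample complexity of estimating V^π with LSTD?
RQ2: How does the minimum eigenvalue of the empirical covariance concentrate along a fast-mixing trajectory, and how does this affect the LSTD error bound?
RQ3: For LQR, how does model-free LSPI compare with model-based methods in data efficiency and robustness?
RQ4: Can the LQR setting be used to extend existing covariance concentration results to dependent data generated by mixing processes?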
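
RQ2 and RQ4 concern the smallest eigenvalue of an empirical covariance built from dependent samples. The small simulation below is offered only as an illustration, not as the paper's experiment. It tracks this quantity for the raw state of a stable linear system driven by Gaussian noise. The function name `min_eig_empirical_cov` and the choice of the state itself as the covariate are assumptions made for this sketch.

```python
import numpy as np

def min_eig_empirical_cov(L, T, sigma=1.0, seed=0):
    """Smallest eigenvalue of (1/T) * sum_t x_t x_t^T along one trajectory
    of x_{t+1} = L x_t + w_t, with w_t ~ N(0, sigma^2 I) and L assumed stable."""
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    x = np.zeros(n)
    S = np.zeros((n, n))
    for _ in range(T):
        x = L @ x + sigma * rng.standard_normal(n)
        S += np.outer(x, x)
    # eigvalsh returns eigenvalues in ascending order
    return np.linalg.eigvalsh(S / T)[0]
```

In this toy setting the stationary covariance dominates sigma^2 times the identity, so for long trajectories the returned value should stay bounded away from zero. Running the function over a range of `T` values shows how quickly that happens.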

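Key Findings

LSTD needs roughly n^2/ε^2 samples to estimate the value function of a fixed stabilizing policy to within ε-relative error.
The paper gives a general bound on the minimum eigenvalue of the empirical covariance of a fast-mixing process, extending earlier results for independent covariates.
When specialized to the bounded-covariance case, the trajectory-length requirement improves on prior work, reducing the dependence from d^2 to d in some settings.
Empirical results suggest that, on several LQR instances, model-free LSPI can fall short of model-based methods in sample efficiency and robustness.
The analysis indicates a gap in state dimension between the sample requirement for value function estimation in the model-free setting and the bounds for robust controller computation.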
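
The n^2/ε^2 scaling in the first finding can be turned into a rough back-of-the-envelope calculator. This is only an order-of-magnitude sketch. The constant `c` is a hypothetical placeholder for the problem-dependent factors that the digest does not spell out, and `rough_lstd_samples` is an illustrative name.

```python
import math

def rough_lstd_samples(n, eps, c=1.0):
    """Order-of-magnitude sample count c * n^2 / eps^2.

    n   : state dimension
    eps : target relative error
    c   : hypothetical constant standing in for problem-dependent factors
    """
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    return math.ceil(c * n**2 / eps**2)
```

Under this scaling, doubling the state dimension or halving the target error each multiplies the rough sample count by four.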

This digest was generated by AI and reviewed by a human editor.