QUICK REVIEW

[Paper Digest] Model-Free Linear Quadratic Control via Reduction to Expert Prediction

Yasin Abbasi-Yadkori, Nevena Lazic|arXiv (Cornell University)|Apr 17, 2018

Advanced Bandit Algorithms Research | Cited by 54

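One-Sentence Summary

A model-free adaptive LQ control algorithm with sublinear regret, built on a reduction to expert prediction and a policy-iteration-like scheme with forced exploration.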

ABSTRACT

Model-free approaches for reinforcement learning (RL) and continuous control find policies based only on past states and rewards, without fitting a model of the system dynamics. They are appealing as they are general purpose and easy to implement; however, they also come with fewer theoretical guarantees than model-based RL. In this work, we present a new model-free algorithm for controlling linear quadratic (LQ) systems, and show that its regret scales as $O(T^{ξ+2/3})$ for any small $ξ>0$ if time horizon satisfies $T>C^{1/ξ}$ for a constant $C$. The algorithm is based on a reduction of control of Markov decision processes to an expert prediction problem. In practice, it corresponds to a variant of policy iteration with forced exploration, where the policy in each phase is greedy with respect to the average of all previous value functions. This is the first model-free algorithm for adaptive control of LQ systems that provably achieves sublinear regret and has a polynomial computation cost. Empirically, our algorithm dramatically outperforms standard policy iteration, but performs worse than a model-based approach.

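Research Motivation and Objectives

Provide model-free reinforcement learning for continuous control with theoretical guarantees in the LQ setting.
Develop a model-free algorithm (MFLQ) that achieves sublinear regret in adaptive LQ control.
Give a finite-time analysis that establishes regret bounds and stability under estimation error.
Show empirically that MFLQ outperforms standard policy iteration, and compare it against a model-based approach (which, per the abstract, still performs better).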

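Proposed Method

Reduce MDP control to an expert prediction problem, using Follow-the-Leader combined with a policy that is greedy with respect to the average of past Q-functions.
Use a variant of policy iteration with forced exploration, in which the policy in each phase is greedy with respect to the average of all previous value function estimates.
Estimate the state value function H with least-squares temporal difference learning (LSTD) using a quadratic value form, then project the estimate onto the constraint H ≽ M.
Estimate the state-action value function G from the estimated H and the collected data, which is gathered through exploration and random actions.
Provide two variants (MFLQv1 and MFLQv2) with different data-collection schedules and phase lengths, and derive a regret bound for each.
Prove sublinear regret: Regret_T ≤ C T^{2/3+ξ} for v1 and Regret_T ≤ C T^{3/4+ξ} for v2, when T is sufficiently large.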
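
The greedy and projection steps above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a quadratic Q-function Q(x, a) = [x; a]^T G [x; a], the sign convention a = -K x, and a Frobenius-norm projection onto H ≽ M; the digest does not state the norm or the sign convention. The function names `average_G`, `greedy_gain`, and `project_above` and the example matrices are hypothetical.

```python
import numpy as np

def average_G(G_list):
    """Average the state-action value matrices from all previous phases."""
    return sum(G_list) / len(G_list)

def greedy_gain(G, n):
    """Gain K of the greedy policy a = -K x (assumed sign convention).

    G is partitioned as [[G11, G12], [G21, G22]] with G11 of size n x n.
    Minimizing [x; a]^T G [x; a] over a gives a = -G22^{-1} G21 x.
    """
    G21 = G[n:, :n]
    G22 = G[n:, n:]
    return np.linalg.solve(G22, G21)

def project_above(H, M):
    """Project symmetric H onto {H : H - M is PSD} in Frobenius norm (assumed norm)."""
    D = (H - M + (H - M).T) / 2          # symmetrize the difference
    w, V = np.linalg.eigh(D)             # eigen-decomposition of H - M
    return M + (V * np.clip(w, 0, None)) @ V.T  # drop negative eigenvalues

# Illustrative data: 2 state dimensions, 1 action dimension.
G1 = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 0.0, 1.0]])
G2 = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 1.0], [0.0, 1.0, 3.0]])
K = greedy_gain(average_G([G1, G2]), n=2)

M = np.eye(2)
H_hat = np.diag([3.0, 0.5])              # estimate that violates H >= M
H_proj = project_above(H_hat, M)
```

Averaging the G matrices before taking the greedy step is what distinguishes this scheme from standard policy iteration, which would be greedy with respect to the latest estimate only.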

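Experimental Results

Research Questions

RQ1: Can a model-free adaptive LQ control method achieve sublinear regret?
RQ2: How does reducing MDP control to an expert prediction problem yield a tractable policy with provably good performance in the LQ setting?
RQ3: What finite-time estimation guarantees hold for the value functions and for policy evaluation in this setting?
RQ4: How does the exploration schedule affect the stability and long-run performance of model-free LQ control?
RQ5: Empirically, how does MFLQ compare with policy iteration and with model-based methods?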

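Main Findings

The proposed MFLQ algorithm achieves sublinear regret in the average-cost LQ setting: O(T^{2/3+ξ}) for MFLQv1 and O(T^{3/4+ξ}) for MFLQv2, once T is sufficiently large (the abstract states the condition T > C^{1/ξ} for a constant C).
The algorithm is a model-free adaptation of policy iteration with forced exploration, and it uses a Follow-the-Leader-style update based on the average of past Q-functions.
The value function H and the state-action value function G are estimated by an LSTD-like procedure with finite-sample error bounds, plus a projection step that ensures stability.
When the estimation error is small enough, every policy the algorithm generates remains stable, which keeps the value functions and the state bounded.
Empirically, the MFLQ variants outperform standard policy iteration in the tested LQ scenarios but, per the abstract, still perform worse than the model-based approach.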
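
As a rough illustration of the stability property mentioned above, a linear policy a = -K x is stable for a system x_{t+1} = A x_t + B a_t when the spectral radius of A - B K is below 1. The sketch below checks this condition and rolls out the noiseless closed loop. The matrices A, B, and K are made-up example values, and the sign convention and the helper names `is_stable` and `rollout_norm` are assumptions, not taken from the paper.

```python
import numpy as np

def is_stable(A, B, K):
    """True if the closed loop x_{t+1} = (A - B K) x_t has spectral radius < 1."""
    return max(abs(np.linalg.eigvals(A - B @ K))) < 1.0

def rollout_norm(A, B, K, x0, steps):
    """Norm of the state after `steps` noiseless closed-loop transitions."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = (A - B @ K) @ x
    return np.linalg.norm(x)

# Hypothetical open-loop-unstable system (eigenvalues of A are 1.01).
A = np.array([[1.01, 0.1], [0.0, 1.01]])
B = np.array([[0.0], [0.1]])
K_stabilizing = np.array([[1.0, 1.5]])
K_zero = np.zeros((1, 2))

stable_with_feedback = is_stable(A, B, K_stabilizing)
stable_without_feedback = is_stable(A, B, K_zero)
final_norm = rollout_norm(A, B, K_stabilizing, [1.0, 1.0], steps=200)
```

With the stabilizing gain the state contracts toward the origin, while with zero feedback the same system diverges. Bounded states of this kind are what the digest's stability finding refers to.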

This digest was generated by AI and reviewed by human editors.