QUICK REVIEW

[Paper Review] Naive Exploration is Optimal for Online LQR

Max Simchowitz, Dylan J. Foster|arXiv (Cornell University)|Jan 27, 2020

Advanced Bandit Algorithms Research | References: 48 | Cited by: 33

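One-Sentence Summary

This paper proves that the minimax regret of online LQR is $\widetilde{\Theta}(\sqrt{d_u^2 d_x T})$ and shows that a simple certainty-equivalent policy with continual exploration attains the optimal dimension dependence, supported by a new self-bounding ODE method for perturbation analysis.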

ABSTRACT

We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}({\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T}})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state. Notably, our lower bounds rule out the possibility of a $\mathrm{poly}(\log{}T)$-regret algorithm, which had been conjectured due to the apparent strong convexity of the problem. Our upper bound is attained by a simple variant of $\textit{certainty equivalent control}$, where the learner selects control inputs according to the optimal controller for their estimate of the system while injecting exploratory random noise. While this approach was shown to achieve $\sqrt{T}$-regret by (Mania et al. 2019), we show that if the learner continually refines their estimates of the system matrices, the method attains optimal dimension dependence as well. Central to our upper and lower bounds is a new approach for controlling perturbations of Riccati equations called the $\textit{self-bounding ODE method}$, which we use to derive suboptimality bounds for the certainty equivalent controller synthesized from estimated system dynamics. This in turn enables regret upper bounds which hold for $\textit{any stabilizable instance}$ and scale with natural control-theoretic quantities.
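For orientation, the setting described in the abstract can be written out as below. This is a sketch in standard LQR notation: the symbols $w_t$ (process noise), $R_{\mathbf{x}}$, $R_{\mathbf{u}}$ (cost matrices), and $J_\star$ (optimal average cost) are naming assumptions made here and do not appear in the abstract itself.

```latex
% Dynamics with unknown (A, B); the learner observes x_t and chooses u_t.
x_{t+1} = A x_t + B u_t + w_t

% Regret after T steps: accumulated cost minus T times the optimal
% average cost J_\star of the controller that knows (A, B).
\mathrm{Regret}_T
  = \sum_{t=1}^{T} \left( x_t^\top R_{\mathbf{x}} x_t + u_t^\top R_{\mathbf{u}} u_t \right)
    - T \, J_\star

% Main result stated in the abstract (minimax rate, up to log factors):
\mathrm{Regret}_T = \widetilde{\Theta}\!\left( \sqrt{d_{\mathbf{u}}^2 \, d_{\mathbf{x}} \, T} \right)
```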

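Research Motivation and Objectives

Motivate the study of online adaptive control for LQR systems with unknown dynamics.
Characterize the minimax regret of online LQR when A and B are unknown.
Show that naive exploration attains the optimal rate, including its dimension dependence, and rule out polylog(T) regret.
Introduce the self-bounding ODE method for perturbation analysis of Riccati equations.
Relax the controllability assumption while retaining strong regret guarantees.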

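Proposed Method

Derive a local minimax lower bound showing that every algorithm incurs regret $\widetilde{\Omega}(\sqrt{d_u^2 d_x T})$.
Introduce the self-bounding ODE method to bound perturbations of solutions to the discrete algebraic Riccati equation (DARE).
Prove the perturbation bound $|J_{A,B}[\widehat{K}] - J_\star| \le C \cdot \|P_\infty(A,B)\|_{\mathrm{op}}^8 \cdot (\|\widehat{A}-A\|_F^2 + \|\widehat{B}-B\|_F^2)$, where $C$ is a constant.
Propose Algorithm 1: certainty-equivalent control with continual epsilon-greedy exploration and re-estimation once per epoch, with epoch lengths that double.
Show that the estimation error decays as $O(1/\sqrt{t})$ in a $d_x d_u$-dimensional subspace and as $O(1/t)$ in the remaining $d_x^2$ dimensions, which yields the regret bound.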
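The loop described in the steps above can be sketched for a scalar system (d_x = d_u = 1) using only the Python standard library. This is an illustrative sketch, not the paper's Algorithm 1: the function names, the unit cost weights, the warm-up rule (standing in for the paper's safeguards), the pooling of all past data in the least-squares fit, and the constant 1 in the exploration scale are all assumptions made here. It also assumes the initial gain `k0` already stabilizes the true system.

```python
import random


def solve_scalar_dare(a, b, q=1.0, r=1.0, iters=1000):
    """Scalar DARE p = q + a^2 p - (a b p)^2 / (r + b^2 p), by fixed-point iteration."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p


def ce_gain(a, b, q=1.0, r=1.0):
    """Optimal LQR gain for the model (a, b); the control law is u = k * x."""
    p = solve_scalar_dare(a, b, q, r)
    return -a * b * p / (r + b * b * p)


def ols_from_sums(sxx, sxu, suu, sxy, suy):
    """Least squares for x_next ~ a*x + b*u from running sums (2x2 normal equations)."""
    det = sxx * suu - sxu * sxu
    a_hat = (suu * sxy - sxu * suy) / det
    b_hat = (sxx * suy - sxu * sxy) / det
    return a_hat, b_hat


def run_ce_control(a, b, k0=0.0, num_epochs=12, warmup_epochs=6, seed=0):
    """Certainty-equivalent control with doubling epochs and decaying exploration noise.

    Returns one tuple (epoch_length, a_hat, b_hat, gain_for_next_epoch) per epoch.
    """
    rng = random.Random(seed)
    x, k = 0.0, k0
    sxx = sxu = suu = sxy = suy = 0.0
    log = []
    for epoch in range(1, num_epochs + 1):
        length = 2 ** epoch  # epoch lengths double
        warm = epoch <= warmup_epochs
        # Assumption: unit-variance noise during warm-up, afterwards
        # exploration variance 1 / sqrt(length), i.e. std length ** -0.25.
        sigma = 1.0 if warm else length ** -0.25
        for _ in range(length):
            u = k * x + sigma * rng.gauss(0.0, 1.0)
            x_next = a * x + b * u + rng.gauss(0.0, 1.0)
            sxx += x * x
            sxu += x * u
            suu += u * u
            sxy += x * x_next
            suy += u * x_next
            x = x_next
        a_hat, b_hat = ols_from_sums(sxx, sxu, suu, sxy, suy)
        if epoch >= warmup_epochs:
            # Re-synthesize the controller from the refreshed estimate.
            k = ce_gain(a_hat, b_hat)
        log.append((length, a_hat, b_hat, k))
    return log


if __name__ == "__main__":
    for length, a_hat, b_hat, k in run_ce_control(a=0.9, b=1.0):
        print(f"len={length:5d}  a_hat={a_hat:.3f}  b_hat={b_hat:.3f}  k={k:.3f}")
```

Running the script prints the estimates after each epoch; the exploration noise shrinks as epochs grow, while the controller is recomputed from the latest estimate at every epoch boundary.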

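Experimental Results

Research Questions

RQ1: Can logarithmic regret (polylog T) be achieved in online LQR with unknown dynamics?
RQ2: Does naive exploration (certainty equivalence with exploration noise) achieve optimal regret, or are more sophisticated strategies needed?
RQ3: What is the exact dimension-dependent minimax regret for online LQR?
RQ4: Can perturbations of the Riccati solution be controlled under weak (non-controllable) conditions?
RQ5: How does continual re-estimation of the system matrices affect regret and stability?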

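Main Findings

A local minimax lower bound, together with the global minimax bound, establishes regret of $\widetilde{\Omega}(\sqrt{d_u^2 d_x T})$ for online LQR.
Certainty-equivalent control with continual epsilon-greedy exploration achieves regret $\widetilde{O}(\sqrt{d_u^2 d_x T} + d_x^2)$, matching the lower bound.
The bounds hold without requiring controllability and depend only on natural control-theoretic quantities.
The self-bounding ODE method is introduced for perturbation analysis of Riccati equations.
Optimal dimension dependence is obtained by refining the estimates on a doubling epoch schedule.
Epsilon-greedy exploration is confirmed to be asymptotically rate-optimal for online LQR.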
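The quadratic dependence of the certainty-equivalent suboptimality gap on the estimation error can be checked numerically in the scalar case. The sketch below is a sanity check under stated assumptions, not a reproduction of the paper's analysis: the system (a = 0.9, b = 1), the unit cost weights, unit noise variance, the equal perturbation of both parameters, and all function names are chosen here for illustration. It uses the closed-form positive root of the scalar DARE.

```python
import math


def dare_scalar(a, b, q=1.0, r=1.0):
    """Positive root of b^2 p^2 + (r - q b^2 - a^2 r) p - q r = 0.

    This quadratic is the scalar DARE p = q + a^2 p - (a b p)^2 / (r + b^2 p)
    after multiplying through by (r + b^2 p). Requires b != 0.
    """
    c1 = r - q * b * b - a * a * r
    return (-c1 + math.sqrt(c1 * c1 + 4.0 * b * b * q * r)) / (2.0 * b * b)


def ce_gain(a, b, q=1.0, r=1.0):
    """Optimal gain for the model (a, b); the control law is u = k * x."""
    p = dare_scalar(a, b, q, r)
    return -a * b * p / (r + b * b * p)


def average_cost(a, b, k, q=1.0, r=1.0):
    """Steady-state cost of u = k * x on the true system with unit-variance noise."""
    rho = a + b * k  # closed-loop pole
    if abs(rho) >= 1.0:
        return math.inf  # unstable closed loop
    return (q + r * k * k) / (1.0 - rho * rho)


def ce_gap(a, b, eps, q=1.0, r=1.0):
    """J(k_hat) - J_star when the gain is synthesized from (a + eps, b + eps)."""
    k_hat = ce_gain(a + eps, b + eps, q, r)
    # With unit noise variance the optimal average cost equals the DARE solution.
    return average_cost(a, b, k_hat, q, r) - dare_scalar(a, b, q, r)


if __name__ == "__main__":
    for eps in (0.04, 0.02, 0.01):
        gap = ce_gap(0.9, 1.0, eps)
        print(f"eps={eps:.2f}  gap={gap:.3e}  gap/eps^2={gap / eps ** 2:.3f}")
```

If the gap is quadratic in the error, halving `eps` should shrink the gap by roughly a factor of four, so the printed ratio `gap/eps^2` should stay roughly constant.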

This review was generated by AI and reviewed by human editors.