QUICK REVIEW

[Paper Review] Naive Exploration is Optimal for Online LQR

Max Simchowitz, Dylan J. Foster|arXiv (Cornell University)|Jan 27, 2020

Advanced Bandit Algorithms Research | References: 48 | Citations: 33

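One-Line Summary

This paper proves that the minimax regret of online LQR is $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$ and shows, using a new self-bounding ODE perturbation method, that a simple certainty-equivalent strategy with continual exploration is optimal in its dimension dependence.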

ABSTRACT

We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state. Notably, our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been conjectured due to the apparent strong convexity of the problem. Our upper bound is attained by a simple variant of $\textit{certainty equivalent control}$, where the learner selects control inputs according to the optimal controller for their estimate of the system while injecting exploratory random noise. While this approach was shown to achieve $\sqrt{T}$-regret by (Mania et al. 2019), we show that if the learner continually refines their estimates of the system matrices, the method attains optimal dimension dependence as well. Central to our upper and lower bounds is a new approach for controlling perturbations of Riccati equations called the $\textit{self-bounding ODE method}$, which we use to derive suboptimality bounds for the certainty equivalent controller synthesized from estimated system dynamics. This in turn enables regret upper bounds which hold for $\textit{any stabilizable instance}$ and scale with natural control-theoretic quantities.

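Motivation and Objectives

Explain the motivation for studying online adaptive control of unknown LQR systems.
Characterize the minimax regret bound for online LQR with unknown system matrices $A$ and $B$.
Show that naive exploration attains the dimension-optimal rate, and rule out $\mathrm{poly}(\log T)$ regret.
Introduce the self-bounding ODE method for perturbing Riccati equations.
Relax the controllability assumption while keeping strong regret guarantees.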

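Proposed Method

Derive a local minimax lower bound showing $\widetilde{\Omega}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$ regret for any algorithm.
Develop the self-bounding ODE method to bound perturbations of the solution to the discrete algebraic Riccati equation (DARE).
Prove the perturbation bound $|J_{A,B}[\widehat{K}] - J_\star| \le C \cdot \|P_\infty(A,B)\|_{\mathrm{op}}^8 \cdot (\|\widehat{A}-A\|_F^2 + \|\widehat{B}-B\|_F^2)$ for a constant $C$.
Propose Algorithm 1: certainty-equivalent control with continual epsilon-greedy exploration and epoch-based re-estimation on a doubling epoch schedule.
Show that the estimation error decays as $1/\sqrt{t}$ in a $d_{\mathbf{x}} d_{\mathbf{u}}$-dimensional subspace and as $1/t$ in the remaining $d_{\mathbf{x}}^2$ dimensions, which yields the regret bound.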
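
The perturbation bound above says that the cost of a certainty-equivalent controller degrades quadratically in the estimation error. The sketch below checks that scaling numerically on a scalar system ($d_{\mathbf{x}} = d_{\mathbf{u}} = 1$), where the Riccati equation and the average cost have simple closed forms. The function names, the unit costs `q = r = 1`, and the values `a = b = 1` are illustrative assumptions, not taken from the paper, and the check only probes the quadratic dependence on the error, not the $\|P_\infty\|_{\mathrm{op}}^8$ factor.

```python
def riccati_scalar(a, b, q=1.0, r=1.0, iters=10_000, tol=1e-13):
    """Fixed point of the scalar discrete algebraic Riccati equation."""
    p = q
    for _ in range(iters):
        # p = q + a^2 p - (a b p)^2 / (r + b^2 p), written in one fraction.
        p_next = q + a * a * p * r / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p


def optimal_gain(a, b, q=1.0, r=1.0):
    """Gain k such that u = k * x is optimal for the model (a, b)."""
    p = riccati_scalar(a, b, q, r)
    return -(a * b * p) / (r + b * b * p)


def average_cost(a, b, k, q=1.0, r=1.0, noise_var=1.0):
    """Steady-state cost of u = k * x on x' = a*x + b*u + w, Var(w) = noise_var."""
    rho = a + b * k  # closed-loop pole
    if abs(rho) >= 1.0:
        return float("inf")  # unstable closed loop
    return (q + r * k * k) * noise_var / (1.0 - rho * rho)


def suboptimality_gap(a, b, eps, q=1.0, r=1.0):
    """Cost gap on the true system when the gain is designed for (a+eps, b+eps)."""
    k_hat = optimal_gain(a + eps, b + eps, q, r)
    k_star = optimal_gain(a, b, q, r)
    return average_cost(a, b, k_hat, q, r) - average_cost(a, b, k_star, q, r)


a_true, b_true = 1.0, 1.0  # illustrative values only
for eps in (0.02, 0.01, 0.005):
    print(eps, suboptimality_gap(a_true, b_true, eps))
```

If the gap is quadratic in the error, halving `eps` should shrink the printed gap by roughly a factor of four.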

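Results

Research Questions

RQ1: Is logarithmic ($\mathrm{poly}(\log T)$) regret achievable for online LQR with unknown dynamics?
RQ2: Does naive exploration (certainty equivalence with exploration) attain optimal regret, or is a more sophisticated strategy needed?
RQ3: What exactly is the dimension-dependent minimax regret bound for online LQR?
RQ4: Can perturbations of the Riccati solution be controlled under weaker conditions, namely stabilizability rather than controllability?
RQ5: How does continual re-estimation of the system matrices affect regret and stability?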

Key Findings

Local and global minimax lower bounds establish $\widetilde{\Omega}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$ regret for online LQR.
Certainty-equivalent control with continual epsilon-greedy exploration attains $\widetilde{O}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T} + d_{\mathbf{x}}^2)$ regret, matching the lower bound in its leading term up to logarithmic factors.
Bounds hold without requiring controllability and depend only on natural control-theoretic quantities.
Introduces the self-bounding ODE method for the perturbation analysis of Riccati equations.
Estimation refinement via doubling epoch schedule enables optimal dimension dependence.
Confirms that epsilon-greedy exploration is asymptotically rate-optimal for online LQR.
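
The strategy described in these findings can be sketched on a scalar system as follows: play the certainty-equivalent gain for the current estimate, add Gaussian exploration noise to every input, and re-estimate by least squares at epoch boundaries that double in length. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the warm-up length, the unit costs, the known stabilizing `k_init`, and the exploration variance of $1/\sqrt{t}$ fixed at each epoch start are all choices made here for the sketch, and the paper's safeguards are not reproduced.

```python
import math
import random


def riccati_gain(a, b, q=1.0, r=1.0, iters=500):
    """Certainty-equivalent gain for the scalar model x' = a*x + b*u + w."""
    p = q
    for _ in range(iters):
        # Scalar Riccati recursion, iterated towards its fixed point.
        p = q + a * a * p * r / (r + b * b * p)
    return -(a * b * p) / (r + b * b * p)


def least_squares(xs, us, ys, ridge=1e-6):
    """Ridge-regularized fit of y ~ a*x + b*u via the 2x2 normal equations."""
    sxx = sum(x * x for x in xs) + ridge
    suu = sum(u * u for u in us) + ridge
    sxu = sum(x * u for x, u in zip(xs, us))
    sxy = sum(x * y for x, y in zip(xs, ys))
    suy = sum(u * y for u, y in zip(us, ys))
    det = sxx * suu - sxu * sxu
    return (suu * sxy - sxu * suy) / det, (sxx * suy - sxu * sxy) / det


def run_ce_control(a_true, b_true, k_init, horizon, warmup=64, seed=0):
    """Certainty equivalence with exploration noise and doubling epochs.

    Assumes k_init stabilizes the true system (hypothetical prior knowledge).
    Returns the last estimates, the last gain, and the total quadratic cost.
    """
    rng = random.Random(seed)
    xs, us, ys = [], [], []
    a_hat = b_hat = None
    x, k, sigma_sq, total_cost = 0.0, k_init, 1.0, 0.0
    next_update = warmup  # epoch boundaries: warmup, 2*warmup, 4*warmup, ...
    for t in range(1, horizon + 1):
        if t == next_update:
            # Re-estimate from all data so far, then redesign the controller.
            a_hat, b_hat = least_squares(xs, us, ys)
            k = riccati_gain(a_hat, b_hat)
            sigma_sq = t ** -0.5  # assumed schedule: variance ~ 1/sqrt(t)
            next_update *= 2
        # Exploit the current gain and explore with injected Gaussian noise.
        u = k * x + rng.gauss(0.0, math.sqrt(sigma_sq))
        total_cost += x * x + u * u
        x_next = a_true * x + b_true * u + rng.gauss(0.0, 1.0)
        xs.append(x)
        us.append(u)
        ys.append(x_next)
        x = x_next
    return a_hat, b_hat, k, total_cost


a_hat, b_hat, k_final, cost = run_ce_control(1.0, 1.0, k_init=-0.5, horizon=4096)
print(a_hat, b_hat, k_final, cost / 4096)
```

Holding the exploration variance fixed within an epoch keeps the controller stationary between updates, which is the simplest way to mirror the epoch-based re-estimation described above.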

This review was generated by AI and checked by a human editor.