QUICK REVIEW

[Paper Review] Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret

Alon Cohen, Tomer Koren, Yishay Mansour | arXiv (Cornell University) | Feb 17, 2019

Advanced Bandit Algorithms Research | 27 references | 20 citations

TL;DR

This paper presents the first computationally efficient algorithm for learning Linear-Quadratic Regulators (LQR) with $\tilde{O}(\sqrt{T})$ regret, resolving a long-standing open problem. The algorithm reformulates the LQR problem as a sequence of convex semi-definite programs whose optimistic relaxations tighten as data accumulates, balancing exploration and exploitation to achieve near-optimal regret with polynomial-time computation.

ABSTRACT

We present the first computationally-efficient algorithm with $\widetilde O(\sqrt{T})$ regret for learning in Linear Quadratic Control systems with unknown dynamics. By that, we resolve an open question of Abbasi-Yadkori and Szepesvári (2011) and Dean, Mania, Matni, Recht, and Tu (2018).

Motivation & Objective

To resolve the open problem of achieving $\tilde{O}(\sqrt{T})$ regret in LQR control with computationally efficient algorithms.
To design a learning algorithm that balances exploration and exploitation in unknown LQR systems without incurring exponential computational costs.
To provide a polynomial-time algorithm that matches the statistical regret bound of prior work while being practically implementable.
To establish a framework where semi-definite relaxation of the infinite-horizon LQR problem yields increasingly accurate approximations as data accumulates.
To extend the optimism-in-the-face-of-uncertainty principle to continuous-state LQR systems with provable efficiency and regret bounds.

Proposed method

Reformulate the infinite-horizon LQR problem as a convex semi-definite program (SDP) to enable efficient optimization.
Use a sequence of SDP relaxations to generate 'optimistic' policies that assume favorable system dynamics based on current estimates.
Maintain confidence sets over unknown system parameters using least-squares estimation and high-probability concentration bounds.
Apply the Hanson-Wright inequality and $\rho$-net arguments to control the tail behavior of state and action norms during analysis.
Use trace and operator norm inequalities to bound the estimation error of the system dynamics matrix $(A_0, B_0)$.
Leverage the structure of the information matrix $V$ to derive inverse norm bounds, ensuring parameter estimation accuracy over time.
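
For orientation, the SDP view in the first two items can be written out roughly as follows. This is a sketch under assumed notation, not a transcription of the paper: $(A_\star, B_\star)$ stands for the true dynamics, $(A_t, B_t)$ and $V_t$ for the current least-squares estimate and information matrix, $W$ for the noise covariance, and $\mu$ for a confidence-width constant. The exact form of the relaxed constraint and its constants should be checked against the paper.

```latex
% Known-dynamics LQR as an SDP over the joint state-action covariance \Sigma.
\begin{aligned}
\text{minimize}\quad & \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} \bullet \Sigma \\
\text{subject to}\quad & \Sigma_{xx} = (A_\star \; B_\star)\, \Sigma\, (A_\star \; B_\star)^{\top} + W, \\
& \Sigma \succeq 0,
\qquad
\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xu} \\ \Sigma_{ux} & \Sigma_{uu} \end{pmatrix}.
\end{aligned}

% Optimistic relaxation around the estimate (A_t, B_t).
% Assumed form: the equality is loosened by a term that shrinks as V_t grows.
\Sigma_{xx} \succeq (A_t \; B_t)\, \Sigma\, (A_t \; B_t)^{\top} + W
  - \mu \, (\Sigma \bullet V_t^{-1})\, I

% A linear policy u = K x is then read off from the solution.
K = \Sigma_{ux} \Sigma_{xx}^{-1}
```

Because the subtracted term scales with $V_t^{-1}$, the relaxed feasible set approaches the exact one as more data is collected, which is the sense in which the relaxations become increasingly accurate.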

Experimental results

Research questions

RQ1: Can a computationally efficient algorithm achieve $\tilde{O}(\sqrt{T})$ regret in LQR control with unknown dynamics?
RQ2: Is it possible to maintain optimism in the face of uncertainty in continuous LQR systems using convex optimization?
RQ3: How can semi-definite programming be used to approximate the infinite-horizon LQR cost function while ensuring convergence?
RQ4: What is the relationship between the sample size $T_0$ and the accuracy of the estimated system parameters in the presence of noise?
RQ5: Can the algorithm balance exploration and exploitation without requiring non-convex optimization at each step?

Key findings

The proposed algorithm achieves $\tilde{O}(\sqrt{T})$ regret for LQR control with unknown dynamics, matching the statistical lower bound up to logarithmic factors.
The algorithm runs in polynomial time per iteration, resolving the computational inefficiency of prior $O(\sqrt{T})$-regret methods.
The estimation error of the system parameters decays as $O(1/\sqrt{T})$, with high probability, due to the growth of the information matrix $V$.
The smallest eigenvalue of the information matrix $V$ is lower bounded by $\Omega(T_0 \sigma^2)$, ensuring invertibility and stable learning.
With high probability, the trace of the estimation error matrix is bounded by $O(n^2 \sigma^2 \log(T_0 / \delta))$, where $n$ is the state-action dimension.
The algorithm ensures that the policy remains stable and cost-bounded throughout learning, even under initial uncertainty.
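
The findings about the estimation error and the information matrix $V$ can be illustrated numerically. The sketch below is not the paper's algorithm: it only simulates a small hand-picked system under a fixed stabilizing gain with injected exploration noise, fits $(A\ B)$ by ridge least squares, and reports the Frobenius error alongside the smallest eigenvalue of $V$. The function names `simulate` and `estimate`, the example matrices, the gain `K`, and the regularizer `lam` are all assumptions made for illustration. The ridge penalty here pulls toward zero, which is a simplification.

```python
import numpy as np


def simulate(A, B, K, T, sigma, rng):
    """Roll out x_{t+1} = A x_t + B u_t + w_t under u_t = K x_t + noise.

    Returns Z (T x (d+k)) with rows z_t = (x_t, u_t), and X_next (T x d).
    """
    d, k = B.shape
    x = np.zeros(d)
    Z = np.zeros((T, d + k))
    X_next = np.zeros((T, d))
    for t in range(T):
        u = K @ x + sigma * rng.standard_normal(k)  # exploratory input
        w = sigma * rng.standard_normal(d)          # process noise
        x_new = A @ x + B @ u + w
        Z[t] = np.concatenate([x, u])
        X_next[t] = x_new
        x = x_new
    return Z, X_next


def estimate(Z, X_next, lam=1.0):
    """Ridge least squares for Theta = (A B), plus the information matrix V."""
    n = Z.shape[1]
    V = lam * np.eye(n) + Z.T @ Z                   # V = lam*I + sum_t z_t z_t^T
    Theta_hat = np.linalg.solve(V, Z.T @ X_next).T  # shape (d, d+k)
    return Theta_hat, V


# Hypothetical 2-state, 1-input system (not taken from the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-1.0, -1.5]])  # hand-picked gain; A + B K is stable
Theta_true = np.hstack([A, B])

rng = np.random.default_rng(0)
for T in (200, 5000):
    Z, X_next = simulate(A, B, K, T, sigma=1.0, rng=rng)
    Theta_hat, V = estimate(Z, X_next)
    err = np.linalg.norm(Theta_hat - Theta_true)  # Frobenius norm
    lam_min = np.linalg.eigvalsh(V)[0]            # smallest eigenvalue of V
    print(f"T={T:5d}  error={err:.3f}  lambda_min(V)={lam_min:.1f}")
```

On a run like this, one would expect the smallest eigenvalue of $V$ to grow roughly linearly in $T$ and the error to shrink roughly like $1/\sqrt{T}$, in line with the scaling described above. The exact numbers depend on the seed and the chosen system.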

This review was created by AI and reviewed by human editors.