QUICK REVIEW

[Paper Review] Naive Exploration is Optimal for Online LQR

Max Simchowitz, Dylan J. Foster|arXiv (Cornell University)|2020. 01. 27.

Advanced Bandit Algorithms Research | References: 48 | Citations: 33

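One-Line Summary

This paper proves that the minimax regret for online LQR is $\widetilde{\Theta}(\sqrt{d_u^2 d_x T})$ and shows, using a new self-bounding ODE perturbation method, that a simple certainty-equivalent strategy with continual exploration is optimal in its dimension dependence.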

ABSTRACT

We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state. Notably, our lower bounds rule out the possibility of a $\mathrm{poly}(\log{}T)$-regret algorithm, which had been conjectured due to the apparent strong convexity of the problem. Our upper bound is attained by a simple variant of $\textit{certainty equivalent control}$, where the learner selects control inputs according to the optimal controller for their estimate of the system while injecting exploratory random noise. While this approach was shown to achieve $\sqrt{T}$-regret by (Mania et al. 2019), we show that if the learner continually refines their estimates of the system matrices, the method attains optimal dimension dependence as well. Central to our upper and lower bounds is a new approach for controlling perturbations of Riccati equations called the $\textit{self-bounding ODE method}$, which we use to derive suboptimality bounds for the certainty equivalent controller synthesized from estimated system dynamics. This in turn enables regret upper bounds which hold for $\textit{any stabilizable instance}$ and scale with natural control-theoretic quantities.
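
For orientation, the setting the abstract refers to can be written out as below. The symbols $Q$, $R$, $w_t$, and $\mathrm{Regret}_T$ follow standard LQR notation and are assumptions of this sketch; they do not appear in the abstract itself.

$$
\begin{aligned}
x_{t+1} &= A x_t + B u_t + w_t, \\
J_{A,B}[K] &= \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=1}^{T} \big(x_t^\top Q x_t + u_t^\top R u_t\big)\Big] \quad \text{with } u_t = K x_t, \\
\mathrm{Regret}_T &= \sum_{t=1}^{T} \big(x_t^\top Q x_t + u_t^\top R u_t\big) - T \min_{K} J_{A,B}[K].
\end{aligned}
$$

Here $x_t \in \mathbb{R}^{d_{\mathbf{x}}}$ is the state, $u_t \in \mathbb{R}^{d_{\mathbf{u}}}$ is the input, and the learner does not know $(A, B)$.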

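Research Motivation and Goals

Advance the study of online adaptive control for unknown LQR systems.
Characterize the minimax regret for online LQR with unknown $A$ and $B$.
Show that naive exploration attains the dimension-optimal rate, and rule out polylog(T) regret.
Introduce the self-bounding ODE method for controlling perturbations of Riccati equations.
Relax controllability assumptions while retaining strong regret guarantees.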

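Proposed Method

Derive a local minimax lower bound showing that every algorithm incurs regret $\widetilde{\Omega}(\sqrt{d_u^2 d_x T})$.
Develop the self-bounding ODE method to bound perturbations of the solution to the discrete algebraic Riccati equation (DARE).
Prove the perturbation bound $|J_{A,B}[\hat{K}] - J^\star| \le C \, \|P_\infty(A,B)\|_{\mathrm{op}}^8 \, (\|\hat{A}-A\|_F^2 + \|\hat{B}-B\|_F^2)$ for a constant $C$, valid when the estimation error is sufficiently small.
Propose Algorithm 1: certainty equivalent control with continual epsilon-greedy exploration and epoch-based re-estimation, using an epoch schedule that doubles in length.
Obtain the regret bound by showing that the estimation error decays as $O(1/\sqrt{t})$ on a $d_x d_u$-dimensional subspace and as $O(1/t)$ in the remaining $d_x^2$ dimensions.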
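
The steps above can be sketched in Python as follows. This is a minimal illustration, not the paper's Algorithm 1 verbatim: the function names, the warm-up length `first_epoch=64`, the exploration scale `t ** -0.25` (variance proportional to $1/\sqrt{t}$), and the plain Riccati iteration are all assumptions made here, and the paper's constants and safety checks are omitted. It also assumes an initial stabilizing gain `K0` is available.

```python
import numpy as np

def solve_dare(A, B, Q, R, iters=500):
    """Approximate the DARE solution P by Riccati (value) iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)
        P = 0.5 * (P + P.T)  # keep P numerically symmetric
    return P

def ce_gain(A, B, Q, R):
    """Certainty-equivalent gain K (u = K x), optimal for the model (A, B)."""
    P = solve_dare(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def run_ce_control(A, B, Q, R, K0, T, rng, first_epoch=64):
    """Play T steps on the true system (A, B), re-estimating at doubling epochs."""
    dx, du = B.shape
    x, K = np.zeros(dx), K0
    Z, Y = [], []  # regressors [x; u] and observed next states
    total_cost, sigma, next_update = 0.0, 1.0, first_epoch
    for t in range(1, T + 1):
        # Current controller plus injected Gaussian exploration noise.
        u = K @ x + sigma * rng.standard_normal(du)
        total_cost += x @ Q @ x + u @ R @ u
        x_next = A @ x + B @ u + rng.standard_normal(dx)
        Z.append(np.concatenate([x, u]))
        Y.append(x_next)
        x = x_next
        if t == next_update:
            # Least-squares estimate of [A B] from all data collected so far.
            theta, *_ = np.linalg.lstsq(np.array(Z), np.array(Y), rcond=None)
            A_hat, B_hat = theta.T[:, :dx], theta.T[:, dx:]
            K = ce_gain(A_hat, B_hat, Q, R)  # certainty-equivalent update
            sigma = t ** -0.25               # shrink exploration (assumed schedule)
            next_update *= 2                 # doubling epoch length
    return total_cost, K

if __name__ == "__main__":
    # Hypothetical 2-state, 1-input test system (open-loop stable, so K0 = 0 works).
    A = np.array([[0.9, 0.3], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    cost, K = run_ce_control(A, B, Q, R, np.zeros((1, 2)), 2000,
                             np.random.default_rng(0))
    print("average cost:", cost / 2000)
    print("final gain:", K, "optimal gain:", ce_gain(A, B, Q, R))
```

The learner never uses the true `A` and `B` except to simulate the system; every controller update comes from the least-squares estimate alone, which is the sense in which the exploration is "naive".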

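Results

Research Questions

RQ1: Is logarithmic (polylog(T)) regret achievable in online LQR with unknown dynamics?
RQ2: Does naive exploration (certainty equivalence with exploration) achieve optimal regret, or is a more sophisticated strategy needed?
RQ3: What is the precise dimension-dependent minimax regret for online LQR?
RQ4: Can perturbations of the Riccati solution be controlled under weaker conditions, without assuming controllability?
RQ5: How does continual re-estimation of the system matrices affect regret and stability?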

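Main Results

The local minimax analysis yields a lower bound of $\widetilde{\Omega}(\sqrt{d_u^2 m T})$ under a non-degeneracy condition, for the best admissible choice of $m$; when $d_u \le d_x/2$, this becomes $\widetilde{\Omega}(\sqrt{d_u^2 d_x T})$.
The upper bound shows that certainty equivalent control with continual epsilon-greedy exploration achieves regret $\widetilde{O}(\sqrt{d_u^2 d_x T} + d_x^2)$ on every stabilizable online LQR instance, without requiring controllability.
The results depend only on natural control-theoretic quantities and require no prior knowledge of the system parameters.
The self-bounding ODE method gives perturbation bounds for $P_\infty$ and $K_\infty$ that scale with $\|P_\infty\|_{\mathrm{op}}$ and do not depend on the smallest singular value of the controllability matrix.
The analysis balances exploration against estimation error and exploits directions orthogonal to the perturbation subspace to obtain the optimal dimension dependence.
Together, the lower and upper bounds establish that the minimax regret is asymptotically $\widetilde{\Theta}(\sqrt{d_u^2 d_x T})$.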
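
As a rough numerical illustration of the perturbation result, the sketch below checks that the suboptimality of a certainty-equivalent controller shrinks quadratically with the model error. It is not the self-bounding ODE argument itself. The test system, the perturbation directions `dA` and `dB`, and all function names are hypothetical choices made for this example, and unit-covariance process noise is assumed.

```python
import numpy as np

def ce_gain(A, B, Q, R, iters=500):
    """Gain K (u = K x) that is LQR-optimal for the model (A, B), via Riccati iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)
        P = 0.5 * (P + P.T)  # keep P numerically symmetric
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def lqr_cost(A, B, Q, R, K):
    """Steady-state average cost J_{A,B}[K]; requires A + B K to be stable."""
    L = A + B @ K
    n = A.shape[0]
    M = Q + K.T @ R @ K
    # Solve the Lyapunov equation P = M + L^T P L through its vectorized form.
    P = np.linalg.solve(np.eye(n * n) - np.kron(L.T, L.T), M.reshape(-1))
    return np.trace(P.reshape(n, n))

def ce_gap(A, B, Q, R, dA, dB, eps):
    """J[K_hat] - J_star, where K_hat is designed for (A + eps*dA, B + eps*dB)."""
    K_hat = ce_gain(A + eps * dA, B + eps * dB, Q, R)
    K_star = ce_gain(A, B, Q, R)
    return lqr_cost(A, B, Q, R, K_hat) - lqr_cost(A, B, Q, R, K_star)

if __name__ == "__main__":
    A = np.array([[0.9, 0.3], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    dA = np.array([[1.0, -1.0], [0.5, 1.0]])  # arbitrary error direction
    dB = np.array([[1.0], [-1.0]])
    for eps in (0.004, 0.002):
        print(eps, ce_gap(A, B, Q, R, dA, dB, eps))
```

Halving `eps` should cut the gap by roughly a factor of four, which matches a bound that is quadratic in $\|\hat{A}-A\|_F$ and $\|\hat{B}-B\|_F$.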

This review was generated by AI and reviewed by a human editor.