QUICK REVIEW

[Paper Review] Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems

Sahin Lale, Kamyar Azizzadenesheli|arXiv (Cornell University)|2020. 03. 25.

Advanced Bandit Algorithms Research | References: 61 | Citations: 50

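One-Line Summary

The paper provides the first finite-time system identification method for partially observable linear dynamical systems that works in both open-loop and closed-loop settings, and introduces AdaptOn, an adaptive online learning algorithm that achieves polylogarithmic regret after T time steps.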

ABSTRACT

We study the problem of system identification and adaptive control in partially observable linear dynamical systems. Adaptive and closed-loop system identification is a challenging problem due to correlations introduced in data collection. In this paper, we present the first model estimation method with finite-time guarantees in both open and closed-loop system identification. Deploying this estimation method, we propose adaptive control online learning (AdaptOn), an efficient reinforcement learning algorithm that adaptively learns the system dynamics and continuously updates its controller through online learning steps. AdaptOn estimates the model dynamics by occasionally solving a linear regression problem through interactions with the environment. Using policy re-parameterization and the estimated model, AdaptOn constructs counterfactual loss functions to be used for updating the controller through online gradient descent. Over time, AdaptOn improves its model estimates and obtains more accurate gradient updates to improve the controller. We show that AdaptOn achieves a regret upper bound of $\text{polylog}\left(T\right)$, after $T$ time steps of agent-environment interaction. To the best of our knowledge, AdaptOn is the first algorithm that achieves $\text{polylog}\left(T\right)$ regret in adaptive control of unknown partially observable linear dynamical systems which includes linear quadratic Gaussian (LQG) control.

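Research Motivation and Objectives

Motivate and solve finite-time system identification for partially observable linear dynamical systems (LDS).
Develop a predictor-form estimation method that works in both open-loop and closed-loop settings.
Propose AdaptOn, an online learning algorithm that updates the controller using counterfactual losses.
Prove a polylog(T) regret bound for AdaptOn under strongly convex costs.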

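Proposed Method

Formulate the system in predictor form, using the Kalman gain F and Abar, so that estimation reduces to a regression problem.
Set up a regularized least-squares problem to estimate the G_y-related matrices from input-output data.
Develop SysId, which uses Hankel matrices and Ho-Kalman-style steps to recover (A, B, C) and the Markov-parameter matrix G(H).
Define Nature's y and use b_t(G) to enable counterfactual reasoning for policy evaluation.
Adopt Disturbance Feedback Control (DFC), which gives a convex policy parameterization, together with online gradient updates.
Run AdaptOn in epochs: periodically re-estimate the model and update the controller through online convex optimization with counterfactual losses.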
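The regression step in the list above can be sketched as follows. This is a minimal illustration, not the paper's SysId routine: it only fits the predictor-form map from the past H outputs and inputs to the current output with ridge regression, and it omits the Hankel / Ho-Kalman recovery of (A, B, C). It assumes the standard predictor form with Abar = A - FC. The toy system matrices, noise levels, horizon H, regularizer lam, and the function names simulate_lds and estimate_predictor_markov are all hypothetical choices made for this example.

```python
import numpy as np

def simulate_lds(A, B, C, T, sigma_w, sigma_z, rng):
    """Roll out x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + z_t with i.i.d. Gaussian inputs."""
    n, p = B.shape
    m = C.shape[0]
    x = np.zeros(n)
    U = rng.standard_normal((T, p))
    Y = np.zeros((T, m))
    for t in range(T):
        Y[t] = C @ x + sigma_z * rng.standard_normal(m)
        x = A @ x + B @ U[t] + sigma_w * rng.standard_normal(n)
    return U, Y

def estimate_predictor_markov(U, Y, H, lam=1.0):
    """Ridge regression of y_t on phi_t = [y_{t-1..t-H}, u_{t-1..t-H}].

    In predictor form, the coefficient blocks approximate C Abar^k F (outputs)
    and C Abar^k B (inputs) for k = 0..H-1.
    """
    T = len(Y)
    Phi = np.array([
        np.concatenate([Y[t - H:t][::-1].ravel(), U[t - H:t][::-1].ravel()])
        for t in range(H, T)
    ])
    targets = Y[H:]
    d = Phi.shape[1]
    # Closed-form regularized least squares: G = Y' Phi (Phi' Phi + lam I)^{-1}
    G_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ targets).T
    return G_hat, Phi, targets

# Hypothetical stable toy system (2 states, 1 input, 1 output).
A = np.array([[0.8, 0.1], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])

rng = np.random.default_rng(0)
U, Y = simulate_lds(A, B, C, T=2000, sigma_w=0.1, sigma_z=0.1, rng=rng)
G_hat, Phi, targets = estimate_predictor_markov(U, Y, H=10)

# The coefficient on u_{t-1} should be close to the first Markov parameter C B.
H, m = 10, C.shape[0]
print("estimated C B:", G_hat[:, H * m:H * m + 1].ravel(), "true C B:", (C @ B).ravel())
```

The inputs here are i.i.d. Gaussian (open loop) to keep the sketch short. The point of regressing on past outputs as well as past inputs is that the same least-squares problem remains meaningful when the inputs come from a feedback controller, which is the closed-loop case the review highlights.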

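Experimental Results

Research Questions

RQ1: Can the parameters be estimated with finite-time guarantees under closed-loop estimation?
RQ2: Can an RL algorithm exploit these estimates to achieve significantly lower regret in partially observable LDS?
RQ3: How can counterfactual losses that drive online policy updates be constructed in this setting?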

Key Results

Work	Regret	Cost	Identification	Noise
Lale et al. (2020)	T^{2/3}	Convex	Open-Loop	Stochastic
Simchowitz et al. (2020)	T^{2/3}	Convex	Open-Loop	Adversarial
Mania et al. (2019)	\sqrt{T}	Strongly Convex	Open-Loop	Stochastic
Simchowitz et al. (2020)	\sqrt{T}	Strongly Convex	Open-Loop	Semi-adversarial
This work	polylog(T)	Strongly Convex	Closed-Loop	Stochastic
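To get a feel for the gap between the rates in the table, the short sketch below evaluates each one at a few horizons. The table does not give the exponent of the polylog term, so log(T)^2 is used purely as an assumed stand-in, and all constants are ignored. The function name regret_rates is hypothetical.

```python
import math

def regret_rates(T):
    """Evaluate the three growth rates from the table, ignoring constants."""
    return {
        "T^(2/3)": T ** (2.0 / 3.0),
        "sqrt(T)": math.sqrt(T),
        "log(T)^2": math.log(T) ** 2,  # assumed stand-in for polylog(T)
    }

for T in (10**3, 10**6, 10**9):
    rates = regret_rates(T)
    print(T, {name: round(value, 1) for name, value in rates.items()})
```

With this stand-in, the polylogarithmic rate only drops below sqrt(T) once T is large enough, which is a reminder that these are asymptotic rates and that constants and exponents matter at short horizons.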

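Finite-time system identification guarantee: with persistently exciting inputs, the estimation error decays as Õ(1/√T).
AdaptOn achieves a polylog(T) regret upper bound after T steps under strongly convex losses.
This work provides the first logarithmic regret result for adaptive control of unknown partially observable linear dynamical systems, including LQG.
Closed-loop estimation yields regret that improves on the sqrt(T) bounds of related work.
A corollary extends the result to the near-optimal LQG controller when its DFC approximation lies within the given policy class.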
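The role of strong convexity in these results can be illustrated with a generic example that is separate from AdaptOn itself: online gradient descent on strongly convex losses with step size 1/(alpha t) has regret that grows logarithmically in T, which is the classical mechanism behind logarithmic rates. The sketch below uses scalar quadratic losses f_t(x) = 0.5 * alpha * (x - z_t)^2 on the interval [-1, 1]. The loss family, the data z_t, and the function name ogd_strongly_convex are assumptions made for illustration only; there are no counterfactual losses or model estimates here.

```python
import numpy as np

def ogd_strongly_convex(z, alpha=1.0, radius=1.0):
    """Run projected OGD on f_t(x) = 0.5 * alpha * (x - z_t)^2 and return the regret."""
    x = 0.0
    incurred = 0.0
    for t, z_t in enumerate(z, start=1):
        incurred += 0.5 * alpha * (x - z_t) ** 2       # loss paid before the update
        grad = alpha * (x - z_t)
        # Step size 1 / (alpha * t), then project back onto [-radius, radius].
        x = float(np.clip(x - grad / (alpha * t), -radius, radius))
    best_fixed = float(np.mean(z))                     # best fixed point in hindsight
    best_loss = float(np.sum(0.5 * alpha * (best_fixed - z) ** 2))
    return incurred - best_loss

rng = np.random.default_rng(0)
T = 10_000
z = rng.uniform(-1.0, 1.0, size=T)
regret = ogd_strongly_convex(z)

# Standard bound (G^2 / (2 * alpha)) * (1 + log T) with gradient bound G = 2 on [-1, 1].
bound = (2.0 ** 2 / 2.0) * (1.0 + np.log(T))
print(f"regret = {regret:.3f}, log bound = {bound:.3f}, sqrt(T) = {np.sqrt(T):.1f}")
```

In this toy setting the gradients are exact. In the adaptive control setting the review describes, gradients come from counterfactual losses built on an estimated model, so the accuracy of the estimates also enters the regret.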


This review was generated by AI and reviewed by a human editor.