QUICK REVIEW

[Paper Review] Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems

Sahin Lale, Kamyar Azizzadenesheli|arXiv (Cornell University)|Mar 25, 2020

Advanced Bandit Algorithms Research|References: 61|Citations: 50

One-Sentence Summary

The paper provides the first finite-time system identification method usable in both open- and closed-loop settings for partially observable linear dynamical systems and introduces AdaptOn, an adaptive online learning algorithm achieving polylogarithmic regret in T steps.

ABSTRACT

We study the problem of system identification and adaptive control in partially observable linear dynamical systems. Adaptive and closed-loop system identification is a challenging problem due to correlations introduced in data collection. In this paper, we present the first model estimation method with finite-time guarantees in both open and closed-loop system identification. Deploying this estimation method, we propose adaptive control online learning (AdaptOn), an efficient reinforcement learning algorithm that adaptively learns the system dynamics and continuously updates its controller through online learning steps. AdaptOn estimates the model dynamics by occasionally solving a linear regression problem through interactions with the environment. Using policy re-parameterization and the estimated model, AdaptOn constructs counterfactual loss functions to be used for updating the controller through online gradient descent. Over time, AdaptOn improves its model estimates and obtains more accurate gradient updates to improve the controller. We show that AdaptOn achieves a regret upper bound of $\text{polylog}\left(T\right)$, after $T$ time steps of agent-environment interaction. To the best of our knowledge, AdaptOn is the first algorithm that achieves $\text{polylog}\left(T\right)$ regret in adaptive control of unknown partially observable linear dynamical systems which includes linear quadratic Gaussian (LQG) control.

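Motivation and Objectives

Motivate and address finite-time system identification for partially observable LDS.
Develop a predictor-form estimation method that works in both open-loop and closed-loop settings.
Propose AdaptOn, an online learning algorithm that updates the controller using counterfactual losses.
Prove a polylogarithmic regret bound for AdaptOn under strongly convex costs.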

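Proposed Method

Formulate the system in predictor form, using the Kalman gain F and Abar, so that it can be estimated by regression.
Set up a regularized least-squares problem that estimates the Markov-parameter-related matrix G_y from input-output data.
Develop SysId, which uses a Hankel matrix and a Ho-Kalman-style procedure to recover (A,B,C) and the Markov-parameter matrix G(H).
Define Nature's y and use b_t(G), which enables counterfactual reasoning for policy evaluation.
Adopt Disturbance Feedback Control (DFC), which gives a convex policy parameterization and allows online gradient updates.
Run AdaptOn in epochs that combine online convex optimization on counterfactual losses with periodic re-estimation.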
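The regression and Nature's y steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`simulate`, `fit_predictor_markov`, `natures_y`), the toy system matrices, the i.i.d. Gaussian inputs, the regressor ordering, and the ridge weight `lam` are all assumptions made for illustration. The sketch regresses each output on the last H outputs and inputs to get a G_y-like matrix, then forms b_t(G) = y_t - sum_i G^[i] u_{t-i} using the true Markov parameters C A^{i-1} B, where the algorithm itself would use estimates.

```python
import numpy as np

def simulate(A, B, C, u, w, z):
    """Run x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + z_t from x_0 = 0."""
    T = u.shape[0]
    x = np.zeros(A.shape[0])
    y = np.zeros((T, C.shape[0]))
    for t in range(T):
        y[t] = C @ x + z[t]
        x = A @ x + B @ u[t] + w[t]
    return y

def fit_predictor_markov(y, u, H, lam=1.0):
    """Ridge regression of y_t on the last H outputs and inputs.

    Assumed regressor order: [y_{t-1}, ..., y_{t-H}, u_{t-1}, ..., u_{t-H}].
    Returns a matrix of shape (m, H * (m + p)).
    """
    T = y.shape[0]
    rows = []
    for t in range(H, T):
        past_y = y[t - H:t][::-1].ravel()  # most recent output first
        past_u = u[t - H:t][::-1].ravel()  # most recent input first
        rows.append(np.concatenate([past_y, past_u]))
    Phi = np.array(rows)                   # (T - H, H * (m + p))
    Y = y[H:]                              # (T - H, m)
    d = Phi.shape[1]
    # Regularized normal equations: (Phi^T Phi + lam I) W = Phi^T Y
    W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)
    return W.T

def natures_y(y, u, G):
    """b_t(G) = y_t - sum_{i=1}^{len(G)} G[i-1] u_{t-i}: outputs with input effects removed."""
    b = y.copy()
    for t in range(y.shape[0]):
        for i in range(1, min(len(G), t) + 1):
            b[t] -= G[i - 1] @ u[t - i]
    return b

# Toy stable system (hypothetical values, chosen only for the demo).
rng = np.random.default_rng(0)
A = np.array([[0.6, 0.2], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
T = 2000
u = rng.normal(size=(T, 1))            # exciting inputs (open-loop here for simplicity)
w = 0.1 * rng.normal(size=(T, 2))      # process noise
z = 0.1 * rng.normal(size=(T, 1))      # measurement noise
y = simulate(A, B, C, u, w, z)

G_y = fit_predictor_markov(y, u, H=10)

# True Markov parameters C A^{i-1} B, truncated at 50 terms (the tail is negligible here).
G_true = [C @ np.linalg.matrix_power(A, i) @ B for i in range(50)]
b = natures_y(y, u, G_true)
y_free = simulate(A, B, C, np.zeros_like(u), w, z)  # same noise, zero inputs
```

In this toy run, `b` matches `y_free` up to the truncation error, which is the sense in which Nature's y is the output the system would have produced with no control input. That property is what lets a learner evaluate alternative controllers counterfactually on the same noise sequence.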

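Results

Research Questions

RQ1: Can model parameters be estimated with finite-time guarantees under closed-loop estimation?
RQ2: Can a reinforcement learning algorithm exploit such estimates to achieve significantly smaller regret in partially observable LDS?
RQ3: How should counterfactual losses be constructed to drive online policy updates in this setting?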

Key Findings

Work	Regret	Cost	Identification	Noise
Lale et al. (2020)	T^{2/3}	Convex	Open-Loop	Stochastic
Simchowitz et al. (2020)	T^{2/3}	Convex	Open-Loop	Adversarial
Mania et al. (2019)	\sqrt{T}	Strongly Convex	Open-Loop	Stochastic
Simchowitz et al. (2020)	\sqrt{T}	Strongly Convex	Open-Loop	Semi-adversarial
This work	polylog(T)	Strongly Convex	Closed-Loop	Stochastic

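Finite-time system identification guarantee: with persistently exciting inputs, the estimation error shrinks as $\tilde{O}(1/\sqrt{T})$.
AdaptOn achieves a polylog(T) regret upper bound after T steps under strongly convex losses.
This work provides the first logarithmic-regret result for adaptive control of unknown partially observable linear dynamical systems (including LQG).
Closed-loop estimation yields lower regret than the sqrt(T) upper bounds of related work.
A corollary extends the result to the approximately optimal LQG controller, provided its DFC approximation lies in the policy class under consideration.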
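To give some intuition for why strong convexity matters for the regret rate, here is a toy sketch of projected online gradient descent on scalar, strongly convex quadratic losses. It is not AdaptOn and involves no dynamical system: the loss family, the function name `ogd_strongly_convex`, and all constants are assumptions for illustration. With step size 1/(alpha t), the standard analysis of online gradient descent for alpha-strongly convex losses with gradients bounded by G gives regret of at most G^2/(2 alpha) (1 + log T), which the sketch checks numerically.

```python
import math
import numpy as np

def ogd_strongly_convex(thetas, alpha=1.0, radius=1.0):
    """Projected OGD on f_t(x) = 0.5 * alpha * (x - theta_t)^2 with step 1/(alpha * t).

    Returns the cumulative loss suffered by the learner.
    """
    x = 0.0
    total = 0.0
    for t, theta in enumerate(thetas, start=1):
        total += 0.5 * alpha * (x - theta) ** 2   # loss at round t, before updating
        grad = alpha * (x - theta)
        x = x - grad / (alpha * t)                # step size eta_t = 1 / (alpha * t)
        x = min(max(x, -radius), radius)          # project onto [-radius, radius]
    return total

rng = np.random.default_rng(0)
T = 5000
thetas = rng.uniform(-1.0, 1.0, size=T)           # hypothetical loss minimizers

learner_loss = ogd_strongly_convex(thetas)

# Best fixed decision in hindsight for these quadratics is the mean of thetas.
best_fixed = thetas.mean()
comparator_loss = 0.5 * np.sum((best_fixed - thetas) ** 2)
regret = learner_loss - comparator_loss

# On [-1, 1] with alpha = 1 the gradient magnitude is at most 2, so G = 2.
bound = (2.0 ** 2) / (2 * 1.0) * (1 + math.log(T))
print(f"regret = {regret:.3f}, log-type bound = {bound:.3f}, sqrt(T) = {math.sqrt(T):.3f}")
```

The regret stays under the logarithmic bound and well below sqrt(T). AdaptOn's actual guarantee also has to account for model estimation error, which this toy example ignores entirely.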

This review was generated by AI and checked by a human editor.