QUICK REVIEW

[Paper Review] Logically-Correct Reinforcement Learning.

Mohammadhosein Hasanbeig, Alessandro Abate|arXiv (Cornell University)|Jan 24, 2018

Reinforcement Learning in Robotics|References: 34|Citations: 30

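One-line summary

This paper proposes a reinforcement learning algorithm that synthesizes MDP policies satisfying a linear time property. The property is converted into a Limit Deterministic Buchi Automaton (LDBA), a product MDP is constructed, and rewards are assigned according to the LDBA's accepting conditions. The resulting procedure acts as an online value iteration that computes the maximum probability of satisfying the property, and the number of iterations needed for synthesis drops by an order of magnitude compared with existing approaches.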

ABSTRACT

We propose a novel Reinforcement Learning (RL) algorithm to synthesize policies for a Markov Decision Process (MDP), such that a linear time property is satisfied. We convert the property into a Limit Deterministic Buchi Automaton (LDBA), then construct a product MDP between the automaton and the original MDP. A reward function is then assigned to the states of the product automaton, according to accepting conditions of the LDBA. With this reward function, RL synthesizes a policy that satisfies the property: as such, the policy synthesis procedure is constrained by the given specification. Additionally, we show that the RL procedure sets up an online value iteration method to calculate the maximum probability of satisfying the given property, at any given state of the MDP - a convergence proof for the procedure is provided. Finally, the performance of the algorithm is evaluated via a set of numerical examples. We observe an improvement of one order of magnitude in the number of iterations required for the synthesis compared to existing approaches.

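Research motivation and objectives

Address the problem of synthesizing MDP policies that provably satisfy complex linear time properties.
Integrate formal specification with reinforcement learning through automata-theoretic synthesis.
Compute, online during policy learning, the maximum probability of satisfying the given property.
Reduce the number of learning iterations needed for policy synthesis compared with existing approaches.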

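Proposed method

Convert the linear time property that expresses the desired behavior into a Limit Deterministic Buchi Automaton (LDBA).
Compose the original MDP with the LDBA to construct a product MDP that encodes the joint state space.
Define a reward function on the states of the product MDP, based on the accepting conditions of the LDBA, to guide policy learning.
Run reinforcement learning with this reward function to synthesize a policy that maximizes the probability of satisfying the property.
Use an online value iteration procedure to estimate the maximum satisfaction probability from any MDP state.
Prove convergence of the online value iteration procedure under the proposed reward structure.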
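
The pipeline above (automaton, product MDP, acceptance-based reward, RL) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the four-state line-world MDP, the 0.9 success probability, the hand-written two-state automaton for the property "eventually goal", the choice to end an episode once the accepting state is reached, and plain tabular Q-learning with the hyperparameters shown. The paper's actual construction handles general LDBAs and their accepting conditions, which this toy does not attempt.

```python
import random
from collections import defaultdict

# Hypothetical toy MDP: 4 states on a line, actions "left"/"right".
# A move succeeds with probability 0.9; otherwise the agent stays put.
N_STATES = 4
ACTIONS = ("left", "right")
LABELS = {3: "goal"}  # atomic proposition observed in each MDP state

def mdp_step(s, a, rng):
    if rng.random() < 0.9:
        s = max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)
    return s

# Hand-written automaton for "eventually goal": q0 moves to q1 on "goal";
# q1 is an accepting sink. (A stand-in for an LDBA, assumed for this sketch.)
ACCEPTING = {1}

def automaton_step(q, label):
    return 1 if (q == 0 and label == "goal") else q

def product_step(state, a, rng):
    """One step of the product MDP: move in the MDP, then update the automaton."""
    s, q = state
    s2 = mdp_step(s, a, rng)
    q2 = automaton_step(q, LABELS.get(s2))
    # Reward 1 when the automaton component enters its accepting set.
    reward = 1.0 if (q2 in ACCEPTING and q not in ACCEPTING) else 0.0
    return (s2, q2), reward

def q_learning(episodes=3000, alpha=0.1, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        state = (0, 0)  # MDP state 0, automaton state q0
        for _ in range(50):
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[(state, b)])
            nxt, r = product_step(state, a, rng)
            done = nxt[1] in ACCEPTING  # simplification: stop once accepted
            target = r if done else r + gamma * max(Q[(nxt, b)] for b in ACTIONS)
            Q[(state, a)] += alpha * (target - Q[(state, a)])
            state = nxt
            if done:
                break
    return Q

def greedy_policy(Q):
    """Greedy action for each non-goal MDP state while the automaton is in q0."""
    return {(s, 0): max(ACTIONS, key=lambda b: Q[((s, 0), b)])
            for s in range(N_STATES - 1)}
```

Calling `greedy_policy(q_learning())` yields a policy over product states `(s, q)`. Because the automaton state is part of the learner's state, the same mechanism would let a policy depend on how much of the specification has already been satisfied, which is the point of the product construction.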

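Experimental results

Research questions

RQ1: Can reinforcement learning be guided effectively by a formal specification so that it synthesizes correct-by-construction policies for an MDP?
RQ2: How can the maximum probability of satisfying a linear time property be computed online during policy learning?
RQ3: Does a reward shaping strategy based on automaton accepting conditions guarantee convergence to a policy that satisfies the specification?
RQ4: By how much does the proposed method reduce the number of learning iterations compared with existing specification-guided RL approaches?

Key findings

By encoding the specification as an LDBA and integrating it with the MDP through the product construction, the proposed method effectively synthesized policies that satisfy the given linear time property.
The reward function derived from the LDBA accepting conditions was effective at guiding training toward satisfaction of the property.
The paper proves that the online value iteration procedure converges to the true maximum satisfaction probability at any state of the MDP.
In the numerical evaluation, the number of iterations required for policy synthesis improved by one order of magnitude compared with existing approaches.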
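
To make "maximum satisfaction probability" concrete, the sketch below computes it for a tiny, fully known model using ordinary offline value iteration. This is not the paper's online procedure: the three-state model, its transition probabilities, and the names `TRANSITIONS`, `ABSORBING_VALUE`, and `max_satisfaction_probability` are all hypothetical, and the property is assumed to reduce to reaching an absorbing accepting state.

```python
# Hypothetical product-MDP fragment:
# TRANSITIONS[state][action] = [(probability, next_state), ...]
TRANSITIONS = {
    "start": {
        "a": [(0.6, "accept"), (0.4, "trap")],
        "b": [(0.2, "accept"), (0.1, "trap"), (0.7, "start")],
    },
}
# Absorbing states: the property holds for sure in "accept", never in "trap".
ABSORBING_VALUE = {"accept": 1.0, "trap": 0.0}

def max_satisfaction_probability(transitions, absorbing_value,
                                 tol=1e-12, max_iters=10_000):
    """Value iteration for the maximum probability of reaching acceptance."""
    values = {s: 0.0 for s in transitions}  # start from 0 and iterate upward
    values.update(absorbing_value)
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in transitions.items():
            # Bellman update: best action by expected value of the successor.
            best = max(sum(p * values[t] for p, t in outcomes)
                       for outcomes in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    return values
```

In this toy model, action `a` succeeds with probability 0.6, while repeating action `b` succeeds with probability 0.2 / (0.2 + 0.1) = 2/3, so `max_satisfaction_probability(TRANSITIONS, ABSORBING_VALUE)["start"]` converges to about 0.6667. A learned value estimate can be compared against a reference of this kind when the model is small enough to solve exactly.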


This review was generated by AI and checked by a human editor.