QUICK REVIEW

[Paper Review] Logically-Constrained Reinforcement Learning

Mohammadhosein Hasanbeig, Alessandro Abate | arXiv (Cornell University) | Jan 24, 2018

Reinforcement Learning in Robotics | References: 42 | Cited by: 37

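One-Sentence Summary

This paper proposes Logically-Constrained Reinforcement Learning (LCRL), presented as the first model-free reinforcement learning algorithm that synthesizes policies for an unknown Markov decision process (MDP) so as to maximize the probability of satisfying a given linear temporal logic (LTL) property. The LTL formula is converted into a limit-deterministic Büchi automaton (LDBA), which is then used to shape the reward. This steers exploration toward the states relevant to the specification, giving faster convergence and better scalability than model-based methods; in the experiments, the number of iterations drops by roughly one order of magnitude relative to existing approaches.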

ABSTRACT

We present the first model-free Reinforcement Learning (RL) algorithm to synthesise policies for an unknown Markov Decision Process (MDP), such that a linear time property is satisfied. The given temporal property is converted into a Limit Deterministic Buchi Automaton (LDBA) and a robust reward function is defined over the state-action pairs of the MDP according to the resulting LDBA. With this reward function, the policy synthesis procedure is "constrained" by the given specification. These constraints guide the MDP exploration so as to minimize the solution time by only considering the portion of the MDP that is relevant to satisfaction of the LTL property. This improves performance and scalability of the proposed method by avoiding an exhaustive update over the whole state space while the efficiency of standard methods such as dynamic programming is hindered by excessive memory requirements, caused by the need to store a full-model in memory. Additionally, we show that the RL procedure sets up a local value iteration method to efficiently calculate the maximum probability of satisfying the given property, at any given state of the MDP. We prove that our algorithm is guaranteed to find a policy whose traces probabilistically satisfy the LTL property if such a policy exists, and additionally we show that our method produces reasonable control policies even when the LTL property cannot be satisfied. The performance of the algorithm is evaluated via a set of numerical examples. We observe an improvement of one order of magnitude in the number of iterations required for the synthesis compared to existing approaches.

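Motivation and Objectives

Address the challenge of synthesizing control policies for an MDP, in a model-free setting, that provably satisfy complex temporal logic specifications such as LTL.
Overcome the scalability limits of model-based methods such as dynamic programming, which must store the full state space and update all of it exhaustively.
Learn policies efficiently by focusing on the region of the state space that is relevant to satisfying the given LTL property.
Provide theoretical guarantees on the existence and quality of the policy, including the case where the LTL property cannot be satisfied.
Develop an online value iteration method that computes the maximum probability of satisfying the LTL property during learning, without a complete MDP model.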

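Proposed Method

Convert the given LTL formula into a limit-deterministic Büchi automaton (LDBA), which is a more compact and efficient representation than a deterministic Rabin automaton (DRA).
Build the synchronous product of the MDP and the LDBA on the fly, so that the joint state-action behavior can be tracked.
Define a robust reward function over the state-action pairs of the MDP, based on the acceptance condition of the LDBA, that rewards progress toward satisfying the LTL property.
Run model-free reinforcement learning (for example Q-learning) with this shaped reward to learn a policy that maximizes the probability of satisfying the LTL formula.
Implement an online value iteration procedure that considers only the relevant state transitions and computes, for each MDP state, the maximum probability of satisfying the LTL property.
Exploit the structure of the LDBA to simplify reward assignment and reduce computational overhead compared with DRA-based approaches.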
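
The steps above can be pictured with a small sketch. It is not the authors' implementation: the corridor MDP, the hand-written two-state automaton for the property "eventually reach goal", the reward of 1 on first entering the accepting state, and all names (`label`, `automaton_step`, `train`, `greedy_action`) and hyperparameters are assumptions made for illustration. The paper derives its reward from the LDBA acceptance condition in a more general way, and no LTL-to-LDBA translation tool is called here. The sketch only shows the overall shape: the product state `(s, q)` is built on the fly, and tabular Q-learning runs over it without reading the transition probabilities.

```python
import random

N_CELLS, GOAL, SLIP = 5, 4, 0.1   # hypothetical corridor MDP, cells 0..4
ACTIONS = (-1, +1)                # move left / move right
ACCEPTING = {1}                   # accepting automaton state (assumed)

def label(s):
    """Atomic proposition observed in MDP state s."""
    return "goal" if s == GOAL else "none"

def automaton_step(q, lab):
    """Hand-written automaton for 'eventually goal': q0 -goal-> q1, q1 absorbing."""
    return 1 if (q == 1 or lab == "goal") else 0

def env_step(s, a, rng):
    """Sampled transition; the learner never reads SLIP directly (model-free)."""
    move = a if rng.random() > SLIP else -a
    return min(max(s + move, 0), N_CELLS - 1)

def greedy_action(Q, s, q, rng=None):
    """Best known action in product state (s, q); ties broken randomly if rng is given."""
    tie = rng.random if rng else (lambda: 0.0)
    return max(ACTIONS, key=lambda b: (Q.get(((s, q), b), 0.0), tie()))

def train(episodes=3000, horizon=30, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {}  # keyed by ((s, q), a); product states appear only once visited
    for _ in range(episodes):
        s = 0
        q = automaton_step(0, label(s))
        for _ in range(horizon):
            a = rng.choice(ACTIONS) if rng.random() < eps else greedy_action(Q, s, q, rng)
            s2 = env_step(s, a, rng)
            q2 = automaton_step(q, label(s2))   # synchronous product step
            # Simplified reward: 1 when the automaton first becomes accepting.
            r = 1.0 if (q2 in ACCEPTING and q not in ACCEPTING) else 0.0
            best_next = max(Q.get(((s2, q2), b), 0.0) for b in ACTIONS)
            old = Q.get(((s, q), a), 0.0)
            Q[((s, q), a)] = old + alpha * (r + gamma * best_next - old)
            s, q = s2, q2
            if q in ACCEPTING:
                break   # property satisfied in this toy case; end the episode
    return Q

Q = train()
policy = {s: greedy_action(Q, s, 0) for s in range(GOAL)}
print(policy)
```

Because the Q-table is a dictionary filled lazily, only product states that the agent actually reaches take up memory, which mirrors the summary's point about avoiding updates over the whole state space.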

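Experimental Results

Research Questions

RQ1: Can model-free reinforcement learning be constrained effectively by temporal logic specifications such as LTL so as to guide policy synthesis?
RQ2: In the LTL-to-automaton conversion, does using an LDBA instead of a DRA significantly improve scalability and convergence speed?
RQ3: Does a reward function derived from the LDBA acceptance condition reliably guide the RL algorithm to a policy that maximizes the probability of satisfying the LTL property?
RQ4: Can the maximum satisfaction probability of the LTL property be computed by a value iteration method that avoids updates over the full state space?
RQ5: How does LCRL compare with existing model-based and model-free methods in convergence speed and scalability?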

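Key Findings

In the numerical experiments, LCRL needs roughly one order of magnitude fewer iterations for policy synthesis than existing approaches.
By focusing on the region of the state space relevant to the LTL property, LCRL converges markedly faster than classical reinforcement learning and model-based methods.
If a policy satisfying the LTL property exists, LCRL is guaranteed to find a policy that maximizes the probability of satisfying it.
Even when the LTL property cannot be fully satisfied, LCRL still produces reasonable, meaningful control policies with a nonzero probability of satisfaction.
Using an LDBA instead of a DRA yields a smaller product MDP (for instance, 75 states versus 150 states in one example), which lowers computational complexity.
The online value iteration method computes probabilities efficiently without storing the full MDP model, improving scalability to large systems.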
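
To make the last finding concrete, here is a minimal sketch of value iteration restricted to the transitions an agent has actually observed. It assumes, for illustration only, that transition frequencies are estimated from visit counts; the function name `max_satisfaction_probability`, the `counts` layout, and the state names are hypothetical and not taken from the paper. States with no observed outgoing transitions keep value 0, and accepting states are fixed at 1.

```python
def max_satisfaction_probability(counts, accepting, sweeps=100):
    """Estimate, per visited state, the max probability of reaching an accepting state.

    counts: {(state, action): {next_state: times_observed}} (assumed layout)
    accepting: set of states treated as satisfying the property
    """
    # Only states that appear in the observations are ever stored.
    states = {s for (s, _a) in counts} | {t for nxt in counts.values() for t in nxt}
    V = {s: (1.0 if s in accepting else 0.0) for s in states}

    # Group the observed successor counts by source state.
    by_state = {}
    for (s, _a), nxt in counts.items():
        by_state.setdefault(s, []).append(nxt)

    for _ in range(sweeps):
        for s, outcomes in by_state.items():
            if s in accepting:
                continue
            # Bellman update with empirical frequencies in place of a known model.
            V[s] = max(
                sum(n / sum(nxt.values()) * V[t] for t, n in nxt.items())
                for nxt in outcomes
            )
    return V

# Hypothetical observations: action "a" from s0 reached "acc" 8 times out of 10.
counts = {
    ("s0", "a"): {"acc": 8, "sink": 2},
    ("s0", "b"): {"s1": 10},
    ("s1", "a"): {"acc": 5, "sink": 5},
}
V = max_satisfaction_probability(counts, accepting={"acc"})
print(V["s0"], V["s1"])
```

In this toy data, going through `s1` is worse than taking action `a` directly from `s0`, so the estimate for `s0` comes from the direct action. The table grows only with the visited part of the product, which is the property the finding highlights.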

This review was generated by AI and checked by a human editor.