QUICK REVIEW

[Paper Review] Policy Iteration for Factored MDPs

Daphne Koller, Ronald Parr|arXiv (Cornell University)|Jan 16, 2013

Reinforcement Learning in Robotics | References: 10 | Cited by: 151

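One-Sentence Summary

This paper proposes a policy iteration algorithm for factored Markov decision processes (MDPs) that computes a closed-form least-squares approximation of the value function under arbitrary weights, which makes the approximation suitable for policy improvement. The method uses decomposed basis functions and error bounds based on variable elimination, yields compact policy representations, and scales to large MDPs with structured dynamics.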

ABSTRACT

Many large MDPs can be represented compactly using a dynamic Bayesian network. Although the structure of the value function does not retain the structure of the process, recent work has shown that value functions in factored MDPs can often be approximated well using a decomposed value function: a linear combination of restricted basis functions, each of which refers only to a small subset of variables. An approximate value function for a particular policy can be computed using approximate dynamic programming, but this approach (and others) can only produce an approximation relative to a distance metric which is weighted by the stationary distribution of the current policy. This type of weighted projection is ill-suited to policy improvement. We present a new approach to value determination, that uses a simple closed-form computation to directly compute a least-squares decomposed approximation to the value function for any weights. We then use this value determination algorithm as a subroutine in a policy iteration process. We show that, under reasonable restrictions, the policies induced by a factored value function are compactly represented, and can be manipulated efficiently in a policy iteration process. We also present a method for computing error bounds for decomposed value functions using a variable-elimination algorithm for function optimization. The complexity of all of our algorithms depends on the factorization of system dynamics and of the approximate value function.
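
The two objects the abstract describes, a decomposed value function and a weighted least-squares fit, can be written out as follows. The notation is an assumption made here for illustration (basis functions h_j, scopes C_j, coefficients w_j, state weights ρ) and is not taken from the paper.

```latex
% Decomposed value function: each basis function h_j reads only the
% variables in a small scope C_j of the full state x.
\hat{V}_w(x) = \sum_{j=1}^{k} w_j \, h_j\bigl(x[C_j]\bigr)

% Weighted least-squares fit to a target function V under nonnegative
% state weights \rho. Per the abstract, earlier methods are tied to
% \rho being the stationary distribution of the current policy, while
% the proposed value determination works for any weights.
w^{*} = \arg\min_{w} \sum_{x} \rho(x) \Bigl( V(x) - \hat{V}_w(x) \Bigr)^{2}
```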

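Motivation and Objectives

Address a limitation of weighted-projection methods in approximate dynamic programming: a projection weighted by the stationary distribution of the current policy is ill-suited to policy improvement in factored MDPs.
Enable efficient policy iteration in large MDPs by exploiting the factored structure of both the dynamics and the value-function representation.
Develop a value-function approximation method that does not depend on the stationary distribution of the current policy, so that it can be used directly for policy improvement.
Provide rigorous error bounds for decomposed value functions by applying variable elimination to function optimization.
Ensure that policies derived from factored value functions remain compactly representable and efficiently manipulable throughout policy iteration.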

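Proposed Method

A closed-form least-squares computation approximates the value function for any choice of weights, avoiding dependence on the stationary distribution of the current policy.
The value function is represented in decomposed form, as a linear combination of restricted basis functions, each of which depends only on a small subset of the state variables.
A variable-elimination algorithm computes error bounds for the approximate value function, giving a theoretical guarantee on approximation quality.
The value determination subroutine is embedded in a policy iteration framework, yielding iterative policy improvement with compact policy representations.
The MDP's transition and reward functions are represented in factored form to keep computation efficient.
Function optimization via variable elimination bounds the error between the true value function and its factored approximation.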
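
The closed-form value determination step can be illustrated on a toy problem. The sketch below is not the paper's algorithm verbatim. It solves the weighted projected fixed-point equation Aᵀ D (A − γ P A) w = Aᵀ D R, which is one standard way to obtain a closed-form least-squares fit for arbitrary state weights D, and it enumerates all four states explicitly instead of exploiting the factorization. The two-variable dynamics, the rewards, the basis functions, and every name in the code are hypothetical.

```python
from itertools import product

GAMMA = 0.9
STATES = list(product([0, 1], repeat=2))  # joint assignments to (x1, x2)

# Restricted basis functions: each reads at most one state variable.
BASIS = [
    lambda s: 1.0,          # constant term
    lambda s: float(s[0]),  # depends only on x1
    lambda s: float(s[1]),  # depends only on x2
]

# Hypothetical factored dynamics under a fixed policy: each variable
# evolves independently and keeps its value with probability STAY[i].
STAY = [0.8, 0.6]

def transition(s, t):
    """P(t | s) as a product of per-variable factors."""
    p = 1.0
    for i in range(2):
        p *= STAY[i] if s[i] == t[i] else 1.0 - STAY[i]
    return p

def reward(s):
    """Additive reward: one term per variable."""
    return 1.0 * s[0] + 2.0 * s[1]

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def value_determination(weights):
    """Solve A^T D (A - gamma P A) w = A^T D R for the coefficients w."""
    k, n = len(BASIS), len(STATES)
    A = [[h(s) for h in BASIS] for s in STATES]
    # PA[s][j] is the expected value of basis j at the next state.
    PA = [[sum(transition(s, t) * BASIS[j](t) for t in STATES)
           for j in range(k)] for s in STATES]
    M = [[sum(weights[m] * A[m][i] * (A[m][j] - GAMMA * PA[m][j])
              for m in range(n)) for j in range(k)] for i in range(k)]
    b = [sum(weights[m] * A[m][i] * reward(STATES[m]) for m in range(n))
         for i in range(k)]
    return solve(M, b)

def approx_value(w, s):
    """Evaluate the decomposed value function at state s."""
    return sum(wj * h(s) for wj, h in zip(w, BASIS))

w_uniform = value_determination([0.25, 0.25, 0.25, 0.25])
w_skewed = value_determination([0.7, 0.1, 0.1, 0.1])
```

Because the toy reward and dynamics are additive across variables, the exact value function lies in the span of the basis, so both weightings return the same coefficients. In general, different weights give different approximations, which is why the freedom to choose them matters.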

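Experimental Results

Research Questions

RQ1: Can a value-function approximation be computed in closed form for arbitrary weights, independently of the stationary distribution of the current policy, so as to support reliable policy improvement?
RQ2: How can factored MDPs be solved efficiently while keeping the policy and value-function representations compact?
RQ3: What is the computational complexity of value-function approximation and policy iteration in factored MDPs, and how strongly does it depend on the factorization structure?
RQ4: Can function optimization techniques be used to compute error bounds for factored value-function approximations efficiently?
RQ5: Does the policy remain structurally compact when approximate value functions are used in policy iteration?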

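Key Findings

The proposed method directly computes a value-function approximation for arbitrary weights, removing the dependence on the policy-specific stationary distribution.
Under reasonable restrictions, policy iteration with the new value determination method produces policies that are compactly representable and efficiently manipulable, even in large factored MDPs.
Applying variable elimination to function optimization yields rigorous error bounds on the approximate value function.
The computational complexity of all the algorithms depends on the factorization of the system dynamics and of the approximate value function, which makes large problems tractable when that structure is favorable.
The approach shows that factored value functions can be used effectively in policy iteration, overcoming limitations of conventional approximate dynamic programming methods.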
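The method is reported to converge to the best policy within the class of policies representable by the factored value-function basis.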
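
The role of variable elimination in the error bounds can be sketched as follows. Bounding an error over all states means maximizing a function over exponentially many joint assignments. When that function is a sum of terms with small scopes, the maximum can be found one variable at a time. The routine below is a generic max-sum variable elimination under assumed names and a made-up chain of factors. It is not the paper's bound computation, only the optimization primitive such a bound could rest on.

```python
from itertools import product

def max_sum_variable_elimination(factors, order, domain=(0, 1)):
    """Maximize a sum of local functions over all joint assignments.

    factors: list of (scope, fn), where scope is a tuple of variable
             names and fn maps a tuple of values (in scope order) to a float.
    order:   elimination order; it must cover every variable.
    """
    # Tabulate each factor over its own small scope only.
    tables = [(scope, {v: fn(v) for v in product(domain, repeat=len(scope))})
              for scope, fn in factors]
    for var in order:
        touching = [f for f in tables if var in f[0]]
        rest = [f for f in tables if var not in f[0]]
        # The new factor ranges over the neighbors of var.
        new_scope = tuple(sorted(
            {v for scope, _ in touching for v in scope} - {var}))
        new_table = {}
        for vals in product(domain, repeat=len(new_scope)):
            ctx = dict(zip(new_scope, vals))
            best = float("-inf")
            for x in domain:
                ctx[var] = x
                total = sum(t[tuple(ctx[v] for v in scope)]
                            for scope, t in touching)
                best = max(best, total)
            new_table[vals] = best  # var has been maximized out
        tables = rest + [(new_scope, new_table)]
    # Every remaining table has an empty scope and a single entry.
    return sum(t[()] for _, t in tables)

# Hypothetical chain-structured objective over four binary variables.
FACTORS = [
    (("a", "b"), lambda v: 1.5 if v[0] != v[1] else 0.0),
    (("b", "c"), lambda v: 2.0 * v[0] - 1.0 * v[1]),
    (("c", "d"), lambda v: 0.5 if v == (1, 1) else -0.25 * v[1]),
]
result = max_sum_variable_elimination(FACTORS, order=["a", "b", "c", "d"])
```

The cost of each elimination step is exponential only in the size of the intermediate scope it creates, not in the total number of variables, which matches the finding that complexity depends on the factorization. To bound an absolute error, the same routine can be run on the function and on its negation.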

This review was generated by AI and reviewed by human editors.