QUICK REVIEW

[Paper Review] Recurrent Predictive State Policy Networks

Ahmed Hefny, Zita Marinho|arXiv (Cornell University)|Mar 5, 2018

Reinforcement Learning in Robotics | Cited by 1

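ONE-SENTENCE SUMMARY

This paper proposes Recurrent Predictive State Policy (RPSP) networks, a differentiable recurrent architecture that uses predictive state representations (PSRs) to model the belief state in partially observable environments; by combining a recursive PSR filter with a reactive policy, trained with reward-based policy gradient and prediction-error minimization, RPSP outperforms GRUs and finite-memory models on OpenAI Gym robotic control tasks.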

ABSTRACT

We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions, to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz and Gordon, 2004; Sun et al., 2016) by modeling predictive state: a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2018) to initialize the recursive filter. Predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behaviour. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks under partial observability on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks perform well compared with memory-preserving networks such as GRUs, as well as finite memory models, being the overall best performing method.

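MOTIVATION AND OBJECTIVES

Address the challenge of reinforcement learning in partially observable environments, where a conventional belief state is hard to maintain.
Develop a differentiable recurrent architecture that uses predictive state representations (PSRs) for efficient, statistically consistent belief tracking.
Use the PSR as a sufficient belief representation so that the policy can be purely reactive, which simplifies training while preserving optimality.
Improve training stability and performance by adding prediction error to the loss function alongside reward-based policy gradient.
Empirically evaluate RPSP against memory-preserving networks (such as GRUs) and finite-memory models on partially observable robotic control tasks.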
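
As a reading aid, the objects named in these objectives can be written out symbolically. The notation below is an assumption made for this review and is not quoted from the paper: h_t is the history up to time t, o_{t:t+k-1} and a_{t:t+k-1} are the next k observations and actions, \psi is a feature map of future observations, f is the recursive filter, \pi is the reactive policy, J is the expected cumulative reward, \hat{o}_t is the model's predicted observation, \ell_pred is a prediction-error term, and \alpha_1, \alpha_2 are weighting coefficients.

```latex
\[
\begin{aligned}
q_t &= \mathbb{E}\big[\psi(o_{t:t+k-1}) \,\big|\, h_t,\ \mathrm{do}(a_{t:t+k-1})\big] && \text{(predictive state)}\\
q_{t+1} &= f(q_t, a_t, o_t) && \text{(recursive filter)}\\
a_t &\sim \pi(\cdot \mid q_t) && \text{(reactive policy)}\\
\mathcal{L}(\theta) &= -\alpha_1\, J(\pi_\theta) + \alpha_2 \sum_{t} \ell_{\mathrm{pred}}(\hat{o}_t, o_t) && \text{(combined objective)}
\end{aligned}
\]
```

Read this way, the policy never sees the raw history: everything it needs is assumed to be summarized in q_t, which is why it can be reactive.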

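PROPOSED METHOD

The RPSP network uses a recursive filter grounded in PSR theory; it maintains a predictive state, that is, a prediction of the distribution of future observations conditioned on history and future actions.
The recursive filter is initialized with statistically consistent algorithms from prior work (Hefny et al., 2018), which gives belief estimation a principled starting point.
The policy component is purely reactive: it maps the predictive state directly to actions, which simplifies training and keeps the network end-to-end differentiable.
The network is trained with a combined loss: policy gradient on cumulative reward (Williams, 1992) together with gradient descent on prediction error, which improves belief accuracy.
The whole architecture is differentiable, so the recursive filter and the policy can be optimized jointly by backpropagation.
The PSR interpretation is used both in the belief representation and during training, which is intended to improve generalization and consistency.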
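
The pieces listed above (recursive filter, reactive policy, combined loss) can be illustrated with a minimal sketch in plain Python. Everything specific here is an assumption made for illustration and is not the authors' implementation: the tanh-linear filter update, the Gaussian policy, the toy 1-D environment with hidden velocity, the random initialization (the paper uses a statistically consistent one), and the names `filter_step`, `combined_loss`, `w_policy`, and `w_pred`. The sketch only evaluates the loss; an actual training loop would differentiate it with an autodiff library.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def filter_step(q, action, obs, p):
    """Recursive filter: next belief from current belief, action, observation.

    ASSUMPTION: a tanh-linear update stands in for the paper's PSR filter.
    """
    pre = zip(matvec(p["W_q"], q), matvec(p["W_a"], action), matvec(p["W_o"], obs))
    return [math.tanh(a + b + c) for a, b, c in pre]


def gaussian_log_prob(action, mean, sigma):
    """Log-density of a diagonal Gaussian policy at the sampled action."""
    return sum(
        -0.5 * ((a - m) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(action, mean)
    )


def combined_loss(log_probs, rewards, pred_errors, w_policy=1.0, w_pred=0.5):
    """REINFORCE surrogate plus weighted mean squared prediction error.

    ASSUMPTION: the weights w_policy and w_pred are illustrative values.
    """
    policy_term = -sum(rewards) * sum(log_probs)
    pred_term = sum(pred_errors) / len(pred_errors)
    return w_policy * policy_term + w_pred * pred_term


def init_params(dim_q=3, seed=0):
    """Small random weights (a stand-in for the consistent initialization)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W_q": mat(dim_q, dim_q), "W_a": mat(dim_q, 1), "W_o": mat(dim_q, 1),
            "W_pi": mat(1, dim_q), "W_pred": mat(1, dim_q + 1)}


def rollout(p, steps=20, sigma=0.3, seed=1):
    """Run one episode in a toy 1-D environment where velocity is hidden."""
    rng = random.Random(seed)
    pos, vel = 1.0, 0.0
    q = [0.0] * len(p["W_q"])
    log_probs, rewards, pred_errors = [], [], []
    for _ in range(steps):
        mean = matvec(p["W_pi"], q)  # reactive: depends on the belief q only
        action = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        log_probs.append(gaussian_log_prob(action, mean, sigma))
        predicted = matvec(p["W_pred"], q + action)  # predict the next observation
        vel += 0.1 * action[0]
        pos += 0.1 * vel
        obs = [pos]  # partial observation: position only, velocity stays hidden
        pred_errors.append(sum((a - b) ** 2 for a, b in zip(predicted, obs)))
        rewards.append(-pos ** 2)
        q = filter_step(q, action, obs, p)
    return q, log_probs, rewards, pred_errors


params = init_params()
q_final, log_probs, rewards, pred_errors = rollout(params)
loss = combined_loss(log_probs, rewards, pred_errors)
```

The design point the sketch tries to show is that the policy reads only `q`, never the observation history, while the same `q` is also asked to predict the next observation; both terms therefore push gradients through the filter weights.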

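EXPERIMENTAL RESULTS

Research Questions

RQ1: Can predictive state representations be used effectively to build differentiable, recurrent belief models for reinforcement learning under partial observability?
RQ2: Can a reactive policy based on predictive state match or exceed the performance of memory-augmented models such as GRUs?
RQ3: To what extent does including prediction error in the training objective improve policy learning and belief accuracy?
RQ4: How does RPSP compare with finite-memory models and GRU-based agents on robotic control tasks with partial observability?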

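Key Findings

Under partial observability, RPSP networks outperform GRU-based memory networks and finite-memory models on a set of OpenAI Gym robotic control tasks.
Adding prediction error to the loss function improves the accuracy of the predictive state representation, which in turn improves policy learning.
Because the predictive state is an equivalent representation of the belief state, the reactive policy component can achieve optimal behaviour without explicit memory of its own, which simplifies both training and architecture design.
The differentiable architecture supports joint optimization of belief tracking and policy, yielding stable, high-performing policies.
Empirically, RPSP is the best-performing method overall among the baselines tested, most notably in long-horizon, partially observable settings.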
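
For readers unfamiliar with the finite-memory baseline mentioned in these findings, one common construction is to feed the policy a fixed window of the most recent observations. The sketch below is an assumption about what such a baseline can look like, not a description of the paper's setup: the class name `FiniteMemoryWindow`, the window length `k`, and the zero-padding at episode start are all hypothetical choices.

```python
from collections import deque


class FiniteMemoryWindow:
    """Concatenate the last k observations into one fixed-size feature vector."""

    def __init__(self, k, obs_dim):
        self.k = k
        self.obs_dim = obs_dim
        self.reset()

    def reset(self):
        """Start an episode with a zero-padded window (an assumed convention)."""
        self.buffer = deque(
            [[0.0] * self.obs_dim for _ in range(self.k)], maxlen=self.k
        )

    def push(self, obs):
        """Add the newest observation, drop the oldest, and return the features."""
        if len(obs) != self.obs_dim:
            raise ValueError("observation has the wrong dimension")
        self.buffer.append(list(obs))
        return [x for frame in self.buffer for x in frame]


window = FiniteMemoryWindow(k=3, obs_dim=2)
first = window.push([1.0, 2.0])  # [0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
for obs in ([3.0, 4.0], [5.0, 6.0], [7.0, 8.0]):
    latest = window.push(obs)  # ends as [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

Unlike a recursive filter, this window forgets anything older than k steps, which is the limitation that a learned belief state is meant to avoid.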

This review was generated by AI and reviewed by human editors.