QUICK REVIEW

[Paper Review] DeepMDP: Learning Continuous Latent Space Models for Representation Learning

Carles Gelada, Saurabh Kumar|arXiv (Cornell University)|Jun 6, 2019

Reinforcement Learning in Robotics | References: 58 | Cited by: 67

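ONE-SENTENCE SUMMARY

DeepMDP learns a continuous latent space model of an MDP by minimizing two losses, reward prediction and next-latent-state prediction; it comes with theoretical guarantees and improves performance when used as an auxiliary task for RL.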

ABSTRACT

Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable losses: prediction of rewards and prediction of the distribution over next latent states. We show that the optimization of these objectives guarantees (1) the quality of the latent space as a representation of the state space and (2) the quality of the DeepMDP as a model of the environment. We connect these results to prior work in the bisimulation literature, and explore the use of a variety of metrics. Our theoretical findings are substantiated by the experimental result that a trained DeepMDP recovers the latent structure underlying high-dimensional observations on a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements over model-free RL.

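MOTIVATION AND OBJECTIVES

Motivate representation learning for RL by reducing high-dimensional observations to informative continuous latent states.
Propose DeepMDP, a latent space model trained with tractable losses on rewards and on the distribution over next latent states.
Provide theoretical guarantees that tie latent space learning to the quality of the representation and of the model.
Connect DeepMDP to bisimulation and study different probability metrics for the latent transition loss.
Demonstrate the practical value of DeepMDP as an auxiliary task for improving model-free reinforcement learning.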

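PROPOSED METHOD

Defines a DeepMDP as a latent space model with an embedding φ: S → S_bar.
Trains it by minimizing two losses: L_R = |R(s,a) - R_bar(φ(s),a)| and L_P = D(φP(·|s,a), P_bar(·|φ(s),a)), where D is a metric between distributions over latent states.
Uses the Wasserstein metric (and other MMD-based metrics) for the latent transition loss to obtain theoretical guarantees.
Derives global and local bounds on value differences and representation quality in terms of L_R, L_P, and Lipschitz constants.
Establishes the connection between DeepMDPs, the Wasserstein metric, and bisimulation metrics.
Generalizes the guarantees to Norm-MMD metrics and discusses the implications for policy learning with deep networks.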
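
The two losses above can be sketched for the simplest case, where both the environment and the latent model are deterministic. In that case the Wasserstein distance between two point masses is just the distance between the two points, so L_P reduces to a norm between the embedded next state and the predicted next latent state. Everything in this sketch is an illustrative assumption rather than the paper's setup: the function name `deepmdp_losses`, the linear toy environment, the matrix `A`, and the pseudo-inverse embedding are all hypothetical, and the sample mean is used as a stand-in for a loss averaged over visited state-action pairs.

```python
import numpy as np

def deepmdp_losses(phi, r_bar, p_bar, s, a, r, s_next):
    """Per-sample DeepMDP losses, assuming deterministic transitions.

    phi, r_bar, p_bar are hypothetical callables standing in for the
    embedding, the latent reward model, and the latent transition model.
    """
    z, z_next = phi(s), phi(s_next)
    # L_R = |R(s,a) - R_bar(phi(s),a)|
    loss_r = np.abs(r - r_bar(z, a))
    # For two point masses, the Wasserstein distance is the distance
    # between the points: ||phi(s') - P_bar(phi(s),a)||.
    loss_p = np.linalg.norm(z_next - p_bar(z, a), axis=-1)
    return loss_r, loss_p

# Toy environment (assumption): a 2-D latent state x is observed through a
# fixed linear map A as a 4-D observation s = A x.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
MOVES = np.array([[0.1, 0.0], [0.0, 0.1]])  # displacement caused by each action

x = rng.normal(size=(8, 2))
a = rng.integers(0, 2, size=8)
x_next = x + MOVES[a]
s, s_next = x @ A.T, x_next @ A.T
r = x[:, 0]  # reward is the first latent coordinate

# A "perfect" DeepMDP for this toy: the embedding inverts A, and the latent
# reward and transition models copy the true latent dynamics.
A_pinv = np.linalg.pinv(A)
phi = lambda obs: obs @ A_pinv.T
r_bar = lambda z, act: z[:, 0]
p_bar = lambda z, act: z + MOVES[act]

loss_r, loss_p = deepmdp_losses(phi, r_bar, p_bar, s, a, r, s_next)
local_loss_r, local_loss_p = loss_r.mean(), loss_p.mean()

# A deliberately wrong latent transition model that ignores the action.
p_bad = lambda z, act: z
_, bad_loss_p = deepmdp_losses(phi, r_bar, p_bad, s, a, r, s_next)
```

With the exact embedding both averaged losses are zero up to floating-point error, while the model that ignores the action incurs a transition loss equal to the step size. In a learned DeepMDP, φ, R_bar, and P_bar would be neural networks and these losses would be minimized by gradient descent.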

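EXPERIMENTAL RESULTS

Research questions

RQ1: Can a parameterized latent space model trained on reward and transition prediction provide both a good representation of the state space and a good model of the environment?
RQ2: How does the choice of probability metric (Wasserstein in particular) affect the guarantees and the relationship to bisimulation?
RQ3: Can the DeepMDP representation recover the latent structure underlying high-dimensional observations?
RQ4: Does learning a DeepMDP as an auxiliary task improve the performance of model-free RL, for example on Atari 2600 games?
RQ5: What local (data-efficient) guarantees hold when DeepMDPs are learned from data covering only part of the state space?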

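Key findings

DeepMDP comes with bounds showing that accurate latent predictions yield accurate value functions in the original MDP.
If the global losses L_R and L_P are zero, the embedding φ preserves value relationships: the remaining value difference between two states is controlled only by a Lipschitz term in their latent distance.
A theoretical connection is established between the Wasserstein-based DeepMDP losses and bisimulation metrics.
Local DeepMDP losses provide guarantees when only part of the state-action space is covered by data.
Empirically, DeepMDP recovers the latent structure from high-dimensional observations in a synthetic environment.
On Atari 2600, learning a DeepMDP as an auxiliary task yields large performance improvements over model-free baselines.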
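
The first two findings can be checked numerically on a small tabular example. The sketch below is a simplification and not the paper's experiment: it uses a discrete 4-state MDP whose states collapse into 2 latent states, computes optimal Q-values by value iteration rather than the values of a Lipschitz policy, and all names and numbers (`q_iteration`, `phi`, `w`, the reward and transition tables) are made up for illustration. When both losses are zero, the Q-values of the original MDP match those of the latent MDP exactly.

```python
import numpy as np

def q_iteration(R, P, gamma=0.9, iters=500):
    """Optimal Q-values by value iteration. R: [S, A], P: [S, A, S]."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * (P @ V)
    return Q

# Latent MDP (assumption): 2 latent states, 2 actions.
R_bar = np.array([[0.0, 1.0], [1.0, 0.0]])
P_bar = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])

# Original MDP: 4 states; phi maps states {0, 1} -> 0 and {2, 3} -> 1.
phi = np.array([0, 0, 1, 1])
w = np.array([0.3, 0.7, 0.5, 1.0])  # how each state splits mass inside a block

R = R_bar[phi]  # rewards depend on the state only through phi
P = np.zeros((4, 2, 4))
for s_idx in range(4):
    for z_next in range(2):
        mass = P_bar[phi[s_idx], :, z_next]
        P[s_idx, :, 2 * z_next] = w[s_idx] * mass
        P[s_idx, :, 2 * z_next + 1] = (1.0 - w[s_idx]) * mass

# Push the next-state distribution through phi: (phi P)(z' | s, a).
M = np.eye(2)[phi]   # one-hot encoding of phi, shape [4, 2]
phi_P = P @ M        # shape [4, 2, 2]

loss_r = np.abs(R - R_bar[phi])
# On a two-point latent space with unit distance, the Wasserstein distance
# between p and q equals |p[0] - q[0]|.
loss_p = np.abs(phi_P[..., 0] - P_bar[phi][..., 0])

Q = q_iteration(R, P)
Q_bar = q_iteration(R_bar, P_bar)
max_value_gap = np.abs(Q - Q_bar[phi]).max()
```

Here `loss_r` and `loss_p` are zero for every state-action pair and `max_value_gap` is zero up to floating-point error, even though states in the same block have different transition rows in the original MDP. Perturbing `R_bar` or `P_bar` makes the losses positive and opens a value gap, which is the quantity the paper's bounds control.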


This review was generated by AI and checked by a human editor.