QUICK REVIEW

[Paper Review] Neural Predictive Belief Representations

Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar|arXiv (Cornell University)|Nov 15, 2018

Domain Adaptation and Few-Shot Learning | References: 42 | Cited by: 47

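One-Sentence Summary

The paper studies unsupervised neural methods (one-step frame prediction, CPC, and CPC|Action) for learning belief-state representations in partially observable environments. The results show that these representations encode both the state and its uncertainty, and that multi-step, action-conditioned CPC performs best in visually complex environments.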

ABSTRACT

Unsupervised representation learning has succeeded with excellent results in many applications. It is an especially powerful tool to learn a good representation of environments with partial or noisy observations. In partially observable domains it is important for the representation to encode a belief state, a sufficient statistic of the observations seen so far. In this paper, we investigate whether it is possible to learn such a belief representation using modern neural architectures. Specifically, we focus on one-step frame prediction and two variants of contrastive predictive coding (CPC) as the objective functions to learn the representations. To evaluate these learned representations, we test how well they can predict various pieces of information about the underlying state of the environment, e.g., position of the agent in a 3D maze. We show that all three methods are able to learn belief representations of the environment, they encode not only the state information, but also its uncertainty, a crucial aspect of belief states. We also find that for CPC multi-step predictions and action-conditioning are critical for accurate belief representations in visually complex environments. The ability of neural representations to capture the belief information has the potential to spur new advances for learning and planning in partially observable domains, where leveraging uncertainty is essential for optimal decision making.

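Motivation and Objectives

Motivation: learn belief-state representations that summarize past observations and actions in partially observable environments.
Assess whether unsupervised methods can recover the true state and its uncertainty from observations.
Compare one-step frame prediction, CPC, and CPC|Action on DeepMind Lab tasks.
Evaluate how well the learned representations encode the agent's position, trajectory, and object positions under varying visual complexity.
Examine how prediction horizon and action conditioning affect representation quality in visually complex environments.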

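Proposed Method

Three representation-learning objectives are used: one-step frame prediction (FP), contrastive predictive coding (CPC), and CPC|Action (action-conditioned CPC).
A GRU-based history encoder produces the belief state b_t from past observations and actions.
For CPC and CPC|Action, a CPC classifier predicts the future observation o_{t+k} from b_t, with positive and negative samples drawn from the same batch.
For FP, a transposed-convolution decoder predicts the next observation o_{t+1} from b_t.
The learned belief representations are evaluated by training auxiliary predictors to recover ground-truth state information (e.g., agent position/orientation, past trajectory, object positions), without backpropagating through the representation.
The architecture comprises a CNN that embeds observations into z_t, a belief GRU that produces b_t, an action GRU that processes future actions for CPC|Action, and an MLP that predicts the ground-truth quantities.
Algorithm 1 describes CPC|Action training: sample sub-trajectories, compute beliefs, unroll future actions, compute the CPC loss using the positive future observation and one negative sample, average the loss, and update.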
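
The contrastive step in the method description can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the scorer is a plain dot product (the paper uses a learned CPC classifier), the belief vector is assumed to be already action-conditioned, and all function names are hypothetical. It shows the binary loss for one positive and one negative future embedding, averaged over several pairs.

```python
import math


def sigmoid(x):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def score(belief, embedding):
    # Assumption: a dot product stands in for the learned CPC classifier.
    return sum(b * z for b, z in zip(belief, embedding))


def cpc_binary_loss(belief, z_pos, z_neg):
    """Cross-entropy for one positive and one negative future embedding."""
    p_pos = sigmoid(score(belief, z_pos))  # should be pushed towards 1
    p_neg = sigmoid(score(belief, z_neg))  # should be pushed towards 0
    return -math.log(p_pos) - math.log(1.0 - p_neg)


def mean_cpc_loss(belief, positives, negatives):
    """Average the pairwise loss over several (positive, negative) pairs."""
    losses = [cpc_binary_loss(belief, zp, zn)
              for zp, zn in zip(positives, negatives)]
    return sum(losses) / len(losses)


# Toy vectors (illustrative values, not from the paper).
belief = [1.0, 0.0]
z_future = [2.0, 0.0]   # embedding of the true future observation
z_other = [-2.0, 0.0]   # embedding drawn from another trajectory

aligned = cpc_binary_loss(belief, z_future, z_other)
swapped = cpc_binary_loss(belief, z_other, z_future)
averaged = mean_cpc_loss(belief, [z_future, z_future], [z_other, z_other])
```

In this toy case the loss is low when the belief scores the true future embedding above the negative one, and high when the roles are swapped, which is the signal that would drive the encoder update.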

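Experimental Results

Research Questions

RQ1: Can unsupervised methods learn belief-state representations that encode ground-truth state information and uncertainty in partially observable environments?
RQ2: How do the prediction horizon (1 step vs. 30 steps) and action conditioning affect the quality of the learned belief representations in visually rich domains?
RQ3: Do the learned representations capture the agent's position, past trajectory, and object positions, and how well do they handle uncertainty?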
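
These questions are answered by probing frozen belief vectors: a separate predictor is trained to recover ground-truth quantities while the representation itself receives no gradient. The sketch below illustrates that idea with a linear probe in place of the MLP mentioned in the method description. The data is synthetic, and every name and hyperparameter is an assumption made for illustration.

```python
import random


def predict(belief, w, c):
    """Linear read-out of a scalar target from a belief vector."""
    return sum(wi * bi for wi, bi in zip(w, belief)) + c


def train_linear_probe(beliefs, targets, lr=0.1, epochs=500):
    """Fit the probe by full-batch gradient descent on the MSE.

    Only the probe parameters (w, c) are updated; the belief vectors are
    read-only inputs, mirroring 'no backpropagation through the representation'.
    """
    dim = len(beliefs[0])
    n = len(beliefs)
    w = [0.0] * dim
    c = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_c = 0.0
        for b, y in zip(beliefs, targets):
            err = predict(b, w, c) - y
            for i in range(dim):
                grad_w[i] += 2.0 * err * b[i] / n
            grad_c += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        c -= lr * grad_c
    return w, c


def mse(beliefs, targets, w, c):
    """Mean squared error of the probe on the given data."""
    return sum((predict(b, w, c) - y) ** 2
               for b, y in zip(beliefs, targets)) / len(beliefs)


# Synthetic stand-ins: 64 frozen 4-d beliefs and a target that is a
# linear function of them (e.g. one coordinate of the agent position).
random.seed(0)
beliefs = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(64)]
targets = [0.5 * b[0] - 0.25 * b[2] + 0.1 for b in beliefs]
frozen_copy = [list(b) for b in beliefs]

w, c = train_linear_probe(beliefs, targets)
probe_error = mse(beliefs, targets, w, c)
```

A low probe error indicates that the target is recoverable from the belief, and the beliefs are unchanged after training because only `w` and `c` are ever updated.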

Key Findings

Env	Algorithm	(x,y,θ)	Past (x,y,θ)	Objects (x,y)
fixed	FP	0.118±0.015	0.121±0.007	0.043±0.006
fixed	CPC 1	0.579±0.067	0.132±0.010	0.049±0.005
fixed	CPC 30	0.562±0.204	0.118±0.010	0.045±0.004
fixed	CPC|Action 1	0.689±0.057	0.137±0.006	0.049±0.004
fixed	CPC|Action 30	0.240±0.030	0.100±0.007	0.040±0.003
room	FP	0.517±0.123	0.285±0.017	0.484±0.005
room	CPC 1	2.010±0.142	0.311±0.017	0.498±0.008
room	CPC 30	0.482±0.157	0.257±0.022	0.481±0.005
room	CPC|Action 1	2.274±0.117	0.308±0.018	0.484±0.005
room	CPC|Action 30	0.689±0.066	0.276±0.029	0.484±0.008
maze	FP	0.178±0.207	0.233±0.029	0.322±0.008
maze	CPC 1	0.622±0.158	0.278±0.055	0.330±0.009
maze	CPC 30	0.244±0.058	0.213±0.031	0.325±0.015
maze	CPC|Action 1	0.638±0.094	0.264±0.028	0.323±0.010
maze	CPC|Action 30	0.182±0.034	0.206±0.029	0.323±0.010
terrain	FP	1.831±0.162	0.405±0.077	0.181±0.084
terrain	CPC 1	3.393±0.252	0.417±0.074	0.307±0.174
terrain	CPC 30	2.280±0.853	0.340±0.104	0.131±0.185
terrain	CPC|Action 1	3.348±0.482	0.414±0.042	0.312±0.049
terrain	CPC|Action 30	1.589±0.358	0.344±0.065	0.139±0.136

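All three methods (FP, CPC with 1-step and 30-step prediction, and CPC|Action) learn belief representations that encode the agent's position and orientation as well as its past trajectory.
The representations also encode uncertainty about the state and about objects; this uncertainty decreases as the agent gathers information from observations and actions.
In visually simpler environments, FP often encodes position/orientation best, whereas in the visually complex terrain environment the multi-step CPC methods (especially CPC|Action 30) perform best and are more computationally efficient than FP.
CPC-based methods capture the distribution over future observations better than FP, and CPC|Action improves further by conditioning on future actions.
Object positions are captured more reliably when objects significantly change future observations (e.g., teleportation interactions); otherwise objects are harder to encode, which suggests a reliance on map-specific cues or episodic memory.
Compared with single-step prediction, longer-horizon prediction (30 steps) combined with action conditioning (CPC|Action) markedly improves belief quality in terrain-like environments.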
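
The per-environment ranking behind these findings can be checked against the (x,y,θ) column of the table. The snippet below is a small sketch that assumes the reported values are prediction errors, so lower is better. It compares means only and ignores the ± spread, which matters where the means are close (for example FP and CPC|Action 30 on maze).

```python
# Mean values of the (x,y,θ) column, copied from the table above.
# Assumption: these are prediction errors, so the smallest value wins.
POSITION_ERROR = {
    "fixed": {"FP": 0.118, "CPC 1": 0.579, "CPC 30": 0.562,
              "CPC|Action 1": 0.689, "CPC|Action 30": 0.240},
    "room": {"FP": 0.517, "CPC 1": 2.010, "CPC 30": 0.482,
             "CPC|Action 1": 2.274, "CPC|Action 30": 0.689},
    "maze": {"FP": 0.178, "CPC 1": 0.622, "CPC 30": 0.244,
             "CPC|Action 1": 0.638, "CPC|Action 30": 0.182},
    "terrain": {"FP": 1.831, "CPC 1": 3.393, "CPC 30": 2.280,
                "CPC|Action 1": 3.348, "CPC|Action 30": 1.589},
}


def best_algorithm(env):
    """Return the algorithm with the lowest mean error in one environment."""
    errors = POSITION_ERROR[env]
    return min(errors, key=errors.get)


best = {env: best_algorithm(env) for env in POSITION_ERROR}

for env, algo in best.items():
    print(f"{env}: {algo} ({POSITION_ERROR[env][algo]:.3f})")
```

Under that assumption, FP has the lowest mean on fixed and maze, CPC 30 on room, and CPC|Action 30 on terrain, which matches the contrast drawn between simpler and visually complex environments.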

This review was generated by AI and reviewed by a human editor.