QUICK REVIEW

[Paper Review] Universal Successor Features Approximators

Diana Borsa, André Barreto|arXiv (Cornell University)|Dec 18, 2018

Reinforcement Learning in Robotics | References: 26 | Cited by: 24

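One-Sentence Summary

This paper proposes universal successor features approximators (USFAs), a framework that combines universal value function approximators (UVFAs) with successor features (SFs) and generalised policy improvement (GPI) to achieve zero-shot generalisation to unseen tasks in reinforcement learning. By jointly exploiting structure in the value function, the environment dynamics, and the policy space, USFAs deliver strong transfer performance and instant policy evaluation in a complex 3D navigation environment.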

ABSTRACT

The ability of a reinforcement learning (RL) agent to learn about many reward functions at the same time has many potential benefits, such as the decomposition of complex tasks into simpler ones, the exchange of information between tasks, and the reuse of skills. We focus on one aspect in particular, namely the ability to generalise to unseen tasks. Parametric generalisation relies on the interpolation power of a function approximator that is given the task description as input; one of its most common forms is universal value function approximators (UVFAs). Another way to generalise to new tasks is to exploit structure in the RL problem itself. Generalised policy improvement (GPI) combines solutions of previous tasks into a policy for the unseen task; this relies on instantaneous policy evaluation of old policies under the new reward function, which is made possible through successor features (SFs). Our proposed universal successor features approximators (USFAs) combine the advantages of all of these, namely the scalability of UVFAs, the instant inference of SFs, and the strong generalisation of GPI. We discuss the challenges involved in training a USFA and its generalisation properties, and demonstrate its practical benefits and transfer abilities on a large-scale domain in which the agent has to navigate in a first-person perspective three-dimensional environment.
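
For orientation, the quantities named in the abstract can be written out as below. This is a sketch in the notation commonly used in the successor-features literature; the symbols (reward features \phi, task vector w, discount \gamma, policy embedding z, candidate set \mathcal{C}) are assumptions introduced here and are not defined in this review.

```latex
% Assumed notation, not taken from this review:
% \phi = reward features, w = task vector, \gamma = discount factor,
% z = policy embedding, \mathcal{C} = set of candidate policy embeddings.
\begin{align}
  r_w(s,a,s') &= \phi(s,a,s')^\top w
    && \text{(reward is linear in the features)} \\
  \psi^{\pi}(s,a) &= \mathbb{E}^{\pi}\Big[\sum_{i=0}^{\infty} \gamma^{i}\,
      \phi(S_{t+i},A_{t+i},S_{t+i+1}) \,\Big|\, S_t=s,\ A_t=a\Big]
    && \text{(successor features of } \pi\text{)} \\
  Q^{\pi}_{w}(s,a) &= \psi^{\pi}(s,a)^\top w
    && \text{(instant evaluation of } \pi \text{ on task } w\text{)} \\
  \tilde{\psi}(s,a,z) &\approx \psi^{\pi_z}(s,a)
    && \text{(USFA: one approximator for many policies)} \\
  \pi_{\mathrm{GPI}}(s) &\in \arg\max_{a}\ \max_{z\in\mathcal{C}}\
      \tilde{\psi}(s,a,z)^\top w
    && \text{(GPI over the candidate set)}
\end{align}
```

The third line is what makes evaluation "instant": once the successor features of a policy are known, its value under any new task vector is a single inner product.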

Research Motivation and Objectives

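Address the challenge of zero-shot generalisation in multi-task reinforcement learning by integrating structural inductive biases from the value function, the environment dynamics, and the policy space.
Overcome the limitations of existing methods by unifying UVFAs (parametric generalisation in value-function space) and SF & GPI (generalisation through dynamic programming) in a single scalable architecture.
Enable efficient transfer across a large number of tasks by decoupling the policy representation from the task representation, while retaining the instant policy evaluation (via SFs) that GPI relies on.
Demonstrate the practical benefits of USFAs in a large-scale, first-person 3D navigation domain with visual observations.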

Proposed Method

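Introduces universal successor features approximators (USFAs) as a generalisation of UVFAs in which successor features are additionally conditioned on a task descriptor, giving a multi-dimensional analogue of value function approximation.
Parameterises, with a neural network, a function that maps state-action-next-state transitions together with a task descriptor to successor features, which gives parametric generalisation across tasks.
Applies generalised policy improvement (GPI) to combine the evaluations of multiple policies, using their USFA-estimated successor features, so that a policy for an unseen reward function can be inferred instantly.
Decouples the policy representation from the task representation so that policies and successor features can be trained independently, improving sample efficiency and generalisation.
Trains the USFA by combining temporal-difference learning with supervised regression onto successor-feature targets, while sharing feature representations across tasks.
Exploits the fact that value functions computed from successor features are linear in the reward weights, so a policy can be evaluated quickly under a new reward without fine-tuning.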
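
The evaluation and GPI steps listed above can be illustrated with a small NumPy sketch. It is a toy example under stated assumptions, not the authors' implementation: the array `psi` stands in for successor features that a trained USFA would output at one state, and the names `q_values`, `gpi_action`, and `sf_td_target` are hypothetical helpers introduced here.

```python
import numpy as np

def q_values(psi, w):
    """Evaluate every candidate policy on task w: Q[i, a] = psi[i, a, :] . w."""
    return psi @ w

def gpi_action(psi, w):
    """GPI: choose the action whose best value over all candidate policies is highest."""
    q = q_values(psi, w)                  # shape (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))  # max over policies, then argmax over actions

def sf_td_target(phi, psi_next, gamma=0.9):
    """One-step TD target for successor features: phi + gamma * psi(s', a')."""
    return phi + gamma * psi_next

# Toy numbers (assumed): 2 candidate policies, 3 actions, 2 feature dimensions.
psi = np.array([
    [[1.0, 0.0], [0.5, 0.5], [0.0, 0.2]],  # SFs of policy 0 at some state s
    [[0.0, 1.0], [0.2, 0.9], [0.1, 0.1]],  # SFs of policy 1 at the same state
])
w_new = np.array([1.0, -1.0])              # unseen task: reward = phi . w_new

q = q_values(psi, w_new)                   # no learning needed for the new task
action = gpi_action(psi, w_new)
target = sf_td_target(np.array([1.0, 0.0]), psi[0, 1])
```

Changing `w_new` re-evaluates both policies with a single matrix product, which is the sense in which evaluation under a new reward needs no fine-tuning. The TD target has the same form as a Q-learning target, with the feature vector in place of the scalar reward.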

Experimental Results

Research Questions

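RQ1: Can a single function approximator combine the strengths of UVFAs (parametric generalisation in value-function space) and SF & GPI (generalisation through environment structure and dynamic programming) to achieve zero-shot transfer?
RQ2: How does decoupling the policy and task representations in USFAs affect generalisation performance and training stability in a high-dimensional, visual 3D environment?
RQ3: To what extent can USFAs outperform UVFAs and SF & GPI in settings where either method alone performs poorly, for example when many policies are needed or when only a few policies generalise?
RQ4: Which architectural and training choices are key to USFAs generalising to unseen tasks, and how do these choices scale to large domains?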

Key Findings

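USFAs recover both UVFAs and SF & GPI as special cases, so they strictly generalise the two frameworks.
In environments with a large number of optimal policies, USFAs outperform the SF & GPI baseline by drawing on the parametric generalisation of UVFA-style function approximation.
In settings where only a few policies generalise, USFAs recover the strong zero-shot performance of SF & GPI, showing flexibility across regimes.
In regions where the successor-feature approximation is imperfect, the decoupled training of USFAs generalises better than standard UVFAs.
Using only pretrained successor features, USFAs evaluate policies instantly under a new reward function, which significantly reduces inference time compared with fine-tuning.
Empirical results in a large-scale 3D navigation environment show strong transfer performance and scalability, supporting the practical utility of USFAs in complex visual reinforcement learning settings.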
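
One way to read the "special cases" finding is through the set of policy embeddings over which GPI takes its maximum. The sketch below is an illustration under assumptions, not the paper's code: `usfa` is a hypothetical stand-in for a trained approximator at one fixed state, the candidate-set names are introduced here, and the mapping of each set to "UVFA-like" or "SF & GPI-like" behaviour is an interpretation of the finding above.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, D = 4, 3

# Hypothetical stand-in for a trained USFA at one fixed state: it maps a policy
# embedding z of shape (D,) to successor features of shape (N_ACTIONS, D).
A = rng.normal(size=(N_ACTIONS, D, D))

def usfa(z):
    return np.tanh(A @ z)

def gpi_values(w, candidates):
    """For each action, the max over z in the candidate set of usfa(z)[a] . w."""
    q = np.stack([usfa(z) @ w for z in candidates])  # shape (|C|, N_ACTIONS)
    return q.max(axis=0)

w_new = np.array([1.0, -1.0, 0.5])   # unseen task vector (assumed)
train_tasks = list(np.eye(D))        # assumed training tasks, one per feature

q_uvfa_like = gpi_values(w_new, [w_new])            # C = {w'}: query the new task only
q_sf_gpi_like = gpi_values(w_new, train_tasks)      # C = M: policies of training tasks
q_both = gpi_values(w_new, train_tasks + [w_new])   # C = M plus {w'}

action = int(np.argmax(q_both))
```

Because the maximum is taken over a superset, `q_both` is never below either special case for any action. Whether the extra candidates help in practice depends on how accurate the approximator is at each embedding.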

This review was generated by AI and reviewed by human editors.