QUICK REVIEW

[Paper Review] Transfer in Deep Reinforcement Learning Using Successor Features and Generalised Policy Improvement

André Barreto, Diana Borsa|arXiv (Cornell University)|Jan 30, 2019

Reinforcement Learning in Robotics | References: 28 | Citations: 38

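One-Line Summary

This paper generalises the successor features (SFs) and generalised policy improvement (GPI) transfer framework to any set of tasks that differ only in their reward function, shows that reward functions themselves can serve as features for online deep transfer, demonstrates almost instantaneous transfer to unseen tasks in a 3D first-person environment, and describes how to learn specialised policies that can be added to the agent's skill set for later reuse.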

ABSTRACT

The ability to transfer skills across tasks has the potential to scale up reinforcement learning (RL) agents to environments currently out of reach. Recently, a framework based on two ideas, successor features (SFs) and generalised policy improvement (GPI), has been introduced as a principled way of transferring skills. In this paper we extend the SFs & GPI framework in two ways. One of the basic assumptions underlying the original formulation of SFs & GPI is that rewards for all tasks of interest can be computed as linear combinations of a fixed set of features. We relax this constraint and show that the theoretical guarantees supporting the framework can be extended to any set of tasks that only differ in the reward function. Our second contribution is to show that one can use the reward functions themselves as features for future tasks, without any loss of expressiveness, thus removing the need to specify a set of features beforehand. This makes it possible to combine SFs & GPI with deep learning in a more stable way. We empirically verify this claim on a complex 3D environment where observations are images from a first-person perspective. We show that the transfer promoted by SFs & GPI leads to very good policies on unseen tasks almost instantaneously. We also describe how to learn policies specialised to the new tasks in a way that allows them to be added to the agent's set of skills, and thus be reused in the future.

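Research Motivation and Objectives

Motivate transfer in reinforcement learning as a way to scale agents up to complex environments that are currently out of reach.
Relax the requirement that rewards be expressible as linear combinations of a fixed set of features.
Show that reward functions themselves can serve as features for future tasks without any loss of expressiveness.
Achieve online transfer that is compatible with deep learning in a challenging 3D environment, and enable continual learning of new skills.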

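Proposed Method

Extend the SFs & GPI framework beyond the original linear-feature setting by defining a broad set of tasks M that share S, A, p, and gamma and differ only in the reward function.
Provide a theoretical guarantee (Proposition 1) for the transferred policy under arbitrary reward functions.
Replace the need for a predefined feature mapping by using the reward functions themselves as features, enabling scalable integration with deep learning.
Introduce SFs that are learned and applied online, propose GPI combined with Q-learning (Algorithm 1), and pair it with policy specialisation for new tasks.
Describe how to continually expand the SF basis with policies specific to new tasks, so that a growing set of skills can be learned and reused.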
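
The GPI step described above can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: it assumes the simple linear case in which the value of stored policy j on a new task is the dot product of its successor features with a weight vector w, and it only shows action selection at a single state. The function name `gpi_action`, the nested-list layout, and the toy numbers are all hypothetical.

```python
def gpi_action(psi, w):
    """Choose an action by generalised policy improvement (GPI).

    psi[j][a] is the successor-feature vector of stored policy j for
    action a at the current state; w is the weight vector that describes
    the new task's reward (assumed linear in the features here).
    """
    # Value of every stored policy on the new task: Q_j(s, a) = psi_j(s, a) . w
    q = [[sum(p * wi for p, wi in zip(psi_a, w)) for psi_a in psi_j]
         for psi_j in psi]
    n_actions = len(q[0])
    # For each action, keep the value promised by the best stored policy.
    best = [max(q_j[a] for q_j in q) for a in range(n_actions)]
    # Act greedily with respect to that maximum.
    return max(range(n_actions), key=best.__getitem__), q


# Hypothetical toy numbers: 2 stored policies, 3 actions, 2 features.
psi = [
    [[1.0, 0.0], [0.5, 0.5], [0.0, 0.2]],
    [[0.0, 1.0], [0.2, 0.9], [0.3, 0.0]],
]
w_new = [1.0, -1.0]
action, q_values = gpi_action(psi, w_new)
print(action, q_values)
```

In a deep RL agent the successor features would come from a network evaluated on the current observation rather than from a fixed table; the maximisation over policies and then over actions would stay the same.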

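Experimental Results

Research Questions

RQ1: Can SFs & GPI provide performance guarantees even when tasks differ in their reward functions beyond the span of a fixed set of features?
RQ2: Can using the rewards themselves as features support online transfer in deep RL in a scalable way?
RQ3: Do SFs & GPI promote effective transfer to unseen tasks in a high-dimensional, image-based 3D environment?
RQ4: How can policies specialised to new tasks be learned and incorporated into an expanding skill set for continual learning?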

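Main Findings

A bound (Proposition 1) is established showing that the performance of the transferred policy in the extended set M is controlled by terms involving the differences between reward functions and the approximation error.
When rewards are used as features, the SFs become actual value functions, which eases integration with deep learning and online updates.
Empirical results in a 3D first-person environment show that, under SFs & GPI, transfer to unseen tasks occurs almost instantaneously.
The framework supports learning task-specialised policies that can be added to the agent's skill set, enabling continual reuse.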
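
The finding that SFs become value functions when rewards are used as features can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a hypothetical 3-state, 2-action deterministic MDP, rewards that depend only on (s, a), and plain iterative policy evaluation. All names (`NEXT`, `POLICY`, `R1`, `R2`, `evaluate`) and numbers are made up for the example.

```python
GAMMA = 0.9
NEXT = [[1, 2], [2, 0], [0, 1]]             # NEXT[s][a]: deterministic next state
POLICY = [0, 1, 0]                          # fixed policy being evaluated
R1 = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]   # reward of hypothetical task 1
R2 = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.0]]   # reward of hypothetical task 2


def evaluate(signal, dim, sweeps=500):
    """Iterative policy evaluation of a vector signal[s][a] under POLICY."""
    out = [[[0.0] * dim for _ in range(2)] for _ in range(3)]
    for _ in range(sweeps):
        # Synchronous backup: the new table is built entirely from the old one.
        out = [[[signal[s][a][k]
                 + GAMMA * out[NEXT[s][a]][POLICY[NEXT[s][a]]][k]
                 for k in range(dim)]
                for a in range(2)]
               for s in range(3)]
    return out


# Features are the task rewards themselves: phi(s, a) = (r1(s, a), r2(s, a)).
phi = [[[R1[s][a], R2[s][a]] for a in range(2)] for s in range(3)]
psi = evaluate(phi, dim=2)                  # successor features of POLICY

# Ordinary action-value function of POLICY on task 1.
q1 = evaluate([[[R1[s][a]] for a in range(2)] for s in range(3)], dim=1)

# A new task whose reward is a linear combination of the old rewards.
w = [0.5, -1.0]
r_new = [[[w[0] * R1[s][a] + w[1] * R2[s][a]] for a in range(2)] for s in range(3)]
q_new = evaluate(r_new, dim=1)

print(psi[0][0], q1[0][0], q_new[0][0])
```

In this toy setting the first component of `psi` matches `q1`, and the dot product of `psi` with `w` matches `q_new`, which is the sense in which each SF component is itself a value function that can be learned with standard value-based updates.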

This review was generated by AI and checked by a human editor.