QUICK REVIEW

[Paper Review] Multi-task Reinforcement Learning in Reproducing Kernel Hilbert Spaces via Cross-learning

Juan Cerviño, Juan Andrés Bazerque|arXiv (Cornell University)|Aug 27, 2020

Reinforcement Learning in Robotics | References: 64 | Citations: 8

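One-Sentence Summary

This paper proposes cross-learning, a multi-task reinforcement learning framework that constrains each task-specific policy to stay close to a shared central policy in a reproducing kernel Hilbert space (RKHS), which enables fast adaptation to unseen but related tasks. The problem is posed as a constrained optimization and solved with a projected policy gradient method that converges to a near-optimal solution; on navigation tasks, the learned policies generalize well to obstacle shapes not seen during training.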

ABSTRACT

Reinforcement learning (RL) is a framework to optimize a control policy using rewards that are revealed by the system as a response to a control action. In its standard form, RL involves a single agent that uses its policy to accomplish a specific task. These methods require large amounts of reward samples to achieve good performance, and may not generalize well when the task is modified, even if the new task is related. In this paper we are interested in a collaborative scheme in which multiple agents with different tasks optimize their policies jointly. To this end, we introduce cross-learning, in which agents tackling related tasks have their policies constrained to be close to one another. Two properties make our new approach attractive: (i) it produces a multi-task central policy that can be used as a starting point to adapt quickly to one of the tasks trained for, in a situation when the agent does not know which task is currently facing, and (ii) as in meta-learning, it adapts to environments related but different to those seen during training. We focus on continuous policies belonging to reproducing kernel Hilbert spaces for which we bound the distance between the task-specific policies and the cross-learned policy. To solve the resulting optimization problem, we resort to a projected policy gradient algorithm and prove that it converges to a near-optimal solution with high probability. We evaluate our methodology with a navigation example in which agents can move through environments with obstacles of multiple shapes and avoid obstacles not trained for.

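Motivation and Objectives

Address the sample inefficiency and poor generalization of standard single-task reinforcement learning in dynamic or unseen environments.
Enable collaborative learning among multiple agents with different but related tasks by introducing a shared central policy.
Improve policy generalization to tasks not seen during training, mimicking meta-learning behavior without prior knowledge of the task distribution.
Develop a scalable optimization method for continuous policies that retains its convergence guarantees even with high-dimensional kernel representations.
Demonstrate robust performance on navigation tasks with novel obstacle geometries that are absent from the training data.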

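Proposed Method

Formulates multi-task reinforcement learning as a constrained optimization problem in which each task-specific policy must lie within a given distance of a shared central policy in a reproducing kernel Hilbert space (RKHS).
Projects the policies onto the feasible set defined by the cross-learning constraints by solving a quadratically constrained quadratic program (QCQP), which guarantees similarity to the central policy.
Proposes a simplified relaxation in which the coupling constraint is satisfied on average, reducing computational cost and admitting a closed-form projection.
Implements a projected policy gradient algorithm with stochastic gradient estimates in order to cope with partial observability and reduce variance.
Applies kernel approximation techniques (e.g., the Nyström method) to reduce dimensionality and avoid unbounded memory growth.
Introduces a stopping criterion based on the gradient norm and on the proximity of the policies to the central policy, guaranteeing convergence to a near-optimal solution.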
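The interplay between the gradient step and the projection step can be illustrated with a small toy example. The sketch below is an assumption-laden simplification rather than the paper's algorithm: policies are replaced by finite-dimensional weight vectors, each task's reward is a hypothetical quadratic `-||w_i - target_i||^2`, the central policy is fixed to the mean of the task weights, and the projection is a simple radial shrink toward that mean instead of the QCQP or the closed-form relaxation described in the paper. The names `project_to_center`, `cross_learn`, `targets`, and `eps` are all illustrative.

```python
import numpy as np

def project_to_center(W, eps):
    """Pull each row of W to within distance eps of the row mean.

    Simplified feasibility step (an assumption of this sketch): the center
    is fixed to the mean of the task weights, so this is not necessarily
    the exact projection used in the paper.
    """
    center = W.mean(axis=0)
    diff = W - center
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    # Rows already inside the eps-ball keep scale 1; the rest are shrunk.
    scale = np.minimum(1.0, eps / np.maximum(norms, 1e-12))
    return center + scale * diff, center

def cross_learn(targets, eps, step=0.1, iters=200):
    """Toy projected gradient ascent with reward -||w_i - targets[i]||^2."""
    W = np.zeros_like(targets, dtype=float)
    center = W.mean(axis=0)
    for _ in range(iters):
        grad = -2.0 * (W - targets)  # exact gradient of the toy reward
        W, center = project_to_center(W + step * grad, eps)
    return W, center

# Three hypothetical tasks whose unconstrained optima are far apart.
targets = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
W, center = cross_learn(targets, eps=0.5)
dist = np.linalg.norm(W - center, axis=1)  # each entry is at most eps
```

With a small `eps` the task weights are held near the shared center, while a large `eps` leaves the constraint inactive and each task reaches its own optimum. In the paper the gradient would be a stochastic policy gradient estimate rather than the exact gradient used here.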

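Experimental Results

Research Questions

RQ1: Can a shared central policy in an RKHS improve generalization across related but distinct reinforcement learning tasks?
RQ2: Does cross-learning outperform standard single-task reinforcement learning in sample efficiency and in performance on unseen tasks?
RQ3: How does the projected policy gradient method converge under stochastic gradients and kernel-based function approximation?
RQ4: Does the proposed method generalize in navigation environments to obstacle configurations not seen during training?
RQ5: How does relaxing the coupling constraint affect performance and computational complexity while preserving the convergence guarantee?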

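Key Findings

The proposed cross-learning method produces a central policy that generalizes well to all trained tasks and improves the performance of the individual task-specific policies compared with single-task learning.
The projected policy gradient algorithm is proven to converge to a neighborhood of the optimal solution with high probability, under standard assumptions on gradient variance and Lipschitz continuity.
The formulation with the average-constraint relaxation admits a closed-form solution, which lowers computational cost and simplifies the analysis without sacrificing the convergence guarantee.
In navigation tasks with multiple obstacle shapes, policies learned with cross-learning outperformed task-specific policies on novel obstacle geometries not seen during training.
Thanks to the shared policy structure and kernel-based function approximation, performance remained robust as the number of tasks increased.
The theoretical analysis confirms that the gradient estimation error and the policy update step error are bounded, and that the convergence rate depends on the quality of the kernel approximation and on the bound on the gradient variance.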
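Because the findings tie convergence to the quality of the kernel approximation, it may help to see how that quality can be measured. The following is a minimal sketch of the standard Nyström feature construction with a Gaussian (RBF) kernel, written in plain NumPy. The kernel choice, the bandwidth `gamma`, the regularizer `reg`, the random data, and the choice of the first `m` points as landmarks are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystroem_features(X, landmarks, gamma=1.0, reg=1e-8):
    """Map X to features Phi such that Phi @ Phi.T approximates K(X, X)."""
    K_mm = rbf_kernel(landmarks, landmarks, gamma)
    K_nm = rbf_kernel(X, landmarks, gamma)
    vals, vecs = np.linalg.eigh(K_mm)
    vals = np.maximum(vals, reg)  # clip tiny eigenvalues for stability
    # Phi = K_nm V Lambda^{-1/2}, so Phi Phi^T = K_nm K_mm^{-1} K_nm^T.
    return (K_nm @ vecs) * vals**-0.5

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # hypothetical state samples
K = rbf_kernel(X, X)

# Relative Frobenius error of the approximation for several landmark counts.
errors = {}
for m in (10, 50, 200):
    Phi = nystroem_features(X, X[:m])
    errors[m] = np.linalg.norm(K - Phi @ Phi.T) / np.linalg.norm(K)
```

The number of landmarks `m` controls the trade-off: the feature dimension, and hence memory, grows with `m`, while the approximation error shrinks and becomes negligible when every sample is used as a landmark.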


This review was generated by AI and checked by a human editor.