QUICK REVIEW

[Paper Review] Mask-based Latent Reconstruction for Reinforcement Learning

Changyuan Yu, Zhizheng Zhang|arXiv (Cornell University)|Jan 28, 2022

Reinforcement Learning in Robotics | Cited by 21

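One-line summary

MLR adds a mask-based self-supervised objective that reconstructs latent state representations from observation sequences with spatially and temporally masked pixels, improving sample efficiency on both continuous and discrete control RL benchmarks.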

ABSTRACT

For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance. However, in practice, limited experience and high-dimensional inputs prevent effective representation learning. To address this, motivated by the success of mask-based modeling in other research fields, we introduce mask-based reconstruction to promote state representation learning in RL. Specifically, we propose a simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), to predict complete state representations in the latent space from the observations with spatially and temporally masked pixels. MLR enables better use of context information when learning state representations to make them more informative, which facilitates the training of RL agents. Extensive experiments show that our MLR significantly improves the sample efficiency in RL and outperforms the state-of-the-art sample-efficient RL methods on multiple continuous and discrete control benchmarks. Our code is available at https://github.com/microsoft/Mask-based-Latent-Reconstruction.

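Motivation and objectives

Promote more informative state representations for vision-based RL.
Enable joint optimization of representation learning and policy learning without collecting additional data.
Adapt mask-based modeling to RL by reconstructing latent features rather than pixels.
Investigate practical masking strategies and target spaces to understand where the actual gains in RL come from.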

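Proposed method

Randomly mask spatio-temporal cubes of the input observation sequence (cube masking).
Encode the masked observations with an online encoder and the original observations with a momentum encoder.
Use a Transformer-based latent decoder to predict the latent representations, conditioned on actions and temporal positions.
Train with a latent-space reconstruction loss based on the cosine similarity between the predicted and target latent features.
During training, combine the MLR loss with the base RL objective as a weighted sum (L_total = L_rl + lambda * L_mlr).
Apply stop-gradient to the target latent representations to avoid collapse.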
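
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the cube sizes, the mask ratio, the momentum value, and the exact loss form (1 minus cosine similarity) are all hypothetical choices made here for clarity. The encoders and the Transformer decoder are omitted; only the masking, the loss, the momentum update, and the weighted sum are shown.

```python
import numpy as np


def cube_mask(seq_len, height, width, cube_t, cube_h, cube_w, mask_ratio, rng):
    """Boolean mask of shape (seq_len, height, width); True marks a masked pixel.

    Assumes seq_len, height, width are divisible by the cube sizes.
    """
    gt, gh, gw = seq_len // cube_t, height // cube_h, width // cube_w
    n_cubes = gt * gh * gw
    n_masked = int(round(mask_ratio * n_cubes))
    flat = np.zeros(n_cubes, dtype=bool)
    # Pick which cubes to mask, uniformly at random and without repeats.
    flat[rng.choice(n_cubes, size=n_masked, replace=False)] = True
    grid = flat.reshape(gt, gh, gw)
    # Expand every grid cell back to its full spatio-temporal extent.
    return grid.repeat(cube_t, axis=0).repeat(cube_h, axis=1).repeat(cube_w, axis=2)


def latent_reconstruction_loss(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) over latents of shape (T, D).

    Assumed loss form. In an autograd framework the target would be
    detached (stop-gradient) before this call.
    """
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))


def ema_update(momentum_params, online_params, momentum=0.99):
    """Exponential moving average update for the momentum encoder weights."""
    return [momentum * m + (1.0 - momentum) * o
            for m, o in zip(momentum_params, online_params)]


def total_loss(l_rl, l_mlr, lam=1.0):
    """L_total = L_rl + lambda * L_mlr."""
    return l_rl + lam * l_mlr


rng = np.random.default_rng(0)
obs = rng.random((16, 64, 64))  # hypothetical: 16 grayscale frames of 64x64
mask = cube_mask(16, 64, 64, cube_t=8, cube_h=8, cube_w=8, mask_ratio=0.5, rng=rng)
masked_obs = np.where(mask, 0.0, obs)  # masked pixels are zeroed out
```

Here `masked_obs` would be fed to the online encoder and `obs` to the momentum encoder. Masking whole cubes rather than independent pixels removes information that neighbouring pixels and adjacent frames would otherwise reveal trivially.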

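Experimental results

Research questions

RQ1: Does mask-based latent reconstruction improve representation learning for RL from pixel inputs?
RQ2: Which masking strategy (spatial, temporal, or spatial-temporal) performs best in RL?
RQ3: Is latent-space reconstruction more effective than pixel-space reconstruction in RL?
RQ4: How should the reconstruction target be chosen (latent space vs. pixel space), and how does that choice affect policy learning?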

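Key findings

MLR significantly improves sample efficiency and outperforms state-of-the-art sample-efficient RL methods on multiple benchmarks.
On DMControl-100k, MLR outperforms Baseline by 25.3% in mean score and 34.7% in median score, and outperforms PlayVirtual by 43.5% in mean score and 35.5% in median score.
On DMControl-500k, MLR achieves the highest median score, and its mean score is on par with the strongest SOTA method.
On Atari-100k, MLR achieves an interquartile mean (IQM) of 0.432, which is 28.2% higher than SPR and 15.5% higher than PlayVirtual, and an optimality gap (OG) of 0.522, better than both SPR and PlayVirtual.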
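
The relative gains quoted for Atari-100k can be converted back into approximate baseline scores. The sketch below assumes each percentage is a relative improvement, computed as (MLR - baseline) / baseline. The helper name `implied_baseline` is hypothetical, and the resulting values are implied by the figures in this review rather than quoted from the paper's tables.

```python
def implied_baseline(score, relative_gain_pct):
    """Baseline b such that `score` is `relative_gain_pct` percent above b."""
    return score / (1.0 + relative_gain_pct / 100.0)


mlr_iqm = 0.432  # IQM reported in this review for MLR on Atari-100k
spr_iqm = implied_baseline(mlr_iqm, 28.2)
playvirtual_iqm = implied_baseline(mlr_iqm, 15.5)
print(f"{spr_iqm:.3f} {playvirtual_iqm:.3f}")  # prints "0.337 0.374"
```

Under that assumption, the absolute IQM gaps are roughly 0.095 over SPR and 0.058 over PlayVirtual.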

This review was generated by AI and checked by a human editor.