QUICK REVIEW

[Paper Review] DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

Erik Wijmans, Abhishek Kadian|arXiv (Cornell University)|Nov 1, 2019

Image Processing and 3D Reconstruction | Citations: 171

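One-Sentence Summary

DD-PPO is a synchronous, decentralized, distributed RL method that scales across many GPUs to train PointGoal navigators in Habitat-Sim. It achieves near-linear speedup and essentially solves PointGoalNav from RGB-D input with a GPS+Compass sensor. The paper also reports large gains from training at scale, transfer to other navigation tasks, and strong scaling under heterogeneous workloads.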

ABSTRACT

We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially solves the task --near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).
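The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the abstract; the 30-day month and the use of "3 days" as an upper bound are assumptions made here for illustration.

```python
# Figures quoted in the abstract.
gpus_scaling = 128            # GPUs used in the scaling experiment
speedup = 107.0               # reported speedup over a serial implementation
gpus_training = 64            # GPUs used for the 2.5B-step run
wall_clock_days = 3.0         # upper bound ("under 3 days")
total_steps = 2.5e9           # total steps of experience
steps_at_90_percent = 100e6   # steps at which 90% of peak performance is reached

# Parallel efficiency: achieved speedup divided by the ideal (linear) speedup.
efficiency = speedup / gpus_scaling

# Upper bound on GPU-time implied by the wall-clock figure.
gpu_days = gpus_training * wall_clock_days
gpu_months = gpu_days / 30.0  # assumption: 30-day month

# Fraction of the full training budget needed to reach 90% of peak performance.
budget_fraction = steps_at_90_percent / total_steps

print(f"parallel efficiency: {efficiency:.1%}")
print(f"GPU-time upper bound: {gpu_days:.0f} GPU-days (~{gpu_months:.1f} months)")
print(f"budget fraction for 90% of peak: {budget_fraction:.0%}")
```

Under these assumptions, 107x on 128 GPUs corresponds to roughly 84% parallel efficiency, 64 GPUs for at most 3 days is at most 192 GPU-days (in line with "over 6 months of GPU-time"), and 100 million steps is 4% of the full 2.5 billion step budget.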

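Motivation and Objectives

Motivate scalable reinforcement learning for resource-intensive 3D embodied AI tasks.
Propose a simple, synchronous distributed training framework that does not use a parameter server.
Demonstrate near-linear scaling and substantial performance gains for large-scale PPO training in Habitat-Sim.
Show that the learned representations transfer to other navigation tasks, and compare RGB-D and RGB inputs.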

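Proposed Method

Introduces Decentralized Distributed Proximal Policy Optimization (DD-PPO), a synchronous distributed RL method with no parameter server.
Each worker collects rollouts in a GPU-accelerated simulator, computes policy gradients with PPO, and synchronizes gradients with the other workers via AllReduce.
To improve scaling under heterogeneous workloads, implements a preemption threshold that cuts short rollout collection on slow workers (stragglers).
Uses PyTorch DistributedDataParallel with a TCPStore for coordination, and applies PPO with clipping and Generalized Advantage Estimation (GAE).
Runs experiments in Habitat with several visual encoders (ResNet50, SE-ResNeXt50/101) and training datasets (Gibson-4+, Gibson-2+/MP3D).
Shows that the framework scales to hundreds of GPUs and across different workload regimes.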
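The interaction between straggler preemption and synchronous gradient averaging can be illustrated with a small single-process simulation. This is a toy sketch, not the authors' implementation: the function names, the tick-based timing model, the 0.6 preemption fraction, the unweighted mean, and the quadratic "loss" standing in for the PPO objective are all assumptions made for illustration. A real setup would use PyTorch DistributedDataParallel for the gradient synchronization.

```python
import random

def collect_with_preemption(step_costs, steps_wanted, preempt_fraction):
    """Simulate rollout collection in lockstep ticks.

    step_costs[i] is the number of ticks worker i needs per environment step.
    Once `preempt_fraction` of the workers hold a full rollout, the remaining
    workers stop early and keep whatever steps they have collected so far.
    Returns the number of steps collected by each worker.
    """
    n = len(step_costs)
    collected = [0] * n
    tick = 0
    while True:
        tick += 1
        for i, cost in enumerate(step_costs):
            if collected[i] < steps_wanted and tick % cost == 0:
                collected[i] += 1
        finished = sum(c >= steps_wanted for c in collected)
        if finished >= preempt_fraction * n:
            return collected

def all_reduce_mean(per_worker_grads):
    """Stand-in for AllReduce: every worker receives the element-wise mean."""
    n = len(per_worker_grads)
    mean = [sum(vals) / n for vals in zip(*per_worker_grads)]
    return [list(mean) for _ in range(n)]

def toy_gradient(params, num_steps, rng):
    """Hypothetical stand-in for a PPO gradient: the gradient of
    0.5 * ||params||^2 plus noise that shrinks with more rollout steps."""
    noise = 1.0 / max(num_steps, 1) ** 0.5
    return [p + rng.gauss(0.0, noise) for p in params]

rng = random.Random(0)
num_workers = 4
replicas = [[1.0, -2.0] for _ in range(num_workers)]  # identical initial weights
step_costs = [1, 1, 1, 4]                              # worker 3 is a straggler

# 1) Collect rollouts; the straggler is cut short once 60% of workers finish.
steps = collect_with_preemption(step_costs, steps_wanted=8, preempt_fraction=0.6)

# 2) Each worker computes a local gradient from its own rollout.
grads = [toy_gradient(replicas[i], steps[i], rng) for i in range(num_workers)]

# 3) Synchronize gradients, then 4) apply the same update on every worker.
synced = all_reduce_mean(grads)
lr = 0.1
replicas = [[p - lr * g for p, g in zip(params, grad)]
            for params, grad in zip(replicas, synced)]

print("steps collected per worker:", steps)  # [8, 8, 8, 2]
print("replicas still identical:", all(r == replicas[0] for r in replicas))
```

Because every worker starts from the same weights and applies the same averaged gradient, the replicas stay identical without any central parameter server, and no worker ever computes against stale weights. The preemption step trades some of the straggler's experience (2 of 8 steps here) for not making the other workers wait.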

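Experimental Results

Research Questions

RQ1: What are the fundamental limits of learnability for PointGoalNav with GPS+Compass and RGB-D input?
RQ2: How do more training data and different visual encoders affect PointGoalNav performance?
RQ3: Can PointGoalNav policies pre-trained with RGB-D transfer to related embodied navigation tasks?
RQ4: How does using RGB alone affect PointGoalNav performance and solvability?
RQ5: How well does DD-PPO scale across homogeneous and heterogeneous simulation workloads?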

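Key Findings

DD-PPO achieves near-linear scaling, with a speedup of up to 107x on 128 GPUs over a serial baseline.
Training for 2.5B steps (about 80 years of human experience) in under 3 days on 64 GPUs yields state-of-the-art results on the Habitat Challenge 2019 PointGoalNav track with GPS+Compass.
Error versus computation follows a power-law-like trend: 90% of peak performance is reached at about 100M steps, so most of the gain comes early and cheaply.
PointGoalNav policies trained with DD-PPO transfer to other tasks (Flee, Explore) and outperform ImageNet pre-trained baselines in the transfer setting.
RGB-D with GPS+Compass achieves near-shortest-path performance (SPL close to that of a shortest-path oracle). With suitable training data, RGB alone approaches the state of the art. RGB without GPS+Compass remains difficult.
The learned representations serve as a reusable resource that enables rapid adaptation to new navigation tasks.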
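The findings above are stated in terms of SPL (Success weighted by Path Length), the standard PointGoalNav metric: each successful episode scores the ratio of the shortest-path length to the length of the path the agent actually took, failed episodes score zero, and the result is averaged over episodes. A minimal sketch follows; the episode values are invented for illustration and are not results from the paper, and path lengths are assumed to be positive.

```python
def spl(episodes):
    """Compute SPL over a list of episodes.

    Each episode is a tuple (success, shortest_path_length, agent_path_length).
    A successful episode contributes shortest / max(agent, shortest);
    a failed episode contributes 0. Path lengths are assumed to be positive.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical episodes, for illustration only.
episodes = [
    (True, 10.0, 10.0),   # reached the goal along the shortest path
    (True, 10.0, 12.5),   # reached the goal along a path 25% longer
    (False, 10.0, 4.0),   # did not reach the goal
]

print(f"SPL = {spl(episodes):.3f}")
```

An agent that always succeeds along the shortest path scores 1.0, so "SPL close to that of a shortest-path oracle" means the agent both reaches the goal almost every time and wastes very little distance doing so.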

This review was generated by AI and checked by a human editor.