QUICK REVIEW

[Paper Review] DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

Erik Wijmans, Abhishek Kadian|arXiv (Cornell University)|Nov 1, 2019

Advanced Neural Network Applications | References: 37 | Citations: 36

ONE-LINE SUMMARY

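DD-PPO is a synchronous, decentralized, distributed Proximal Policy Optimization (PPO) method that scales across many GPUs to train embodied agents in Habitat-Sim. It achieves near-perfect PointGoal navigation from RGB-D and GPS+Compass inputs, and the learned policies transfer to related tasks.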

ABSTRACT

We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially solves the task --near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).

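Motivation and Objectives

Present a simple, scalable distributed RL framework for resource-intensive 3D simulated environments.
Demonstrate near-linear scaling of multi-GPU training with synchronous gradient updates.
Train agents on the PointGoal Navigation task, reach near-perfect performance, and analyze the limits of learnability.
Evaluate how well the learned policies transfer to related navigation tasks.
Release code and models publicly to enable reproducibility and reuse.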

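Proposed Method

Proposes Decentralized Distributed Proximal Policy Optimization (DD-PPO), which is decentralized (no centralized server), distributed (runs across multiple machines), and synchronous (no computation is ever stale).
Uses a synchronous AllReduce to average gradients across N workers; each worker then updates its local model with the averaged gradient.
Introduces a preemption threshold that ends rollout collection early on slow workers, reducing straggler delays under heterogeneous workloads.
Applies on-policy PPO in Habitat-Sim with camera inputs (RGB-D or RGB) and a GPS+Compass sensor.
Runs experiments on the Gibson and Matterport3D datasets with SE-ResNeXt encoders and LSTM-based policy/value heads.
Provides an implementation built on PyTorch DistributedDataParallel, with TCPStore used for coordination.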
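
The two core mechanisms listed above, the preemption threshold and the synchronous gradient averaging, can be sketched in a few lines. This is a minimal single-process simulation using only the Python standard library, not the authors' implementation: the function names (`collect_rollouts`, `all_reduce_mean`, `synchronous_update`), the per-step costs, the rollout length of 128, the preemption fraction of 0.6, and the random "gradients" are all illustrative assumptions.

```python
import math
import random


def collect_rollouts(step_costs, rollout_len, preempt_fraction):
    """Return the number of steps each worker collects before preemption.

    step_costs[i] is the (hypothetical) simulated time worker i needs per
    environment step. Once `preempt_fraction` of the workers have finished
    a full rollout, the remaining stragglers are stopped and keep only the
    steps they have collected so far.
    """
    finish_times = sorted(cost * rollout_len for cost in step_costs)
    needed = max(1, math.ceil(preempt_fraction * len(step_costs)))
    cutoff = finish_times[needed - 1]  # time when enough workers are done
    steps = []
    for cost in step_costs:
        if cost * rollout_len <= cutoff:
            steps.append(rollout_len)  # finished in time
        else:
            steps.append(min(rollout_len, int(cutoff / cost)))  # preempted
    return steps


def all_reduce_mean(worker_grads):
    """Element-wise mean over workers, standing in for AllReduce / N."""
    num_workers = len(worker_grads)
    return [sum(column) / num_workers for column in zip(*worker_grads)]


def synchronous_update(worker_params, worker_grads, lr):
    """Every worker applies the same averaged gradient to its local copy."""
    avg_grad = all_reduce_mean(worker_grads)
    return [[p - lr * g for p, g in zip(params, avg_grad)]
            for params in worker_params]


rng = random.Random(0)
num_workers = 8
# Assumed heterogeneous workload: each worker has a different step cost.
step_costs = [rng.uniform(1.0, 3.0) for _ in range(num_workers)]
steps = collect_rollouts(step_costs, rollout_len=128, preempt_fraction=0.6)

# Toy gradients: random numbers in place of real PPO gradients.
params = [[0.5, -0.5] for _ in range(num_workers)]
grads = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(num_workers)]
params = synchronous_update(params, grads, lr=0.1)

print("steps collected per worker:", steps)
print("replicas identical:", all(p == params[0] for p in params))
```

Because every worker starts from the same parameters and applies the same averaged gradient, the model replicas stay identical without a parameter server, which is the sense in which the method is both decentralized and synchronous.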

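Experimental Results

Research Questions

RQ1: Can DD-PPO achieve near-linear scaling across many GPUs on a resource-intensive 3D simulator?
RQ2: What are the fundamental learnability limits of PointGoal Navigation with GPS+Compass and RGB-D inputs?
RQ3: Do more training data and greater encoder capacity improve PointGoalNav performance, and how robust is the agent to low-quality reconstructions?
RQ4: Can navigation be solved with RGB only, or without GPS+Compass, and to what extent?
RQ5: Do policies trained on PointGoalNav transfer effectively to related embodied AI tasks (e.g., Flee, Explore)?

Key Findings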

Training Dataset	Agent Visual Encoder	Validation SPL	Validation Success	Test SPL	Test Success
Gibson-4+	ResNet50	0.922 ± 0.004	0.967 ± 0.003	0.917	0.970
Gibson-4+ and MP3D	ResNet50	0.956 ± 0.002	0.996 ± 0.002	0.941	0.996
Gibson-2+	ResNet50	0.956 ± 0.003	0.994 ± 0.002	0.944	0.982
Gibson-4+	SE-ResNeXt50	0.959 ± 0.002	0.999 ± 0.001	0.943	0.988
Gibson-4+ and MP3D	SE-ResNeXt101 + 1024-d LSTM	0.969 ± 0.002	0.997 ± 0.001	0.948	0.980
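
The table reports SPL (Success weighted by Path Length) without defining it. As a reminder, the standard definition averages, over episodes, the success indicator times the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The sketch below follows that standard definition; the function name `spl`, the tuple layout, and the assumption that shortest-path lengths are positive are illustrative choices, not taken from the review.

```python
def spl(episodes):
    """Compute Success weighted by Path Length.

    episodes: iterable of (success, shortest_path_length, agent_path_length).
    Assumes every shortest_path_length is positive.
    """
    episodes = list(episodes)
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode scores 1.0 only if the agent's path is
            # as short as the shortest path; longer paths score less.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# One efficient success, one success with a 2x detour, one failure.
example = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
print(spl(example))
```

Under this metric a value of 1.0 means every episode succeeded along a shortest path, so SPL is always bounded above by the success rate reported in the neighboring column.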

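DD-PPO scales near-linearly in Habitat-Sim, reaching a 107x speedup on 128 GPUs over a serial implementation.
Training for 2.5B steps (about 80 years of human experience) achieves state-of-the-art SPL on the Habitat Autonomous Navigation Challenge 2019; with RGB-D and GPS+Compass, the agent navigates unseen environments with close to shortest-path efficiency.
Error versus computation follows a power-law-like distribution: 90% of peak performance is reached at about 100M steps (under 1 day with 8 GPUs), so most of the performance is available at a small fraction of the full training cost.
With RGB-D sensing and the strongest visual encoder (SE-ResNeXt101 + 1024-d LSTM), the agent reaches an SPL of 0.969 on validation and 0.948 on test, about 3–5% short of the shortest-path oracle; RGB-only agents also approach the state of the art given enough data.
Transferring the learned representations to the Flee and Explore tasks outperforms ImageNet pre-trained baselines, indicating that the learned scene understanding is broadly useful for navigation.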
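
The headline numbers can be sanity-checked with a little arithmetic. The sketch below uses only figures quoted in the abstract and the results table; the helper names are hypothetical, and treating an SPL of 1.0 as the shortest-path oracle is an assumption made for the gap calculation.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_gpus


def gpu_days(num_gpus, wall_clock_days):
    """Total GPU time consumed, in GPU-days."""
    return num_gpus * wall_clock_days


efficiency = scaling_efficiency(107, 128)  # 107x speedup on 128 GPUs
total_gpu_days = gpu_days(64, 3)           # upper bound: "under 3 days" on 64 GPUs
step_fraction = 100e6 / 2.5e9              # steps needed for 90% of peak
gap_val = 1.0 - 0.969                      # distance from SPL = 1.0 (validation)
gap_test = 1.0 - 0.948                     # distance from SPL = 1.0 (test)

print(f"scaling efficiency: {efficiency:.1%}")
print(f"GPU time bound: {total_gpu_days} GPU-days")
print(f"share of steps for 90% of peak: {step_fraction:.0%}")
print(f"gap to oracle: val {gap_val:.1%}, test {gap_test:.1%}")
```

Under these assumptions, 128 GPUs deliver roughly 84% of ideal linear scaling, 64 GPUs for up to 3 days amount to at most 192 GPU-days (a little over 6 months, matching the abstract), and 90% of peak performance arrives after about 4% of the total training steps.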

This review was generated by AI and checked by a human editor.