QUICK REVIEW

[Paper Review] Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

Yuke Zhu, Roozbeh Mottaghi|arXiv (Cornell University)|Sep 16, 2016

Reinforcement Learning in Robotics | References: 52 | Cited by: 163

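One-Sentence Summary

This paper proposes a target-driven deep reinforcement learning model that combines a Siamese actor-critic architecture with the AI2-THOR simulation framework to generalize across targets and scenes, improve data efficiency, and transfer from simulation to a real robot.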

ABSTRACT

Two less addressed issues of deep reinforcement learning are (1) lack of generalization capability to new target goals, and (2) data inefficiency i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to be applied to real-world scenarios. In this paper, we address these two issues and apply our model to the task of target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows to better generalize. To address the second issue, we propose AI2-THOR framework, which provides an environment with high-quality 3D scenes and physics engine. Our framework enables agents to take actions and interact with objects. Hence, we can collect a huge number of training samples efficiently. We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and across scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), (4) is end-to-end trainable and does not need feature engineering, feature matching between frames or 3D reconstruction of the environment. The supplementary video can be accessed at the following link: https://youtu.be/SmBxMDiOrvs.

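Research Motivation and Goals

Close the generalization gap of deep reinforcement learning in visual navigation by making the target part of the policy input.
Develop a high-quality simulation environment (AI2-THOR) that enables scalable data collection and realistic indoor interaction.
Propose a target-driven policy that generalizes across targets without retraining.
Demonstrate end-to-end trainability without feature engineering or explicit 3D reconstruction.
Evaluate generalization to new targets, new scenes, continuous space, and a real robot.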

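Proposed Method

Proposes a deep Siamese actor-critic network that processes the current observation and the target image in parallel with shared weights, producing a joint embedding from which the policy and value outputs are computed.
Uses scene-specific final layers to capture layout-specific navigation cues, while the generic Siamese layers are shared across targets and scenes.
Discretizes actions into moving forward/backward and turning left/right, and adds Gaussian noise to the motion to model uncertainty in the dynamics.
Uses a frozen ImageNet-pretrained ResNet-50 backbone as the feature extractor, stacks a history of 4 frames as input, and projects the embedding into a 512-dimensional space.
Trains with an A3C-like asynchronous protocol in which each thread handles a different navigation target and updates the scene-specific and generic layers accordingly.
The reward consists of a sparse goal-reaching reward (10.0) and a small time penalty (-0.01) that encourages shorter trajectories.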
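
The architecture and reward described in the list above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the scaled-down sizes (FRAME_FEAT, EMBED), the random initialization, the single fusion layer, and the class and function names are all assumptions of this sketch. Only the 4-frame history, the four discrete actions, the shared-weight branches, the per-scene heads, and the reward values come from the summary. The Gaussian motion noise and the training loop are left out.

```python
import math
import random

# Hypothetical, scaled-down sizes for illustration only.
FRAME_FEAT = 8   # stand-in for one frame's ResNet-50 feature vector (assumed size)
HISTORY = 4      # frames stacked per input (from the summary)
EMBED = 16       # stand-in for the 512-d embedding (assumed size)
ACTIONS = ["forward", "backward", "turn_left", "turn_right"]


def make_layer(rng, n_out, n_in):
    """Random dense layer stored as a list of weight rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, x, relu=True):
    """Matrix-vector product with an optional ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SiameseActorCritic:
    def __init__(self, scenes, seed=0):
        rng = random.Random(seed)
        # Generic layers: one embedding shared by both branches, then a fusion layer.
        self.embed = make_layer(rng, EMBED, FRAME_FEAT * HISTORY)
        self.fuse = make_layer(rng, EMBED, 2 * EMBED)
        # Scene-specific layers: one policy head and one value head per scene.
        self.heads = {
            s: (make_layer(rng, len(ACTIONS), EMBED), make_layer(rng, 1, EMBED))
            for s in scenes
        }

    def forward(self, scene, obs_feat, target_feat):
        # Both inputs pass through the SAME weights (the Siamese part).
        e_obs = dense(self.embed, obs_feat)
        e_tgt = dense(self.embed, target_feat)
        joint = dense(self.fuse, e_obs + e_tgt)  # "+" concatenates the two lists
        policy_w, value_w = self.heads[scene]
        probs = softmax(dense(policy_w, joint, relu=False))
        value = dense(value_w, joint, relu=False)[0]
        return probs, value


def reward(reached_target):
    """10.0 on reaching the target, otherwise the -0.01 time penalty.
    Assumption: the final step receives only the goal reward."""
    return 10.0 if reached_target else -0.01
```

Calling `SiameseActorCritic(["scene_a", "scene_b"]).forward("scene_a", obs, tgt)` with two feature lists of length `FRAME_FEAT * HISTORY` returns a probability for each of the four actions and a scalar value estimate. Switching the scene name changes only the heads, while the embedding and fusion weights stay shared.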

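Experimental Results

Research Questions

RQ1: Can the target-driven policy generalize to unseen targets within the same scene?
RQ2: When the learned representations are reused in unseen scenes, can the model generalize to unseen targets?
RQ3: Does sharing information across targets improve data efficiency over conventional DRL baselines?
RQ4: Does the method transfer to continuous space and, with limited fine-tuning, to a real-robot setting?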

Key Findings

Type	Method	Average trajectory length (steps)
Heuristic	Random walk	2744.3
Heuristic	Shortest path	17.6
Purpose-built RL	One-step Q	2539.2
Purpose-built RL	A3C (1 thread)	1241.3
Purpose-built RL	A3C (4 threads)	723.5
Target-driven RL	Single branch	581.6
Target-driven RL	Final (ours)	210.7

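The final target-driven model achieves a much shorter average trajectory (210.7 steps) than the baselines, including the A3C variants and the single-branch target-driven model.
Data efficiency improves: after 100M training frames, the final model outperforms state-of-the-art DRL methods.
The model generalizes to unseen targets within a scene and across unseen scenes, thanks to the shared Siamese layers and the scene-specific layers.
t-SNE visualization shows that the embedding space preserves the spatial layout, hinting at implicit localization/mapping.
In the continuous-space task, the model reaches the door/target in significantly fewer steps, although it needs more training frames.
Robot experiments show successful sim-to-real transfer with a small amount of fine-tuning, and transferring the learned generic layers speeds up convergence.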
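
To put the table's trajectory lengths in proportion, the short sketch below expresses each method as a multiple of the final model and of the shortest path. The numbers are copied from the table. The helper name `ratio_to` is an assumption of this sketch, and the ratios are plain arithmetic on the reported averages, not additional results from the paper.

```python
# Average trajectory lengths (steps), copied from the table above.
avg_steps = {
    "Random walk": 2744.3,
    "Shortest path": 17.6,
    "One-step Q": 2539.2,
    "A3C (1 thread)": 1241.3,
    "A3C (4 threads)": 723.5,
    "Single branch": 581.6,
    "Final (ours)": 210.7,
}


def ratio_to(reference, lengths=avg_steps):
    """How many times longer each method's trajectories are than the reference's."""
    base = lengths[reference]
    return {name: steps / base for name, steps in lengths.items()}


vs_final = ratio_to("Final (ours)")
vs_shortest = ratio_to("Shortest path")

for name in avg_steps:
    print(f"{name:16s} {vs_final[name]:6.2f}x final   {vs_shortest[name]:7.2f}x shortest path")
```

By this arithmetic, A3C with 4 threads needs roughly 3.4 times as many steps as the final model, while the final model is still about 12 times longer than the shortest path.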

This review was generated by AI and reviewed by a human editor.