QUICK REVIEW

[Paper Review] Playing FPS Games with Deep Reinforcement Learning

Guillaume Lample, Devendra Singh Chaplot|arXiv (Cornell University)|Sep 18, 2016

Reinforcement Learning in Robotics | Cited by 111

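One-Sentence Summary

The authors build a DRQN-based agent for 3D FPS deathmatch in ViZDoom. Auxiliary game-feature training and a split into navigation and action networks speed up training and let the agent outperform human players.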

ABSTRACT

Advances in deep reinforcement learning have allowed autonomous agents to perform well on Atari games, often outperforming humans, using only raw pixels to make their decisions. However, most of these games take place in 2D environments that are fully observable to the agent. In this paper, we present the first architecture to tackle 3D environments in first-person shooter games, that involve partially observable states. Typically, deep reinforcement learning methods only utilize visual input for training. We present a method to augment these models to exploit game feature information such as the presence of enemies or items, during the training phase. Our model is trained to simultaneously learn these features along with minimizing a Q-learning objective, which is shown to dramatically improve the training speed and performance of our agent. Our architecture is also modularized to allow different models to be independently trained for different phases of the game. We show that the proposed architecture substantially outperforms built-in AI agents of the game as well as humans in deathmatch scenarios.

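Motivation and Goals

Address partial observability in 3D FPS environments by using a recurrent network.
Improve training efficiency and performance by augmenting training with game features.
Speed up learning by splitting the task into a navigation phase and an action phase, each handled by its own network.
Demonstrate generalization to unseen maps, and compare against human players and built-in bots.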

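Proposed Method

Builds on the DRQN architecture with two streams on top of a shared CNN: the CNN output feeds both an LSTM and an auxiliary game-feature head.
Uses binary game-feature labels (presence of enemies or items) as extra supervision during training, which steers the convolutional filters toward game-relevant information. The labels are not needed at test time.
Introduces a two-phase architecture: a navigation network (DQN) for exploration and an action network (DRQN augmented with game features) for combat. The active phase is chosen by whether an enemy is present.
Trains game-feature prediction jointly with the Q-learning objective, so that feature detection guides policy learning.
Applies reward shaping to mitigate sparse and delayed rewards, and uses frame skip to speed up training.
Uses sequential DRQN updates that require a minimum amount of history, which stabilizes learning.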
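
The training signal described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `joint_loss` and `select_network`, the dictionary keys, the discount `gamma`, the `min_history` and `feature_weight` defaults, and the 0.5 detection threshold are all hypothetical and chosen for illustration. It adds a squared TD error to a binary cross-entropy term on the game-feature predictions, skips the earliest steps of a sequence so that only steps with enough recurrent history contribute, and hands control to the action network when an enemy is predicted to be visible.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def joint_loss(steps, gamma=0.99, min_history=4, feature_weight=1.0):
    """Mean of (squared TD error + feature_weight * BCE) over steps with enough history.

    Each step is a dict with hypothetical keys:
      q_sa        Q(s_t, a_t) predicted for the action actually taken
      reward      (shaped) reward r_t
      next_q_max  max over a of Q(s_{t+1}, a); ignored when done is True
      done        True if s_{t+1} is terminal
      feat_pred   predicted probabilities, e.g. [P(enemy visible)]
      feat_true   binary labels from the game engine (training time only)
    """
    total, counted = 0.0, 0
    for t, step in enumerate(steps):
        if t < min_history:
            continue  # too little history for the recurrent state: skip this step
        bootstrap = 0.0 if step["done"] else gamma * step["next_q_max"]
        target = step["reward"] + bootstrap
        td_error_sq = (target - step["q_sa"]) ** 2
        aux = sum(bce(p, y) for p, y in zip(step["feat_pred"], step["feat_true"]))
        total += td_error_sq + feature_weight * aux
        counted += 1
    return total / counted if counted else 0.0


def select_network(enemy_prob, threshold=0.5):
    """Pick the phase: combat if an enemy is likely visible, exploration otherwise."""
    return "action" if enemy_prob >= threshold else "navigation"
```

Because the game-feature labels enter only through the loss, a network trained this way needs nothing beyond screen pixels at test time; the feature head's own enemy prediction can then drive `select_network`.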

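Experimental Results

Research Questions

RQ1: Can a DRQN-based agent learn an effective policy in a partially observable 3D FPS environment?
RQ2: Does using game-engine features during training, even though they are unavailable at test time, speed up learning and improve performance?
RQ3: Does a divide-and-conquer navigation/action architecture improve training efficiency and final performance compared with a single monolithic network?
RQ4: How well does the method generalize to unseen maps, and how does it compare with human players and built-in bots?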

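Key Findings

The DRQN augmented with game features substantially outperforms the baseline DRQN on the deathmatch task.
The modular design with a dedicated navigation network gives better results than a single network, reducing "camping" behavior and improving map exploration.
With joint game-feature training, enemy-detection accuracy reaches about 90% after a few hours of training, which accelerates learning.
In ViZDoom deathmatch, the agent outperforms both the built-in Doom bots and human players in single-player and multiplayer settings (K/D ratio, single-player: Human 1.52 vs Agent 5.12; multiplayer: Human 0.49 vs Agent 1.33).
With navigation enabled, the agent collects more items and achieves a higher K/D ratio (for example, the gain is larger in full deathmatch, where picking up weapons and other items matters).
At its best, the agent trained with game features reaches a K/D ratio above 4.0, and the architecture supports generalization to unseen maps.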
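
To put the reported K/D ratios in perspective, the short calculation below expresses the agent's ratio as a multiple of the human ratio in each setting. It is an illustrative sketch: the helper names `kd_ratio` and `relative_gain` are hypothetical, and the only data used are the four K/D values quoted above.

```python
def kd_ratio(kills, deaths):
    """Kill-to-death ratio; zero deaths is reported as infinity."""
    return float("inf") if deaths == 0 else kills / deaths


def relative_gain(agent_kd, human_kd):
    """How many times larger the agent's K/D ratio is than the human's."""
    return agent_kd / human_kd


# (human K/D, agent K/D) as quoted in the findings above
reported = {"single-player": (1.52, 5.12), "multiplayer": (0.49, 1.33)}

for setting, (human, agent) in reported.items():
    print(f"{setting}: agent K/D is {relative_gain(agent, human):.2f}x the human K/D")
# single-player: agent K/D is 3.37x the human K/D
# multiplayer: agent K/D is 2.71x the human K/D
```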

This review was generated by AI and checked by a human editor.