QUICK REVIEW

[Paper Review] The NetHack Learning Environment

Heinrich Küttler, Nantas Nardelli | arXiv (Cornell University) | Jun 24, 2020

Reinforcement Learning in Robotics | References: 56 | Cited by: 40

One-Sentence Summary

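This paper introduces the NetHack Learning Environment (NLE), a fast, complex, procedurally generated RL benchmark based on NetHack, together with a task suite, baselines, and an analysis of agent behavior and generalization.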

ABSTRACT

Progress in Reinforcement Learning (RL) algorithms goes hand-in-hand with the development of challenging environments that test the limits of current methods. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both. Here, we present the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience. We compare NLE and its task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed Deep RL baseline and Random Network Distillation exploration, alongside qualitative analysis of various agents trained in the environment. NLE is open source at https://github.com/facebookresearch/nle.

Motivation and Goals

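Stimulate reinforcement learning research with an environment that is fast yet rich and complex, and that challenges exploration, planning, memory, and transfer.
Provide a Gym-compatible interface around NetHack to enable scalable experiments.
Release an initial task set and baselines that demonstrate learning and generalization with long-horizon goals and a symbolic observation space.
Facilitate analysis of agent behavior, generalization across seeds, and the effect of exploration strategies.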

Proposed Method

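Implement NLE as a Gym environment built on NetHack 3.6.6, with controlled seeding and access to internal game state through a Python front end.
Define symbolic, multimodal observations (the glyphs, chars, colors, specials, blstats, message, and inv_* fields) and 93 actions (77 commands + 16 movement actions).
Use an architecture centered on an egocentric representation, with glyph embeddings, 2D convolutions, and MLPs that produce a latent observation encoding, combined with an LSTM-based policy.
Train baseline agents with IMPALA (TorchBeast) for 1B steps, using random seeds and several character configurations.
Extend the baseline with Random Network Distillation (RND) to encourage exploration in a sparse-reward, high-variance environment.
Provide a dashboard and replay tools for analyzing agent behavior and action distributions.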
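
To make the "egocentric representation" in the list above more concrete, here is a minimal sketch of cropping a square window of glyph ids around the agent. It is an illustration only, not the authors' code: the function name egocentric_crop, the 9 x 9 window, and the padding value -1 are assumptions made for this example, and the toy map simply reuses the 21 x 79 size of the NetHack dungeon grid.

```python
from typing import List


def egocentric_crop(glyphs: List[List[int]], row: int, col: int,
                    size: int = 9, pad: int = 0) -> List[List[int]]:
    """Return a size x size window of `glyphs` centered on (row, col).

    Cells that fall outside the map are filled with `pad`.
    `size` and `pad` are illustrative choices, not values from the paper.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the agent sits at the center")
    half = size // 2
    height, width = len(glyphs), len(glyphs[0])
    window = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            line.append(glyphs[r][c] if inside else pad)
        window.append(line)
    return window


# Toy 21 x 79 map in which every cell holds a unique id (hypothetical data).
toy_map = [[r * 79 + c for c in range(79)] for r in range(21)]

# Agent in the top-left corner: most of the window lies outside the map.
crop = egocentric_crop(toy_map, row=0, col=0, size=9, pad=-1)
```

In a full agent, a window like this would typically be embedded and passed through convolutions alongside the full map, so that the policy sees the agent's surroundings at a fixed position regardless of where it stands in the dungeon.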

Experimental Results

Research Questions

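RQ1: Can a fast, procedurally generated, symbolically rich environment such as NetHack drive robust reinforcement learning methods capable of long-horizon planning and exploration?
RQ2: How do baseline model-free RL methods perform on NetHack tasks, and how do intrinsic exploration rewards such as RND affect learning and generalization?
RQ3: What role do character configuration, seed diversity, and model capacity play in generalizing to unseen seeds and longer-horizon goals?
RQ4: What qualitative failure modes and strategies emerge as agents learn in a complex, multi-entity symbolic environment like NetHack?
RQ5: How suitable is NetHack for evaluating transfer, lifelong learning, and learning from demonstrations?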

Key Findings

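Baseline agents trained with IMPALA and RND learn diverse strategies across several character configurations in the early stages of NetHack.
Random Network Distillation yields notable gains on some subgoals (for example, staircase navigation) and helps exploration when rewards are sparse, although the effect varies by task and character.
Generalization improves as the set of training seeds grows; training on at least 1000 seeds narrows the gap between training and test performance, indicating reduced memorization.
Agents show clear failure modes (for example, dying in combat while descending, or to shape-changing threats such as chameleons), which points to the need for robust representations and planning in long-horizon tasks.
The symbolic observation space and long episodes make NetHack a suitable benchmark for testing generalization, hierarchical planning, and lifelong learning in RL.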
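
The exploration bonus behind the RND finding above can be sketched in a few lines. This is a toy version under stated assumptions, not the baseline used in the paper: the class name RNDBonus, the linear predictor, the tanh target features, the learning rate, and the one-hot observations are all hypothetical choices for illustration. The idea it shows is the general one: a fixed random target network is imitated by a trained predictor, and the prediction error serves as an intrinsic reward that shrinks for observations the agent has seen often.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class RNDBonus:
    """Toy RND: fixed random target, linear predictor trained to imitate it."""

    def __init__(self, obs_dim: int, feat_dim: int = 4,
                 lr: float = 0.05, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.target = random_matrix(feat_dim, obs_dim, rng)     # never trained
        self.predictor = random_matrix(feat_dim, obs_dim, rng)  # trained online
        self.lr = lr

    def _errors(self, obs: Vector) -> Vector:
        target_feats = [math.tanh(v) for v in matvec(self.target, obs)]
        predicted = matvec(self.predictor, obs)
        return [p - t for p, t in zip(predicted, target_feats)]

    def bonus(self, obs: Vector) -> float:
        """Intrinsic reward: squared prediction error on this observation."""
        return sum(e * e for e in self._errors(obs))

    def update(self, obs: Vector) -> None:
        """One gradient step that moves the predictor toward the target."""
        errors = self._errors(obs)
        for row, e in zip(self.predictor, errors):
            for j, xj in enumerate(obs):
                row[j] -= self.lr * 2.0 * e * xj


rnd = RNDBonus(obs_dim=4)
familiar = [1.0, 0.0, 0.0, 0.0]  # observation the agent keeps revisiting
novel = [0.0, 1.0, 0.0, 0.0]     # observation it has never trained on

before = rnd.bonus(familiar)
novel_before = rnd.bonus(novel)
for _ in range(50):
    rnd.update(familiar)
after = rnd.bonus(familiar)
novel_after = rnd.bonus(novel)
```

After the loop, the bonus for the familiar observation has decayed toward zero while the bonus for the novel one is unchanged, which is the signal that pushes an agent toward unvisited states. Practical RND implementations use neural networks for both target and predictor and normalize the intrinsic reward; those details are omitted here.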


This review was generated by AI and checked by human editors.