QUICK REVIEW

[Paper Review] Towards Interpretable Reinforcement Learning Using Attention Augmented Agents

A. Mott, Daniel Zoran|arXiv (Cornell University)|Jun 6, 2019

Multimodal Machine Learning Applications | Cited by 42

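One-Sentence Summary

This paper introduces a soft, top-down attention reinforcement learning agent for Atari that uses an explicit attention bottleneck to make its decisions more interpretable while achieving competitive performance.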

ABSTRACT

Inspired by recent work in attention models for image captioning and question answering, we present a soft attention model for the reinforcement learning domain. This model uses a soft, top-down attention mechanism to create a bottleneck in the agent, forcing it to focus on task-relevant information by sequentially querying its view of the environment. The output of the attention mechanism allows direct observation of the information used by the agent to select its actions, enabling easier interpretation of this model than of traditional models. We analyze different strategies that the agents learn and show that a handful of strategies arise repeatedly across different games. We also show that the model learns to query separately about space and content ("where" vs. "what"). We demonstrate that an agent using this mechanism can achieve performance competitive with state-of-the-art models on ATARI tasks while still being interpretable.

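Motivation and Objectives

Advance interpretable RL by introducing an attention bottleneck that exposes which information the agent uses.
Develop a soft attention mechanism with top-down queries that selectively retrieves task-relevant information from the visual input.
Show that the attention maps reveal consistent strategies and that the model generalizes to novel states while maintaining performance on Atari tasks.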

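Proposed Method

Propose a soft attention model in which an LSTM-based query network generates multiple attention heads over the output of the vision core.
Split the vision core output into Keys and Values, augment them with a fixed spatial basis, then compute attention via inner products followed by a spatial softmax.
Aggregate the attended Values to produce the input to the LSTM-based policy and value estimation pipeline.
Train end to end with backpropagation, using an IMPALA-style actor-learner architecture and the V-trace loss.
Compare against non-attention baselines (Feedforward baseline and LSTM baseline) to assess what attention contributes in performance and interpretability.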
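
The attention read described above (inner product between each query and the Keys, a softmax over spatial positions, then a weighted sum of the Values) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `attend` and `spatial_softmax` are hypothetical, the spatial grid is flattened into a list of positions, and the vision core, the spatial basis, and the LSTM that would produce the queries are all omitted, so the queries here are simply given as inputs.

```python
import math


def spatial_softmax(logits):
    """Softmax over all spatial positions (a flattened H*W list)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys, values):
    """Soft attention read (illustrative sketch, hypothetical names).

    queries: N query vectors, one per attention head, each of length C_k
    keys:    P key vectors, one per spatial position, each of length C_k
    values:  P value vectors, one per spatial position, each of length C_v
    Returns (answers, maps): N answer vectors of length C_v and
    N attention maps of length P, each summing to 1.
    """
    answers, maps = [], []
    n_value_channels = len(values[0])
    for q in queries:
        logits = [dot(q, k) for k in keys]       # query-key inner products
        weights = spatial_softmax(logits)        # one attention map per head
        answer = [sum(w * v[c] for w, v in zip(weights, values))
                  for c in range(n_value_channels)]  # weighted sum of Values
        answers.append(answer)
        maps.append(weights)
    return answers, maps


# Toy data: 4 spatial positions, 2 key channels, 1 value channel.
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
queries = [[5.0, 5.0], [0.0, 0.0]]  # head 0 is selective, head 1 is uniform
answers, maps = attend(queries, keys, values)
```

In this toy setup the all-zero query yields a uniform attention map, while the first query concentrates most of its weight on the last position, whose key matches it best. Because each map is an explicit distribution over positions, it can be inspected directly, which is the interpretability argument the summary makes for the bottleneck.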

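Experimental Results

Research Questions

RQ1: Can a top-down soft attention mechanism provide an interpretable bottleneck in reinforcement learning without sacrificing performance?
RQ2: Do the attention maps reveal meaningful, task-relevant points of focus (e.g., the player, enemies, triggers) and show generalization to unseen states?
RQ3: How do the attention heads separate into "what" and "where" components, and how does this affect decision making?
RQ4: Does incorporating top-down attention help more than bottom-up saliency analysis in visualizing and understanding the agent's policy and value estimates?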

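Key Findings

The attention agent achieves performance comparable to state-of-the-art baselines on ATARI tasks (see Table 1).
Attention heads reveal interpretable patterns, such as focusing on the player, enemies, power-ups, and the score; some heads perform forward planning/scanning.
The agent generalizes to novel visual configurations (e.g., injected objects) and attends to the new information in a causally driven way rather than relying on memorized patterns.
A notable mix of "what" and "where" queries is observed: some heads track objects, while others act as triggers or scan the horizon.
Compared with a bottom-up attention variant, top-down attention shows better alignment between policy and value saliency, supporting its interpretability advantage.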
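
One way to make the "what" vs. "where" distinction concrete is to split each attention logit into a content term and a spatial term. The sketch below assumes that each key is the concatenation of content channels and spatial-basis channels, so that a query splits at the same index; this summary only states that a fixed spatial basis is added, so the concatenation layout, the function name `split_logits`, and the toy numbers are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_logits(query, content_keys, spatial_basis):
    """Return per-position ("what", "where") logit contributions.

    Assumption (not confirmed by this summary): each key is the
    concatenation [content_key, spatial_basis], so the query splits at
    the same index and the full logit is the sum of the two terms.
    """
    n_content = len(content_keys[0])
    q_what, q_where = query[:n_content], query[n_content:]
    what = [dot(q_what, k) for k in content_keys]     # content contribution
    where = [dot(q_where, s) for s in spatial_basis]  # location contribution
    return what, where


# Toy example: 3 positions, 2 content channels, 1 spatial channel.
content_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spatial_basis = [[-1.0], [0.0], [1.0]]
query = [0.5, 0.5, 2.0]
what, where = split_logits(query, content_keys, spatial_basis)
```

Under this assumption, a head whose "where" term dominates behaves like a location-based scanner or trigger, while a head whose "what" term dominates follows objects regardless of position. Comparing the magnitudes of the two lists per head is one simple, hedged way to explore the mix reported above.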

This review was generated by AI and reviewed by human editors.