QUICK REVIEW

[Paper Review] The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning

Yujin Tang, David Ha|arXiv (Cornell University)|Sep 7, 2021

Cellular Automata and Applications|Cited by 27

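One-Sentence Summary

This paper introduces AttentionNeuron, a permutation-invariant architecture for reinforcement learning in which each sensory input is processed by its own identical module and the resulting information is aggregated through an attention mechanism into a global policy. It stays robust to arbitrary input orderings and noisy channels, performs well on several reinforcement learning tasks, and can use behavior cloning to convert existing policies into permutation-invariant ones.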

ABSTRACT

In complex systems, we often observe complex global behavior emerge from a collection of agents interacting with each other in their environment, with each individual agent acting only on locally available information, without knowing the full picture. Such systems have inspired development of artificial intelligence algorithms in areas such as swarm optimization and cellular automata. Motivated by the emergence of collective behavior from complex cellular systems, we build systems that feed each sensory input from the environment into distinct, but identical neural networks, each with no fixed relationship with one another. We show that these sensory networks can be trained to integrate information received locally, and through communication via an attention mechanism, can collectively produce a globally coherent policy. Moreover, the system can still perform its task even if the ordering of its inputs is randomly permuted several times during an episode. These permutation invariant systems also display useful robustness and generalization properties that are broadly applicable. Interactive demo and videos of our results: https://attentionneuron.github.io/

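Motivation and Objectives

Motivate learning systems in which global behavior emerges from agents that act only on local information, without relying on a fixed input ordering.
Develop a permutation-invariant architecture that handles sensory inputs in arbitrary order.
Demonstrate robustness and generalization when inputs are permuted or corrupted by noise.
Explore training schemes, including behavior cloning, that convert existing policies into permutation-invariant form.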

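Proposed Method

Each observation is treated as an unordered, variable-length list of inputs, and each input is processed by a shared sensory neuron module.
Inside AttentionNeuron, each sensory neuron computes the messages f_k(o_t[i], a_{t-1}) and f_v(o_t[i]); the functions f_k and f_v are shared across all neurons.
An attention mechanism aggregates these messages into a global latent code m_t, which is invariant to permutations of the inputs.
The attention uses a fixed bank of queries Q and learns K(o_t, a_{t-1}) and V(o_t), computing m_t with a Transformer-style attention equation.
Because Q is decoupled from the input, reordering the inputs reorders the rows of K and V together and leaves m_t unchanged; the size of m_t is also set by Q, so it does not depend on the number of inputs.
For vision tasks, input patches are processed in the same way, with temporal memory in f_k and an optional normalization step to stabilize learning.
The method is evaluated on CartPole, PyBullet Ant, Atari Pong, and CarRacing; the paper gives the input representations and network dimensions.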
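The aggregation step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name AttentionNeuronSketch, the layer sizes, the random weights, and the use of single linear maps with tanh in place of f_k and f_v are all hypothetical, and the sketch is stateless (it keeps no memory between time steps). It only shows why a query bank that ignores the input makes m_t independent of input order and of input count.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AttentionNeuronSketch:
    """Hypothetical, stateless stand-in for the layer described above."""

    def __init__(self, n_queries=4, d_key=8, d_val=6, act_dim=1, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed query bank Q: it never looks at the observation.
        self.Q = rng.normal(size=(n_queries, d_key))
        # Shared weights standing in for f_k and f_v (assumed linear + tanh).
        self.W_k = rng.normal(size=(1 + act_dim, d_key))
        self.W_v = rng.normal(size=(1, d_val))

    def __call__(self, obs, prev_action):
        obs = np.asarray(obs, dtype=float).reshape(-1, 1)             # (N, 1)
        prev = np.tile(np.asarray(prev_action, dtype=float),
                       (obs.shape[0], 1))                             # (N, act_dim)
        # Every input goes through the same key and value functions.
        K = np.tanh(np.concatenate([obs, prev], axis=1) @ self.W_k)   # (N, d_key)
        V = np.tanh(obs @ self.W_v)                                   # (N, d_val)
        scores = self.Q @ K.T / np.sqrt(K.shape[1])                   # (M, N)
        # Softmax over the N inputs, then a weighted sum of the values.
        return softmax(scores, axis=1) @ V                            # (M, d_val)


layer = AttentionNeuronSketch()
obs = np.array([0.1, -0.4, 0.7, 0.2, -0.9])
prev_action = np.array([0.5])

m = layer(obs, prev_action)
perm = np.random.default_rng(1).permutation(len(obs))
m_shuffled = layer(obs[perm], prev_action)

# Shuffling the inputs permutes the rows of K and V together, so m is unchanged.
assert np.allclose(m, m_shuffled)
# Dropping inputs changes the values in m but not its shape.
assert layer(obs[:3], prev_action).shape == m.shape
```

The output has one row per query, so a downstream policy network sees a fixed-size input however many sensory inputs arrive.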

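Experimental Results

Research Questions

RQ1: Can a neural architecture process permuted input streams of arbitrary length and still produce a coherent global policy?
RQ2: How does permutation-invariant processing affect robustness to input noise and to unseen permutations of the observations?
RQ3: How does permutation invariance affect generalization to new backgrounds or visual changes?
RQ4: Can a permutation-invariant policy be learned from an existing policy through behavior cloning?
RQ5: How does the AttentionNeuron layer interact with the downstream RL policy across different environments?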

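Key Findings

Agents using AttentionNeuron still complete their tasks even when the inputs are randomly permuted during an episode.
The permutation-invariant representation improves robustness and generalization to unseen situations and noisy inputs.
In vision tasks, the model works with only a subset of the patches, and adding extra patches at test time still helps.
Behavior cloning can convert a non-permutation-invariant policy into a permutation-invariant one, and larger downstream networks give better behavior cloning performance on high-dimensional observations.
The method handles a variable number of inputs, and qualitative visualizations and t-SNE embeddings provide evidence that the attention-based organization of inputs is meaningful.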
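The findings on shuffled patches and partial patch sets can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the helper names to_patches and pooled_code, the 12x12 random image, the 4x4 patch size, and the random weights are hypothetical, and the sketch omits temporal memory, the previous action, and normalization. It shows only the mechanics, namely that attention pooling with input-independent queries gives the same code for shuffled patches and a same-shaped code for a subset of patches; it says nothing about task performance.

```python
import numpy as np


def to_patches(img, p):
    """Split an (H, W) image into non-overlapping p x p patches, one per row."""
    h, w = img.shape
    assert h % p == 0 and w % p == 0, "image size must be a multiple of p"
    return (img.reshape(h // p, p, w // p, p)
               .transpose(0, 2, 1, 3)
               .reshape(-1, p * p))


def pooled_code(patches, Q, W_k, W_v):
    """Attention pooling over patches with queries that ignore the input."""
    K = np.tanh(patches @ W_k)                    # (N, d_key)
    V = np.tanh(patches @ W_v)                    # (N, d_val)
    s = Q @ K.T / np.sqrt(K.shape[1])             # (M, N)
    s = s - s.max(axis=1, keepdims=True)          # stabilize the softmax
    w = np.exp(s)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V                                  # (M, d_val)


rng = np.random.default_rng(0)
img = rng.random((12, 12))                        # stand-in for one game frame
patches = to_patches(img, 4)                      # 9 patches of 16 pixels each

# Hypothetical random weights; a trained model would learn W_k and W_v.
Q = rng.normal(size=(5, 8))
W_k = rng.normal(size=(16, 8))
W_v = rng.normal(size=(16, 3))

full = pooled_code(patches, Q, W_k, W_v)
shuffled = pooled_code(patches[rng.permutation(len(patches))], Q, W_k, W_v)
keep = rng.choice(len(patches), size=6, replace=False)
subset = pooled_code(patches[keep], Q, W_k, W_v)

assert np.allclose(full, shuffled)                # patch order does not matter
assert subset.shape == full.shape                 # fewer patches, same code shape
```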

This review was generated by AI and reviewed by human editors.