QUICK REVIEW

[Paper Review] Adversarial Policies: Attacking Deep Reinforcement Learning

Adam Gleave, Michael D. Dennis|arXiv (Cornell University)|May 25, 2019

Adversarial Robustness in Machine Learning|References: 33|Cited by: 92

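One-Sentence Summary

The paper shows that an adversarial policy acting in a shared environment can reliably defeat a fixed victim RL policy by inducing out-of-distribution observations, especially in high-dimensional settings. It also analyzes why the attack works and explores defenses.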

ABSTRACT

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. Videos are available at https://adversarialpolicies.github.io/.

Motivation and Objectives

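Introduce a physically realistic threat model in which the attacker controls one agent in a zero-sum Markov game instead of directly modifying the victim's observations.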
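Demonstrate the existence of adversarial policies that reliably defeat state-of-the-art victims trained via self-play.
Analyze how adversarial policies manipulate observations and victim network activations to cause failure.
Study the role of observation dimensionality and run ablations to understand the prospects for defense.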

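Proposed Method

Model the victim and the attacker as players in a two-player Markov game, with the victim's policy held fixed. The attacker then solves a single-agent reinforcement learning problem: maximize its discounted reward in the environment obtained by embedding the victim's policy into the dynamics.
Train adversarial policies against the fixed, black-box victim using Proximal Policy Optimization (PPO).
Evaluate the adversaries in zero-sum simulated-robotics games with proprioceptive observations (Kick and Defend, You Shall Not Pass, Sumo Humans, Sumo Ants).
Compare the adversaries against baselines (Rand, Zero, and Zoo policies), and measure win rates against the median victim over the course of training.
Analyze victim activations with Gaussian mixture models and t-SNE to understand the distribution shift that adversaries induce.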
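
The embedding step in the first item can be illustrated with a minimal sketch. It is not the authors' code: `MatchingGame`, `EmbeddedVictimEnv`, `fixed_victim`, and `rollout` are hypothetical names, the game is a toy two-action zero-sum game instead of a simulated robotics task, and the learning step (PPO in the paper) is left out. The point is only that, once the victim is fixed and queried as a black box, the attacker faces an ordinary single-agent environment.

```python
from typing import Callable, Tuple


class MatchingGame:
    """Hypothetical toy zero-sum game (not from the paper).

    Each round both players pick 0 or 1. The attacker gets +1 on a match
    and -1 otherwise. Each player observes the other's previous action.
    """

    def __init__(self, horizon: int = 10):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Tuple[int, int]:
        self.t = 0
        return 0, 0  # (attacker_obs, victim_obs)

    def step(self, attacker_action: int, victim_action: int):
        self.t += 1
        attacker_reward = 1.0 if attacker_action == victim_action else -1.0
        done = self.t >= self.horizon
        # Attacker sees the victim's action; victim sees the attacker's.
        return (victim_action, attacker_action), attacker_reward, done


class EmbeddedVictimEnv:
    """Single-agent view of the two-player game with the victim held fixed."""

    def __init__(self, game: MatchingGame, victim_policy: Callable[[int], int]):
        self.game = game
        self.victim_policy = victim_policy  # only queried, never inspected
        self._victim_obs = 0

    def reset(self) -> int:
        attacker_obs, self._victim_obs = self.game.reset()
        return attacker_obs

    def step(self, attacker_action: int):
        # The victim's move is computed inside the environment dynamics.
        victim_action = self.victim_policy(self._victim_obs)
        (attacker_obs, self._victim_obs), reward, done = self.game.step(
            attacker_action, victim_action
        )
        return attacker_obs, reward, done


def fixed_victim(obs: int) -> int:
    # Hypothetical deterministic victim: repeat the opponent's last action.
    return obs


def rollout(env: EmbeddedVictimEnv, policy: Callable[[int], int],
            gamma: float = 0.99) -> float:
    """Discounted return of an attacker policy in the embedded environment."""
    obs, done, ret, discount = env.reset(), False, 0.0, 1.0
    while not done:
        obs, reward, done = env.step(policy(obs))
        ret += discount * reward
        discount *= gamma
    return ret


env = EmbeddedVictimEnv(MatchingGame(horizon=10), fixed_victim)
always_zero_return = rollout(env, lambda obs: 0, gamma=1.0)
```

Any single-agent RL algorithm could be pointed at `EmbeddedVictimEnv` in place of the hand-written attacker policy used here. The wrapper never needs the victim's weights or gradients, which matches the black-box assumption stated above.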

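Experimental Results

Research Questions

RQ1: Do adversarial policies exist in multi-agent, physically realistic RL environments where the attacker cannot directly modify the victim's observations?
RQ2: Can an adversarial policy outperform the pretrained Zoo baselines against victims trained via self-play?
RQ3: Which mechanisms (observation manipulation, activation shift) allow adversarial policies to defeat victims, and how does observation dimensionality affect vulnerability?
RQ4: Do defenses such as fine-tuning against the adversary mitigate the attack, and can a new adversary still defeat the defended victim?

Key Findings

Adversarial policies reliably defeat victim policies in multiple environments, usually with win rates above those of the Zoo baselines.
Adversaries win by creating natural observations that are adversarial, which induce out-of-distribution activations in the victim network, and not by becoming generally strong opponents.
Higher observation dimensionality increases vulnerability to adversarial policies (for example, Sumo Humans is more vulnerable than Sumo Ants).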
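Masking the victim's view of the opponent's position makes the victim weaker against normal opponents but far more robust to the adversary, which reveals non-transitive relationships between policies.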
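Fine-tuning provides a partial defense against a specific adversary, but a new adversary trained against the fine-tuned victim can still succeed.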
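
The out-of-distribution finding can be made concrete with a small density check. This is a simplified sketch under stated assumptions, not the paper's analysis: it fits a single diagonal Gaussian instead of a Gaussian mixture model, the "activations" are synthetic random vectors and not recordings from a victim network, and `fit_diag_gaussian` and `mean_log_likelihood` are hypothetical helper names. The idea is to fit a density model on activations recorded against normal opponents, then compare the average log-likelihood of held-out normal activations with that of activations recorded against the adversary.

```python
import math
import random


def fit_diag_gaussian(samples):
    """Per-dimension mean and variance of equal-length vectors."""
    n, dim = len(samples), len(samples[0])
    means = [sum(x[d] for x in samples) / n for d in range(dim)]
    variances = [
        # Floor the variance to avoid division by zero.
        max(sum((x[d] - means[d]) ** 2 for x in samples) / n, 1e-6)
        for d in range(dim)
    ]
    return means, variances


def mean_log_likelihood(samples, means, variances):
    """Average log-density of the samples under the diagonal Gaussian."""
    total = 0.0
    for x in samples:
        for xd, m, v in zip(x, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (xd - m) ** 2 / v)
    return total / len(samples)


rng = random.Random(0)
dim = 8


def synthetic_activations(mean, count):
    # Synthetic stand-in for victim activations (assumption, not real data).
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]


normal_train = synthetic_activations(0.0, 500)
normal_test = synthetic_activations(0.0, 500)
adversarial = synthetic_activations(3.0, 500)  # shifted on purpose

means, variances = fit_diag_gaussian(normal_train)
ll_normal = mean_log_likelihood(normal_test, means, variances)
ll_adversarial = mean_log_likelihood(adversarial, means, variances)
```

A much lower `ll_adversarial` than `ll_normal` is the kind of signal that would indicate activations far from those seen in normal play. With real activations, a mixture model with several components would usually be a better fit than the single Gaussian used in this sketch.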


This review was generated by AI and checked by a human editor.