QUICK REVIEW

[Paper Digest] Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations

Huan Zhang, Hongge Chen|arXiv (Cornell University)|Mar 19, 2020

Adversarial Robustness in Machine Learning | Cited by 111

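One-Sentence Summary

This paper proposes a state-adversarial MDP (SA-MDP) framework and a principled policy regularizer that make DRL more robust to adversarial perturbations of state observations, and demonstrates improvements for PPO, DDPG, and DQN under strong white-box attacks.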

ABSTRACT

A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises. Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions. Several works have shown this vulnerability via adversarial attacks, but existing approaches on improving the robustness of DRL under this setting have limited success and lack for theoretical principles. We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks. We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled policy regularization which can be applied to a large family of DRL algorithms, including proximal policy optimization (PPO), deep deterministic policy gradient (DDPG) and deep Q networks (DQN), for both discrete and continuous action control problems. We significantly improve the robustness of PPO, DDPG and DQN agents under a suite of strong white box adversarial attacks, including new attacks of our own. Additionally, we find that a robust policy noticeably improves DRL performance even without an adversary in a number of environments. Our code is available at https://github.com/chenhongge/StateAdvDRL.

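Motivation and Objectives

Motivate and formalize the robustness of DRL to adversarial perturbations of state observations.
Introduce the SA-MDP to capture worst-case observation perturbations and analyze its fundamental properties.
Propose a theoretically principled robust policy regularizer applicable to multiple DRL algorithms (PPO, DDPG, DQN).
Demonstrate empirical robustness gains under strong white-box attacks across diverse environments.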

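Proposed Method

Define the SA-MDP, in which an adversary perturbs the observation through a deterministic, stationary function nu(s) constrained to a perturbation set B(s); the agent acts on nu(s) while the environment still evolves from the true state s.
Derive the SA-MDP Bellman equations for a fixed policy and adversary, together with a contraction result for the optimal adversary.
Propose a robust policy regularizer based on total variation/KL divergence that bounds the policy's sensitivity to perturbations (Equations 5, 6, 8).
Specialize the regularizer to stochastic policies (PPO) using a KL-based bound, with the inner maximization solved by convex relaxation or SGLD (Section 3.2).
Specialize the regularizer to deterministic policies (DDPG) by smoothing actions with Gaussian noise, which yields a tractable DDPG regularizer (Equation 6).
Specialize the regularizer to DQN with a hinge-like term that keeps the top-1 (greedy) action unchanged under perturbation (Equation 8).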
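The DQN and DDPG regularizers described above can be illustrated with a minimal NumPy sketch. Several things here are assumptions made for illustration and are not taken from the paper's code: B(s) is taken to be an l_inf ball of radius `eps`, the toy linear `q_fn` and `pi_fn` stand in for real networks, and the inner maximization over B(s) is approximated by plain random sampling. The paper solves that maximization with convex relaxation or SGLD; random sampling is a deliberately crude stand-in that only lower-bounds the true maximum.

```python
import numpy as np

def sample_ball(s, eps, n, rng):
    """Draw n points from the l_inf ball B(s) = {s_hat : ||s_hat - s||_inf <= eps}."""
    return s + rng.uniform(-eps, eps, size=(n, s.shape[0]))

def dqn_hinge_regularizer(q_fn, s, eps, c=1.0, n=256, seed=0):
    """Hinge-like term: max( max_{s_hat} [max_{a != a*} Q(s_hat, a) - Q(s_hat, a*)], -c ).

    a* is the greedy (top-1) action at the clean state s. Minimizing this term
    pushes a* to stay on top by a margin of at least c for every sampled s_hat.
    """
    rng = np.random.default_rng(seed)
    a_star = int(np.argmax(q_fn(s[None, :])[0]))
    q_hat = q_fn(sample_ball(s, eps, n, rng))            # shape (n, num_actions)
    best_other = np.delete(q_hat, a_star, axis=1).max(axis=1)
    margin = best_other - q_hat[:, a_star]               # > 0 means the greedy action flipped
    return float(max(margin.max(), -c))

def ddpg_l2_regularizer(pi_fn, s, eps, n=256, seed=0):
    """Largest sampled l2 change of a deterministic action: max_{s_hat} ||pi(s) - pi(s_hat)||_2."""
    rng = np.random.default_rng(seed)
    a = pi_fn(s[None, :])[0]
    a_hat = pi_fn(sample_ball(s, eps, n, rng))           # shape (n, action_dim)
    return float(np.linalg.norm(a_hat - a, axis=1).max())

# Hypothetical toy models: a linear Q-function and a tanh-squashed linear actor.
W = np.eye(2)
q_fn = lambda x: x @ W
pi_fn = lambda x: np.tanh(x @ W)
s = np.array([2.0, 0.0])

print(dqn_hinge_regularizer(q_fn, s, eps=0.0))   # no perturbation: margin is clipped at -c
print(dqn_hinge_regularizer(q_fn, s, eps=3.0))   # large ball: some s_hat flips the greedy action
print(ddpg_l2_regularizer(pi_fn, s, eps=0.5))    # sampled worst-case action change
```

In training, a term like this would be added to the usual loss with a weighting coefficient and averaged over a batch of states; the sketch evaluates it for a single state only.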

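Experimental Results

Research Questions

RQ1: How do adversarial perturbations of state observations affect DRL policies trained with standard algorithms such as PPO, DDPG, and DQN?
RQ2: Can a theoretically grounded regularizer be designed that improves robustness in both discrete and continuous action spaces?
RQ3: Is there a principled framework (SA-MDP) that explains the limits of robustness and guides algorithmic interventions?
RQ4: Do regularized policies retain their performance in non-adversarial environments while gaining adversarial robustness?
RQ5: Which adversarial attack strategies effectively expose robustness gaps in DRL agents?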

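Key Findings

The SA-MDP framework shows that an optimal adversary can defeat a stationary optimal policy, which motivates robust regularization.
The KL/TV-based regularizer is closely tied to the policy divergence induced by perturbations, and it reduces the performance loss under attack.
Regularized PPO, DDPG, and DQN agents show significant robustness gains under strong white-box attacks, including the new Robust Sarsa (RS) and Maximal Action Difference (MAD) attacks.
In some environments, regularization improves performance even without an adversary, suggesting benefits beyond the adversarial setting.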
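The first finding can be made concrete on a tiny tabular SA-MDP. The sketch below is illustrative only: the two-state environment, the helper names, and the choice of B(s) are all hypothetical and not taken from the paper. It evaluates a fixed policy under a fixed adversary by solving the SA-MDP Bellman equation as a linear system, then finds the optimal adversary for that policy by iterating the minimizing Bellman operator, which is the contraction mentioned in the method summary.

```python
import numpy as np

def evaluate(P, R, pi, nu, gamma):
    """Value of pi when the agent at true state s observes nu[s] and acts on it."""
    S = R.shape[0]
    act = pi[nu]                                  # (S, A): action distribution used at each true state
    r = np.sum(act * R, axis=1)                   # expected one-step reward at each true state
    P_pi = np.einsum("sa,sat->st", act, P)        # state-to-state transitions under pi composed with nu
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r)

def observed_values(P, R, pi, V, gamma):
    """M[s, s_hat]: value at true state s if the agent acts as though it saw s_hat."""
    Q = R + gamma * P @ V                         # (S, A)
    return Q @ pi.T                               # (S, S)

def optimal_adversary(P, R, pi, B, gamma, iters=500):
    """Iterate V(s) <- min_{s_hat in B[s]} sum_a pi(a|s_hat) Q(s, a) for a fixed policy pi."""
    S = R.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        M = observed_values(P, R, pi, V, gamma)
        V = np.array([M[s, B[s]].min() for s in range(S)])
    M = observed_values(P, R, pi, V, gamma)
    nu = np.array([B[s][int(np.argmin(M[s, B[s]]))] for s in range(S)])
    return V, nu

# Hypothetical toy problem: action 0 pays off in state 0, action 1 pays off in state 1,
# and the next state is uniform regardless of the action taken.
gamma = 0.9
R = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.full((2, 2, 2), 0.5)
B = [[0, 1], [0, 1]]                              # the adversary may show either state

pi_det = np.array([[1.0, 0.0], [0.0, 1.0]])       # optimal when observations are truthful
pi_uni = np.full((2, 2), 0.5)                     # ignores the observation entirely

V_clean = evaluate(P, R, pi_det, np.array([0, 1]), gamma)
V_det, nu_det = optimal_adversary(P, R, pi_det, B, gamma)
V_uni, _ = optimal_adversary(P, R, pi_uni, B, gamma)
print(V_clean, V_det, nu_det, V_uni)
```

In this toy problem the adversary simply swaps the two observations, so the deterministic policy that is optimal without an adversary earns nothing, while a uniform random policy is unaffected. This is specific to the example, but it shows why a policy that is optimal in the clean MDP offers no guarantee in the SA-MDP.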

This digest was generated by AI and reviewed by human editors.