QUICK REVIEW

[Paper Review] Detecting Adversarial Attacks on Neural Network Policies with Visual Foresight

Yen-Chen Lin, Ming-Yu Liu | arXiv (Cornell University) | Oct 2, 2017

Adversarial Robustness in Machine Learning | References: 41 | Cited by: 34

One-Sentence Summary

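This paper proposes a defense mechanism that detects adversarial attacks on deep reinforcement learning policies. It uses an action-conditioned frame prediction model and compares the action distribution the policy produces for the observed frame with the one it produces for the predicted frame. The method is robust at detecting adversarial attacks. When it detects an adversarial input, it switches to acting on the predicted frame, which lets the agent maintain performance and outperform baseline methods in Atari 2600 environments.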

ABSTRACT

Deep reinforcement learning has shown promising results in learning control policies for complex sequential decision-making tasks. However, these neural network-based policies are known to be vulnerable to adversarial examples. This vulnerability poses a potentially serious threat to safety-critical systems such as autonomous vehicles. In this paper, we propose a defense mechanism to defend reinforcement learning agents from adversarial attacks by leveraging an action-conditioned frame prediction module. Our core idea is that the adversarial examples targeting a neural network-based policy are not effective for the frame prediction model. By comparing the action distribution produced by a policy from processing the current observed frame to the action distribution produced by the same policy from processing the predicted frame from the action-conditioned frame prediction module, we can detect the presence of adversarial examples. Beyond detecting the presence of adversarial examples, our method allows the agent to continue performing the task using the predicted frame when the agent is under attack. We evaluate the performance of our algorithm using five games in Atari 2600. Our results demonstrate that the proposed defense mechanism achieves favorable performance against baseline algorithms in detecting adversarial examples and in earning rewards when the agents are under attack.
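The comparison described in the abstract can be written compactly as follows. The notation is ours, not the paper's:

- x_t and a_t are the observed frame and the action at step t.
- G is the action-conditioned frame predictor, conditioned on the previous m frames and actions.
- π is the policy.
- D is a distance between action distributions.
- τ is a detection threshold.

This summary does not say which distance, threshold, or action-selection rule the authors use.

```latex
\hat{x}_t = G\left(x_{t-m:t-1},\; a_{t-m:t-1}\right), \qquad
p_t = \pi(\cdot \mid x_t), \qquad
\hat{p}_t = \pi(\cdot \mid \hat{x}_t)

\mathrm{attack}_t = \mathbb{1}\left[\, D\left(p_t, \hat{p}_t\right) > \tau \,\right], \qquad
a_t \sim \begin{cases} \hat{p}_t & \text{if } \mathrm{attack}_t = 1,\\ p_t & \text{otherwise.} \end{cases}
```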

Research Motivation and Objectives

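Address the vulnerability of deep neural network-based reinforcement learning policies to adversarial examples in safety-critical applications such as autonomous driving.
Develop a defense mechanism, based on temporal consistency and action-conditioned frame prediction, that detects adversarial inputs in sequential decision-making tasks.
Let the agent keep performing the task under attack by choosing actions from the predicted frame instead of the corrupted observation.
Build a model-agnostic defense that needs no adversarial examples during training and works across a variety of deep neural network-based policies.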

Proposed Method

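Train an action-conditioned frame prediction model (the visual foresight module) that predicts the current frame from previous frames and actions.
Feed the predicted frame into the same policy network and compare the resulting action distribution with the one obtained from the observed frame.
Flag an adversarial attack when the action distributions for the observed and predicted frames diverge significantly.
When an attack is detected, switch from the observed frame to the predicted frame so the agent can keep acting.
Exploit temporal consistency across multiple frames and actions to make detection more robust to single-frame adversarial perturbations.
Use the mean squared error (MSE) of frame prediction as a proxy for model accuracy, which correlates closely with detection performance.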
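The detect-and-switch steps above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation:

- `toy_policy` and `toy_predictor` are hypothetical stand-ins for a trained policy network and the visual foresight module.
- Jensen-Shannon divergence is our choice of distance between action distributions.
- The threshold 0.1 is arbitrary.
- Storing the predicted frame in the history after a detection is also an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Frame = List[float]
Dist = List[float]


def softmax(logits: Sequence[float]) -> Dist:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p: Dist, q: Dist) -> float:
    """Jensen-Shannon divergence in nats (0 for identical inputs, at most ln 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a: Dist, b: Dist) -> float:
        # Terms with a_i = 0 contribute 0; b_i > 0 whenever a_i > 0 because b = m.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def toy_policy(frame: Frame) -> Dist:
    # Hypothetical policy: treats the frame's features directly as action logits.
    return softmax(frame)


def toy_predictor(frames: List[Frame], actions: List[int]) -> Frame:
    # Hypothetical stand-in for a trained action-conditioned frame predictor:
    # copies the last frame and nudges the feature of the last action by 0.1.
    nxt = list(frames[-1])
    nxt[actions[-1]] += 0.1
    return nxt


def detect_and_act(
    policy: Callable[[Frame], Dist],
    predictor: Callable[[List[Frame], List[int]], Frame],
    frames: List[Frame],
    actions: List[int],
    observed: Frame,
    threshold: float,
) -> Tuple[int, bool, Frame]:
    """Compare pi(observed) with pi(predicted); fall back to the prediction if they disagree."""
    predicted = predictor(frames, actions)
    p_obs, p_pred = policy(observed), policy(predicted)
    attacked = js_divergence(p_obs, p_pred) > threshold
    used = predicted if attacked else observed
    dist = p_pred if attacked else p_obs
    action = max(range(len(dist)), key=dist.__getitem__)  # greedy action
    return action, attacked, used


def run(policy, predictor, frames, actions, observations, threshold=0.1):
    """Step through observations; returns (action, attacked) per step."""
    frames, actions, log = list(frames), list(actions), []
    for obs in observations:
        action, attacked, used = detect_and_act(policy, predictor, frames, actions, obs, threshold)
        # Assumption: the history stores the frame the agent actually acted on.
        frames.append(used)
        actions.append(action)
        log.append((action, attacked))
    return log


if __name__ == "__main__":
    # Step 1 is clean; steps 2-3 carry a perturbation that flips the preferred action.
    obs = [[2.05, 0.0], [0.0, 3.0], [0.0, 3.0]]
    print(run(toy_policy, toy_predictor, [[2.0, 0.0]], [0], obs))
    # -> [(0, False), (0, True), (0, True)]
```

On the perturbed steps, the undefended policy would pick action 1. The sketch detects the disagreement and keeps choosing action 0 from the predicted frame. In a real system, prediction error can accumulate while the agent acts on its own predictions for many steps, which is one reason the predictor's accuracy matters.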

Experimental Results

Research Questions

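RQ1: Can temporal consistency and action-conditioned frame prediction be used to detect adversarial examples against deep reinforcement learning policies?
RQ2: How does the accuracy of the frame prediction model affect adversarial example detection performance?
RQ3: Can the agent maintain task performance under sustained adversarial attack by relying on predicted frames?
RQ4: How does the method compare with existing adversarial detection methods from image classification in a sequential decision-making setting?
RQ5: Does the defense remain effective against adaptive attackers who know about the detection mechanism?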

Key Findings

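Compared with strong baseline detectors from image classification, the proposed defense achieves a higher mean average precision (mAP) in detecting adversarial examples.
Detection performance correlates closely with frame prediction accuracy: a lower frame prediction MSE goes with a higher mAP.
In Atari 2600 environments, the agent retains high rewards by switching to predicted frames, even when a large number of time steps are attacked.
Because the frame prediction model is robust to adversarial perturbations that do not target it, the method remains effective even when previous frames may already have been perturbed.
The defense is model-agnostic and needs no adversarial examples during training, so it applies broadly to deep neural network-based policies.
Combining it with existing defenses (such as adversarial training or defensive distillation) is feasible and may be complementary, because its use of temporal information is orthogonal to those methods.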
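The findings report detection quality as mAP and prediction quality as MSE. The sketch below shows one common way to compute both. It is not taken from the paper, whose exact evaluation protocol this summary does not give. It rests on these assumptions:

- Each frame gets a detection score, for example the distance between the two action distributions.
- Binary labels mark which frames were actually perturbed.
- AP averages the precision at the rank of each attacked frame.
- mAP is the mean AP over settings such as games.

```python
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for a detector: rank frames by score (higher = more likely attacked)
    and average the precision measured at the rank of each attacked frame."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(settings: List[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """Assumption: mAP = mean of per-setting AP (e.g. one setting per game)."""
    aps = [average_precision(s, y) for s, y in settings]
    return sum(aps) / len(aps)


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted frame and the real frame (flattened)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    game_a = ([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0])  # attacked frames ranked first
    game_b = ([0.9, 0.8, 0.1], [0, 1, 1])          # a clean frame outranks them
    print(round(mean_average_precision([game_a, game_b]), 4))  # -> 0.7917
    print(mse([1.0, 2.0], [1.0, 4.0]))                         # -> 2.0
```

Ties in score are broken by input order here, which is an arbitrary choice. Library implementations may handle ties differently.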


This summary was generated by AI and reviewed by human editors.