QUICK REVIEW

[Paper Digest] On the Adversarial Robustness of Visual Transformers

Rulin Shao, Zhouxing Shi | arXiv (Cornell University) | Mar 29, 2021

Adversarial Robustness in Machine Learning | References: 54 | Cited by: 50

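One-Sentence Summary

This paper presents the first comprehensive analysis of the adversarial robustness of vision transformers (ViTs), showing that ViTs are more robust to adversarial perturbations than convolutional neural networks (CNNs); it attributes this to ViTs learning more generalizable, high-level features that carry less low-level information and are less sensitive to high-frequency noise, and finds that neither hybrid architectures nor larger models consistently improve robustness.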

ABSTRACT

Following the success in advancing natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations. Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs). We summarize the following main observations contributing to the improved robustness of ViTs: 1) Features learned by ViTs contain less low-level information and are more generalizable, which contributes to superior robustness against adversarial perturbations. 2) Introducing convolutional or tokens-to-token blocks for learning low-level features in ViTs can improve classification accuracy but at the cost of adversarial robustness. 3) Increasing the proportion of transformers in the model structure (when the model consists of both transformer and CNN blocks) leads to better robustness. But for a pure transformer model, simply increasing the size or adding layers cannot guarantee a similar effect. 4) Pre-training on larger datasets does not significantly improve adversarial robustness though it is critical for training ViTs. 5) Adversarial training is also applicable to ViT for training robust models. Furthermore, feature visualization and frequency analysis are conducted for explanation. The results show that ViTs are less sensitive to high-frequency perturbations than CNNs and there is a high correlation between how well the model learns low-level features and its robustness against different frequency-based perturbations.

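Motivation and Objectives

Compare the adversarial robustness of vision transformers (ViTs) with that of convolutional neural networks (CNNs).
Identify the architectural and training factors that affect how robust ViTs are to adversarial attacks.
Analyze the role that low-level feature learning plays in robustness to frequency-based perturbations.
Assess how effective adversarial training and pre-training on large-scale datasets are at improving ViT robustness.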

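Proposed Approach

Evaluate ViTs and CNNs extensively under white-box and transfer attacks on multiple datasets.
Use feature visualization and frequency analysis to compare how ViTs and CNNs respond to adversarial perturbations.
Modify the ViT architecture by adding convolutional or tokens-to-token blocks and measure the effect on robustness.
Vary the proportion of transformer blocks in hybrid models to study how architectural composition relates to robustness.
Apply adversarial training to ViTs and evaluate how much it improves robustness.
Analyze the correlation between low-level feature learning and sensitivity to high-frequency perturbations.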
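
As an illustration of the white-box evaluation in the first item above, the sketch below runs a projected gradient descent (PGD) attack under an L-infinity budget. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the target is a toy linear softmax classifier in NumPy standing in for a ViT or CNN, and the function names (`loss_and_grad`, `pgd_linf`) and the values of `eps`, `step`, and `iters` are hypothetical choices made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(W, b, x, y):
    """Cross-entropy loss of a linear softmax classifier and its gradient w.r.t. x.

    W has shape (d, k), b has shape (k,), x has shape (d,), y is the true class index.
    """
    p = softmax(x @ W + b)
    loss = -np.log(p[y])
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W @ (p - onehot)            # chain rule through logits = x @ W + b
    return loss, grad_x

def pgd_linf(W, b, x, y, eps=0.03, step=0.01, iters=10):
    """Untargeted PGD: ascend the loss, then project back into the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(iters):
        _, g = loss_and_grad(W, b, x_adv, y)
        x_adv = x_adv + step * np.sign(g)          # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection onto the L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep inputs in a valid pixel range
    return x_adv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(16, 3)), np.zeros(3)   # toy stand-in model (assumption)
    x, y = rng.uniform(0.2, 0.8, size=16), 0
    x_adv = pgd_linf(W, b, x, y)
    print("clean loss:", loss_and_grad(W, b, x, y)[0])
    print("adversarial loss:", loss_and_grad(W, b, x_adv, y)[0])
```

With a real network the gradient would come from automatic differentiation rather than the closed form used here. A transfer attack follows the same recipe, except that the adversarial example is crafted on one model and then evaluated on another.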

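Experimental Results

Research Questions

RQ1: How does the adversarial robustness of vision transformers compare with that of convolutional neural networks under white-box and transfer attack settings?
RQ2: Which architectural components or design choices in ViTs contribute to their improved robustness to adversarial perturbations?
RQ3: How does introducing convolutional or tokens-to-token blocks into ViTs affect their robustness and feature representations?
RQ4: To what extent does pre-training on large-scale datasets improve the adversarial robustness of ViTs?
RQ5: Can adversarial training effectively strengthen the robustness of vision transformer models?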
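
For readers unfamiliar with the adversarial training that RQ5 refers to, it is commonly written as the min-max problem below. This is the standard textbook formulation, given here as background; this digest does not state which exact variant the paper uses. The symbols are assumptions of this sketch: $f_\theta$ is the model with parameters $\theta$, $\mathcal{L}$ the classification loss, $\mathcal{D}$ the data distribution, and $\epsilon$ the perturbation budget.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\left[ \max_{\|\delta\|_\infty \le \epsilon}
\mathcal{L}\big(f_\theta(x + \delta),\, y\big) \right]
```

The inner maximization searches for a worst-case perturbation of each input, and the outer minimization trains the model on those perturbed inputs.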

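Key Findings

Across the attack settings tested, ViTs are notably more robust than CNNs, particularly under transfer attacks.
The features ViTs learn contain less low-level information and generalize better, which contributes to their robustness to adversarial perturbations.
Adding convolutional or tokens-to-token blocks to ViTs raises clean accuracy but lowers adversarial robustness, because the model becomes more sensitive to low-level features.
Increasing the proportion of transformer blocks in a hybrid model improves robustness, but simply making a pure ViT deeper or wider does not guarantee a gain.
Pre-training on larger datasets is critical for training ViTs effectively, yet it does not significantly improve their adversarial robustness.
Adversarial training works for ViTs and can be used to train robust vision transformer models.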
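
The frequency analysis mentioned in the abstract depends on separating a perturbation into low- and high-frequency parts. The sketch below shows one way this could be done with a 2D FFT and a circular low-pass mask. It is an illustrative assumption, not the paper's procedure: the helper names (`low_pass`, `high_freq_energy_fraction`) and the `radius` parameter are hypothetical, and the paper's actual filter design is not described in this digest.

```python
import numpy as np

def low_pass(delta, radius):
    """Keep only spatial frequencies within `radius` of the zero-frequency (DC) term."""
    h, w = delta.shape
    spec = np.fft.fftshift(np.fft.fft2(delta))     # shift DC to index (h//2, w//2)
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))

def high_freq_energy_fraction(delta, radius):
    """Fraction of the perturbation's squared energy that lies outside the low-pass band."""
    total = np.sum(delta ** 2)
    if total == 0:
        return 0.0
    high = delta - low_pass(delta, radius)
    return float(np.sum(high ** 2) / total)

if __name__ == "__main__":
    yy, xx = np.indices((8, 8))
    smooth = np.full((8, 8), 0.5)        # constant perturbation: pure low frequency
    checker = (-1.0) ** (yy + xx)        # checkerboard: highest representable frequency
    print(high_freq_energy_fraction(smooth, radius=2))
    print(high_freq_energy_fraction(checker, radius=2))
```

Feeding a model `x + low_pass(delta, radius)` and `x + (delta - low_pass(delta, radius))` separately gives a rough probe of which frequency band an adversarial perturbation depends on.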


This digest was generated by AI and reviewed by a human editor.