QUICK REVIEW

[Paper Review] DeepViT: Towards Deeper Vision Transformer

Daquan Zhou, Bingyi Kang | arXiv (Cornell University) | Mar 22, 2021

Advanced Neural Network Applications | References: 46 | Cited by: 348

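One-Sentence Summary

This paper analyzes why Vision Transformers (ViTs) saturate in performance as depth increases and proposes Re-attention, which regenerates diverse attention maps. This enables stable training of very deep ViTs (e.g., 32 blocks) and higher ImageNet accuracy without extra data.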

ABSTRACT

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturate fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet. Code is publicly available at https://github.com/zhoudaquan/dvit_repo.

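Motivation and Objectives

Investigate why scaling Vision Transformers in depth leads to performance saturation.
Identify the cause of attention collapse in deep ViTs.
Propose a lightweight mechanism (Re-attention) that diversifies attention across layers.
Show that deeper ViTs trained from scratch can improve accuracy on ImageNet-1k.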

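Proposed Method

Conduct an empirical study of ViT depth scaling on ImageNet to observe attention-map similarity across layers.
Define and quantify attention collapse via cross-layer attention similarity.
Introduce Re-attention, a learnable head-wise transformation that exchanges information across attention heads.
Replace multi-head self-attention (MHSA) in the ViT block with Re-attention to form the DeepViT architecture.
Compare DeepViT against state-of-the-art CNNs and ViTs on ImageNet-1k without extra data or augmentation.
Provide ablation studies on embedding dimension and depth, along with alternative attention-sharpening baselines.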
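
The head-mixing idea behind Re-attention can be illustrated with a minimal pure-Python sketch. It assumes the usual reading of the method: the per-head softmax attention maps are linearly recombined across the head axis by an H x H matrix (called `theta` here) before being applied to the values. The function names, the toy shapes, and the `random_tensor` helper are hypothetical, `theta` is fixed rather than learned, and the normalization step that the paper applies after mixing is left out, so this is an illustration and not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_maps(Q, K):
    """Standard per-head maps softmax(Q K^T / sqrt(d)).

    Q and K are nested lists shaped [heads][tokens][dim];
    the result is shaped [heads][tokens][tokens].
    """
    scale = math.sqrt(len(Q[0][0]))
    return [
        [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in k_head])
         for q in q_head]
        for q_head, k_head in zip(Q, K)
    ]


def re_attention(maps, theta):
    """Mix maps across heads: out[g] = sum_h theta[h][g] * maps[h].

    theta is an H x H matrix (learnable in training, fixed in this sketch).
    Assumption: the post-mixing normalization is omitted for brevity.
    """
    heads, tokens = len(maps), len(maps[0])
    return [
        [[sum(theta[h][g] * maps[h][i][j] for h in range(heads))
          for j in range(tokens)]
         for i in range(tokens)]
        for g in range(heads)
    ]


def attend(maps, V):
    """Apply each head's map to its values; returns [heads][tokens][dim]."""
    dim = len(V[0][0])
    return [
        [[sum(w * v[c] for w, v in zip(row, v_head)) for c in range(dim)]
         for row in head]
        for head, v_head in zip(maps, V)
    ]


def random_tensor(rng, heads, tokens, dim):
    """Hypothetical helper that produces toy Q/K/V inputs."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(tokens)]
            for _ in range(heads)]


rng = random.Random(0)
H, N, D = 3, 4, 5  # toy sizes: heads, tokens, per-head dimension
Q, K, V = (random_tensor(rng, H, N, D) for _ in range(3))
A = attention_maps(Q, K)

# An identity theta reproduces standard multi-head attention.
identity = [[1.0 if h == g else 0.0 for g in range(H)] for h in range(H)]
# A uniform theta averages all heads into every output head.
uniform = [[1.0 / H] * H for _ in range(H)]
out = attend(re_attention(A, uniform), V)
```

With `identity` as `theta`, the mixed maps equal the original ones, which makes the sketch a drop-in generalization of plain MHSA. Any `theta` whose columns sum to 1 keeps each attention row summing to 1; a learned `theta` carries no such guarantee, which is one reason a normalization step is useful after mixing.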

Experimental Results

Research Questions

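RQ1: Can ViTs gain substantially from deeper architectures as CNNs do, or does their performance saturate?
RQ2: Why do attention maps in deep ViTs become similar across layers?
RQ3: Can a lightweight mechanism reuse information across attention heads to restore diversity and allow ViTs to go deeper?
RQ4: Do DeepViT models trained from scratch on ImageNet-1k outperform existing SOTA models at comparable compute?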

Key Findings

Model	Params (M)	MAdds (G)	Top-1 Accuracy (%)
ResNet50	25	4.0	76.2
ResNet50*	25	4.0	79.0
RegNetY-8GF	40	8.0	79.3
Vit-B/16	86	17.7	77.9
Vit-B/16*	86	17.7	79.3
T2T-ViT-16	21	4.8	80.6
DeiT-S	22	-	79.8
DeepVit-S (Ours)	27	6.2	81.4
DeepVit-S ⋆ (Ours)	27	6.2	82.3
ResNet152	60	11.6	78.3
ResNet152*	60	11.6	80.6
RegNetY-16GF	54	15.9	80.0
Vit-L/16	307	-	76.5
T2T-ViT-24	64	12.6	81.8
DeiT-B	86	-	81.8
DeiT-B*	86	17.7	81.5
DeepVit-L (Ours)	55	12.5	82.2
DeepVit-L ⋆ (Ours)	58	12.8	83.1
DeepVit-L ⋆ [variant marker garbled in source] (Ours)	58	12.8	84.3

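Directly increasing ViT depth with standard MHSA yields saturating or even declining accuracy on ImageNet.
Attention maps in deeper layers become highly similar (attention collapse), which correlates with stalled feature evolution.
Re-attention linearly mixes attention maps across heads with a learnable matrix, preserving diversity and eliminating cross-layer attention collapse.
DeepViT models (32 blocks) achieve consistent accuracy gains, surpassing the baseline ViT and several SOTA CNN/ViT models on ImageNet-1k without extra data or training tricks.
Replacing MHSA with Re-attention leaves zero blocks with similar attention maps and raises the Top-1 accuracy of the 32-block model by up to 1.6 points.
DeepViT-S and DeepViT-L reach competitive or higher accuracy with fewer parameters (e.g., DeepViT-L reaches 82.2% to 83.1% depending on the variant).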
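
The "blocks with similar attention maps" count in the findings above can be made concrete with a small sketch. It assumes that similarity between two blocks is the mean cosine similarity of matching rows of their attention maps, and that a block counts as "similar" when its similarity to the previous block exceeds a cutoff. The function names, the row-wise comparison, and the 0.8 threshold are assumptions for illustration; the paper's exact definition may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_similarity(map_p, map_q):
    """Mean cosine similarity between matching rows of two [tokens][tokens] maps."""
    sims = [cosine(row_p, row_q) for row_p, row_q in zip(map_p, map_q)]
    return sum(sims) / len(sims)


def count_similar_blocks(layer_maps, threshold=0.8):
    """Count blocks whose map is similar to the preceding block's map.

    threshold is an assumed cutoff, not a value taken from the summary.
    """
    return sum(
        1
        for prev, cur in zip(layer_maps, layer_maps[1:])
        if map_similarity(prev, cur) > threshold
    )


# Toy single-head maps for five blocks: two distinct maps, then a
# collapsed (uniform) map repeated in the deeper blocks.
diag = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
layers = [diag, swap, flat, flat, flat]

print(count_similar_blocks(layers))
```

In this toy stack only the repeated uniform maps in the deepest blocks exceed the cutoff, which mirrors the pattern the review describes: diversity in early blocks and near-identical maps deeper in the network.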

This review was generated by AI and reviewed by human editors.