QUICK REVIEW

[Paper Review] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Yongming Rao, Wenliang Zhao | arXiv (Cornell University) | Jun 3, 2021

Advanced Neural Network Applications | References: 31 | Cited by: 310

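One-Sentence Summary

DynamicViT introduces hierarchical, input-dependent token pruning into vision Transformers, using a lightweight prediction module and attention masking to cut FLOPs substantially with almost no loss in accuracy.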

ABSTRACT

Attention is sparse in vision transformers. We observe the final prediction in vision transformers is only based on a subset of most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens are still hardware friendly, which makes our framework easy to achieve actual speed-up. By hierarchically pruning 66% of the input tokens, our method greatly reduces 31%~37% FLOPs and improves the throughput by over 40% while the drop of accuracy is within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT

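Motivation and Objectives

Accelerate vision Transformers by exploiting sparsity: only a subset of image patches is highly informative.
Propose a dynamic token sparsification framework that prunes tokens stage by stage in an input-dependent way.
Develop an end-to-end trainable prediction module that uses Gumbel-Softmax and attention masking to make pruning differentiable.
Demonstrate substantial FLOPs reductions and throughput gains across multiple backbone variants on ImageNet.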

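Proposed Method

Insert lightweight prediction modules at several Transformer blocks to estimate the importance of each token.
Compute local-global embeddings from the token features to predict a keep/drop probability for each token.
Sample a binary keep/drop mask with Gumbel-Softmax, which keeps training differentiable.
Apply attention masking in self-attention to remove interactions with pruned tokens, so that pruned tokens stop influencing the others while tensor shapes stay fixed during training.
Train with a combination of cross-entropy loss, a distillation loss against a teacher backbone, a KL-divergence loss, and a pruning loss that constrains the keep ratio.
At inference, each stage prunes a fixed number of tokens according to the learned scores to meet the target keep ratio.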
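The masking and selection steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the rule that a token always attends to itself is an assumption made so that no attention row becomes empty, and the rounding used for the inference-time token count is also an assumption. The straight-through gradient that makes Gumbel-Softmax trainable is not shown, because plain Python has no autograd.

```python
import math
import random

def gumbel_hard_sample(log_probs, rng):
    """Draw one hard categorical sample by adding Gumbel noise and taking argmax."""
    noisy = []
    for lp in log_probs:
        u = max(rng.random(), 1e-12)             # guard against log(0)
        noisy.append(lp - math.log(-math.log(u)))
    return max(range(len(noisy)), key=lambda j: noisy[j])

def masked_attention(scores, keep):
    """Row-wise softmax in which pruned tokens (keep[j] == 0) receive zero weight.

    scores: square list of lists with raw attention logits.
    keep:   list of 0/1 decisions, one per token.
    Assumption: token i always attends to itself, even if it is pruned.
    """
    out = []
    for i, row in enumerate(scores):
        m = max(row)                             # subtract max for numerical stability
        weights = [
            math.exp(s - m) * (1.0 if j == i else float(keep[j]))
            for j, s in enumerate(row)
        ]
        total = sum(weights)
        out.append([w / total for w in weights])
    return out

def topk_keep_indices(keep_scores, keep_ratio):
    """Inference-time selection: keep a fixed number of the highest-scoring tokens."""
    k = max(1, math.ceil(keep_ratio * len(keep_scores)))  # rounding is an assumption
    order = sorted(range(len(keep_scores)), key=lambda j: keep_scores[j], reverse=True)
    return sorted(order[:k])

# Usage: token 1 is pruned, so tokens 0 and 2 no longer attend to it.
attn = masked_attention([[0.0] * 3 for _ in range(3)], keep=[1, 0, 1])
kept = topk_keep_indices([0.9, 0.1, 0.5, 0.7], keep_ratio=0.5)
```

Masking during training keeps every sample in a batch at the same sequence length, while the top-k selection at inference physically drops tokens, which is where the FLOPs saving comes from.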

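Experimental Results

Research Questions

RQ1: Can vision Transformers be accelerated by pruning uninformative tokens without a significant loss in accuracy?
RQ2: How can a dynamic token pruning mechanism be trained end to end within the Transformer framework?
RQ3: How does hierarchical, input-dependent token pruning affect the efficiency and accuracy of different backbones?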

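Key Findings

Hierarchical token sparsification prunes up to 66% of the input tokens, reducing FLOPs by 31%–37% and raising throughput by more than 40%, with an accuracy drop within 0.5% across backbones.
On ImageNet, DynamicViT achieves competitive complexity/accuracy trade-offs compared with state-of-the-art CNNs and vision Transformers.
Dynamic pruning behaves sensibly: it keeps tokens near the image center and the target object, prunes peripheral regions, and remains interpretable as pruning progresses.
The method offers a viable alternative to model scaling: compared with scaling model width, dynamic token sparsification achieves similar or better efficiency.
Larger models (DeiT-B and 384x384 inputs) benefit markedly from DynamicViT, with large FLOPs reductions and only modest accuracy drops.
Ablation studies show that dynamic, learned pruning is more effective than static or random token removal.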
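The 66% figure can be reproduced with a short calculation. The sketch below assumes three pruning stages that each keep a fraction 0.7 of the remaining tokens, and 196 patch tokens for a 224x224 input with 16x16 patches; none of these values are stated in this review, so treat them as illustrative assumptions. The helper names are hypothetical.

```python
def remaining_fraction(keep_ratio, num_stages):
    """Fraction of the original tokens left after all pruning stages."""
    return keep_ratio ** num_stages

def tokens_per_stage(num_tokens, keep_ratio, num_stages):
    """Token count before pruning and after each stage (rounded down, an assumption)."""
    counts = [num_tokens]
    for stage in range(1, num_stages + 1):
        counts.append(int(num_tokens * keep_ratio ** stage))
    return counts

# Assumed setting: keep ratio 0.7 at each of 3 stages, 196 patch tokens.
pruned = 1.0 - remaining_fraction(0.7, 3)   # about 0.657, i.e. roughly 66% pruned
counts = tokens_per_stage(196, 0.7, 3)
```

Because early blocks still process every token, the FLOPs saving is smaller than the fraction of tokens removed by the last stage, which is consistent with a 31%–37% FLOPs reduction for 66% of tokens pruned.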


This review was generated by AI and reviewed by human editors.