QUICK REVIEW

[Paper Review] Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Tianlong Chen, Yu Cheng|arXiv (Cornell University)|Jun 8, 2021

Advanced Neural Network Applications|References: 95|Cited by: 85

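One-Sentence Summary

This paper proposes end-to-end sparse training for vision transformers (ViTs) that lowers training memory and inference cost while preserving accuracy, using dynamic sparse subnetworks, structured sparsity, and data-architecture co-sparsification.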

ABSTRACT

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes improve the ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture), improves 0.28% top-1 accuracy, and meanwhile enjoys 49.32% FLOPs and 4.40% running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.

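Motivation and Objectives

Motivate and realize end-to-end sparsity in ViTs to reduce training memory and inference cost.
Develop sparse ViT training methods that maintain or improve accuracy under a fixed parameter budget.
Extend sparsity from unstructured to structured forms for better hardware efficiency.
Jointly explore data sparsity and architecture sparsity for additional efficiency gains.
Demonstrate effectiveness on DeiT backbones on ImageNet, with substantial FLOPs and latency savings.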

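Proposed Method

Introduce Sparse Vision Transformer Exploration (SViTE), which trains sparse ViTs under a fixed parameter budget.
Extend SViTE to Structured Sparse Vision Transformer Exploration (S2ViTE), which achieves hardware-friendly sparsity by guiding the prune-and-grow of self-attention heads.
Add Sparse Vision Transformer Co-Exploration (SViTE+), which jointly selects informative input tokens and sparsifies the model.
Use a Taylor-expansion-based proxy to score attention-head importance and the L1 norm to score MLP neurons for pruning.
Introduce a learnable token selector that combines Gumbel-Softmax with the straight-through trick to select the top-k most informative patches, yielding data sparsity.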
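
The prune-and-grow idea under a fixed parameter budget can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prune_and_grow`, the flat-list representation, and the specific criteria (prune the smallest-magnitude active weights, grow the inactive positions with the largest gradient magnitude) are assumptions made for illustration only.

```python
def prune_and_grow(weights, grads, mask, update_fraction):
    """One connectivity update that keeps the number of active weights fixed.

    weights, grads, mask are flat lists of equal length; mask holds 0 or 1.
    Assumed criteria (illustrative, not taken from the paper's code):
    prune by smallest weight magnitude, grow by largest gradient magnitude.
    """
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    # Swap the same number of connections in and out so the budget is unchanged.
    k = min(int(update_fraction * len(active)), len(inactive))

    # Prune: the k active weights with the smallest magnitude.
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Grow: the k previously inactive positions with the largest gradient magnitude.
    to_grow = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]

    new_weights = list(weights)
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = 0
        new_weights[i] = 0.0
    for i in to_grow:
        new_mask[i] = 1
        new_weights[i] = 0.0  # regrown connections start from zero
    return new_weights, new_mask


# Tiny usage example: 3 active weights out of 6, swap about a third of them.
weights = [0.5, -0.1, 2.0, 0.0, 0.0, 0.0]
grads = [0.0, 0.0, 0.0, 0.3, -0.9, 0.1]
mask = [1, 1, 1, 0, 0, 0]
new_weights, new_mask = prune_and_grow(weights, grads, mask, 0.34)
print(new_mask, sum(new_mask) == sum(mask))
```

Because the same number of connections is removed and added, the count of active parameters never changes, which is the sense in which the budget stays fixed throughout training. A structured variant would apply the same swap to whole attention heads instead of individual weights.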

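Experimental Results

Research Questions

RQ1: Can end-to-end sparse training give ViTs substantial reductions in FLOPs and parameter count without sacrificing accuracy?
RQ2: Does structured sparsity (e.g., pruning attention heads) deliver more hardware-friendly gains than unstructured sparsity?
RQ3: Does co-exploring data sparsity (token selection) and architecture sparsity bring additional efficiency without hurting performance?
RQ4: Do sparse ViT models generalize better thanks to implicit regularization, possibly improving accuracy in certain sparsity ranges?
RQ5: How do the sparsification strategies perform on DeiT-Tiny/Small/Base backbones on ImageNet?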

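Key Findings

SViTE produces sparse DeiT models with substantial FLOPs reductions (e.g., 25.56%–57.50%, depending on backbone and sparsity level) at minimal accuracy loss (typically within 0.5%).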
SViTE+, which adds token selection, improves top-1 accuracy by 0.28% on DeiT-Small at 5% data sparsity and 50% architecture sparsity, while saving 49.32% of FLOPs and 4.40% of running time.
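S2ViTE, which uses structured sparsity, matches or exceeds the unstructured variant and can surpass the dense baseline while delivering substantial running-time reductions (e.g., up to a reported 24.70%).
Data sparsity acts as a regularizer; SViTE+-Small can drop up to 10% of tokens, with corresponding running-time and FLOPs savings and, in some cases, higher accuracy.
Structured sparsity (S2ViTE) outperforms SSP in several settings, and S2ViTE-Base at 40% structured sparsity exceeds dense DeiT-Base accuracy by up to 1.24% while cutting FLOPs by roughly 34%.
On DeiT-Tiny/Small/Base with ImageNet-1K, the proposed methods deliver consistent efficiency gains with competitive or improved accuracy, and even smaller sparse networks can outperform larger dense counterparts.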
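
The token reduction reported for SViTE+ depends on choosing which patches to keep. The sketch below shows only the forward sampling step of a Gumbel-perturbed top-k selection; the straight-through gradient needs an autograd framework and is omitted. The function name `select_tokens`, the `keep_ratio` and `temperature` parameters, and the per-patch score list are hypothetical, not the paper's API.

```python
import math
import random


def select_tokens(scores, keep_ratio, temperature=1.0, rng=random):
    """Return the sorted indices of patches kept by Gumbel-perturbed top-k.

    scores: one importance logit per patch (hypothetical output of a
    learnable scorer). Forward pass only; no gradients are computed here.
    """
    k = max(1, round(keep_ratio * len(scores)))
    perturbed = []
    for s in scores:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((s + gumbel) / temperature)
    # Keep the k patches with the highest perturbed scores.
    top = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)[:k]
    return sorted(top)


# Usage: keep 90% of 20 patches, i.e. 10% data sparsity.
random.seed(0)
kept = select_tokens([0.0] * 20, keep_ratio=0.9)
print(len(kept))  # 18 patches survive
```

The Gumbel noise makes the selection stochastic during training, so low-scoring patches are still occasionally sampled. At inference time, a deterministic top-k over the raw scores would be a natural choice.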


This review was generated by AI and reviewed by human editors.