QUICK REVIEW

[Paper Review] SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Zhenglun Kong, Peiyan Dong|arXiv (Cornell University)|Dec 27, 2021

Advanced Neural Network Applications | Cited by 27

One-Sentence Summary

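SPViT introduces a latency-aware soft token pruning framework for Vision Transformers that uses a multi-head token selector and token packaging to adapt to each image, delivering substantial latency reductions on edge devices and FPGAs with minimal accuracy loss.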

ABSTRACT

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures. Nevertheless, it stays ambiguous on how to perform exclusive pruning on the ViT structure. Considering three key points: the structural characteristics, the internal data pattern of ViTs, and the related edge device deployment, we leverage the input token sparsity and propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flatten and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique, which integrates the less informative tokens generated by the selector module into a package token that will participate in subsequent calculations rather than being completely discarded. Our framework is bound to the trade-off between accuracy and computation constraints of specific edge devices through our proposed computation-aware training strategy. Experimental results show that our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification. Moreover, our framework can guarantee the identified model to meet resource specifications of mobile devices and FPGA, and even achieve the real-time execution of DeiT-T on mobile platforms. For example, our method reduces the latency of DeiT-T to 26 ms (26%–41% superior to existing works) on the mobile device with 0.25%–4% higher top-1 accuracy on ImageNet.

Motivation and Objectives

Address the high computation cost of Vision Transformers, which hinders edge-device and real-time deployment.
Propose a latency-aware soft token pruning framework (SPViT) that enables per-image adaptive pruning.
Develop an attention-based multi-head token selector and a token packaging technique to preserve information from pruned tokens.
Introduce a latency-aware training strategy to meet hardware latency constraints across devices.
Demonstrate real-time edge deployment of ViT models on mobile devices and FPGA with meaningful latency-accuracy trade-offs.

Proposed Method

Insert a lightweight, multi-head token selector across ViT blocks to score token importance per attention head.
Apply soft token pruning by packaging less informative tokens into a package token rather than discarding them, preserving contextual information.
Aggregate head-wise token scores through an attention-based branch and use Gumbel-Softmax for differentiable keep/prune decisions.
Introduce a latency-aware sparsity loss that constrains per-block pruning rates to hardware latency budgets via a latency-sparsity lookup table.
Use layer-to-phase progressive training to determine insertion points and pruning rates, balancing accuracy and hardware latency.
Deploy SPViT on mobile (Samsung Galaxy S20) and FPGA (Xilinx ZCU102) to demonstrate real-time inference and compute-accuracy trade-offs.
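
The selection and packaging steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `gumbel_softmax_keep` and `package_tokens`, the two-logit (prune, keep) layout per token, and the score-weighted average used to build the package token are all assumptions made for this example. A real implementation would operate on tensors in a deep learning framework and would typically use a straight-through estimator so that gradients pass through the hard decisions.

```python
import math
import random


def gumbel_softmax_keep(keep_logits, tau=1.0, rng=None):
    """Sample a hard keep (1) / prune (0) decision per token.

    keep_logits: list of (prune_logit, keep_logit) pairs, one per token
                 (this two-logit layout is an assumption of the sketch).
    Returns (hard, soft): 0/1 decisions and the soft keep probabilities.
    """
    rng = rng or random.Random()
    hard, soft = [], []
    for prune_logit, keep_logit in keep_logits:
        noisy = []
        for logit in (prune_logit, keep_logit):
            # Gumbel(0, 1) noise is -log(-log(u)) for u in (0, 1).
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            noisy.append((logit - math.log(-math.log(u))) / tau)
        # Numerically stable two-way softmax over the noisy logits.
        m = max(noisy)
        exps = [math.exp(v - m) for v in noisy]
        p_keep = exps[1] / (exps[0] + exps[1])
        soft.append(p_keep)
        hard.append(1 if p_keep >= 0.5 else 0)
    return hard, soft


def package_tokens(tokens, keep_mask, scores):
    """Keep the selected tokens and fold the rest into one package token.

    tokens: list of equal-length float vectors.
    keep_mask: 0/1 decision per token.
    scores: non-negative importance score per token, used as weights
            (weighting by score is an assumption of the sketch).
    """
    kept = [t for t, k in zip(tokens, keep_mask) if k]
    pruned = [(t, s) for t, k, s in zip(tokens, keep_mask, scores) if not k]
    if not pruned:
        return kept
    total = sum(s for _, s in pruned)
    if total <= 0:
        # Fall back to a plain mean when every weight is zero.
        weights = [1.0 / len(pruned)] * len(pruned)
    else:
        weights = [s / total for _, s in pruned]
    dim = len(pruned[0][0])
    package = [
        sum(w * t[d] for (t, _), w in zip(pruned, weights)) for d in range(dim)
    ]
    # The package token is appended so later blocks can still attend to it.
    return kept + [package]
```

Calling `gumbel_softmax_keep` on the selector's logits and passing the resulting mask to `package_tokens` yields a shorter token sequence whose last entry summarizes the pruned tokens, so their information remains available to later blocks.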

Experimental Results

Research Questions

RQ1: How can token pruning be made latency-aware to meet device-specific constraints while maintaining accuracy in ViTs?
RQ2: Can a soft token pruning approach with token packaging outperform hard pruning and other pruning strategies on edge devices?
RQ3: What is the impact of inserting token selectors at different ViT blocks on accuracy and latency?
RQ4: How does SPViT perform on lightweight hierarchical ViTs (e.g., Swin, PiT) and on edge hardware?

Key Findings

SPViT reduces ViT computation by 31%–43% across backbones with 0.1%–0.5% accuracy loss.
On DeiT-T, SPViT achieves 26 ms latency on mobile and up to 40%–60% latency reduction for other models with negligible accuracy loss.
SPViT enables real-time ViT inference on mobile devices (≤33 ms per image) for DeiT-T, and achieves substantial latency reductions on FPGA with fixed-point implementation.
Token packaging preserves information from pruned tokens and helps maintain accuracy while increasing pruning rate.
SPViT outperforms several state-of-the-art pruning methods in accuracy-latency trade-offs for both flat and hierarchical ViTs.
Latency-aware deployment results show notable improvements on Samsung Galaxy S20 and Xilinx ZCU102 hardware.
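
To make the real-time threshold concrete, the short sketch below converts per-image latency to frames per second and checks it against the 33 ms budget quoted above. It also shows one plausible way a latency-sparsity lookup table could be queried. The helper `largest_keep_ratio_within_budget` and the table layout (keep ratio mapped to measured latency in milliseconds) are hypothetical. The source does not spell out how the table enters the training loss, so this covers only the lookup step.

```python
def frames_per_second(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def is_real_time(latency_ms, budget_ms=33.0):
    """True when one image is processed within the real-time budget."""
    return latency_ms <= budget_ms


def largest_keep_ratio_within_budget(latency_table, budget_ms):
    """Pick the largest token keep ratio whose latency fits the budget.

    latency_table: dict mapping keep ratio in (0, 1] to a latency in ms
                   measured on the target device (hypothetical layout).
    Returns None when no entry fits the budget.
    """
    feasible = [ratio for ratio, ms in latency_table.items() if ms <= budget_ms]
    return max(feasible) if feasible else None


# The 26 ms DeiT-T latency reported above, expressed as a frame rate.
print(round(frames_per_second(26.0), 1))  # 38.5
print(is_real_time(26.0))  # True
```

A 33 ms budget corresponds to roughly 30 frames per second, so the reported 26 ms result fits within it.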


This review was generated by AI and reviewed by human editors.