[Paper Review] SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
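SPViT brings a latency-aware soft token pruning framework to Vision Transformers, built on a multi-head token selector that adapts to each image and on token packaging. It achieves meaningful latency reductions on edge devices and FPGAs with minimal accuracy loss.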
Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures. Nevertheless, it stays ambiguous on how to perform exclusive pruning on the ViT structure. Considering three key points: the structural characteristics, the internal data pattern of ViTs, and the related edge device deployment, we leverage the input token sparsity and propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flatten and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique, which integrates the less informative tokens generated by the selector module into a package token that will participate in subsequent calculations rather than being completely discarded. Our framework is bound to the trade-off between accuracy and computation constraints of specific edge devices through our proposed computation-aware training strategy. Experimental results show that our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification. Moreover, our framework can guarantee the identified model to meet resource specifications of mobile devices and FPGA, and even achieve the real-time execution of DeiT-T on mobile platforms. For example, our method reduces the latency of DeiT-T to 26 ms (26%–41% superior to existing works) on the mobile device with 0.25%–4% higher top-1 accuracy on ImageNet.
Motivation and Objectives
- Address the high computation cost of Vision Transformers, which hinders deployment on edge devices and in real-time settings.
- Propose a latency-aware soft token pruning framework (SPViT) that enables per-image adaptive pruning.
- Develop an attention-based multi-head token selector and a token packaging technique to preserve information from pruned tokens.
- Introduce a latency-aware training strategy to meet hardware latency constraints across devices.
- Demonstrate real-time edge deployment of ViT models on mobile devices and FPGA with meaningful latency-accuracy trade-offs.
Proposed Method
- Insert a lightweight, multi-head token selector across ViT blocks to score token importance per attention head.
- Apply soft token pruning by packaging less informative tokens into a package token rather than discarding them, preserving contextual information.
- Aggregate head-wise token scores through an attention-based branch and use Gumbel-Softmax for differentiable keep/prune decisions.
- Introduce a latency-aware sparsity loss that constrains per-block pruning rates to hardware latency budgets via a latency-sparsity lookup table.
- Use layer-to-phase progressive training to determine insertion points and pruning rates, balancing accuracy and hardware latency.
- Deploy SPViT on mobile (Samsung Galaxy S20) and FPGA (Xilinx ZCU102) to demonstrate real-time inference and compute-accuracy trade-offs.
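The selection and packaging steps listed above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names `gumbel_softmax_keep` and `package_tokens`, the two-logit keep/prune formulation, and the score-weighted average used to build the package token are all assumptions made for illustration. In an actual training setup the relaxed probability would be backpropagated through an autograd framework; plain Python is used here only to show the data flow.

```python
import math
import random


def gumbel_softmax_keep(keep_logit, prune_logit, rng, tau=1.0):
    """Sample a hard keep/prune decision from two logits using Gumbel noise.

    Returns (decision, keep_prob): decision is 1 for keep and 0 for prune;
    keep_prob is the relaxed (soft) probability of keeping the token.
    """
    def gumbel():
        # Gumbel(0, 1) sample; clamp u away from 0 to avoid log(0).
        u = max(rng.random(), 1e-12)
        return -math.log(-math.log(u))

    k = (keep_logit + gumbel()) / tau
    p = (prune_logit + gumbel()) / tau
    m = max(k, p)  # subtract the max for numerical stability
    ek, ep = math.exp(k - m), math.exp(p - m)
    keep_prob = ek / (ek + ep)
    return (1 if keep_prob >= 0.5 else 0), keep_prob


def package_tokens(tokens, scores, decisions):
    """Keep the selected tokens and merge the pruned ones into one package token.

    tokens:    list of equal-length feature vectors (lists of floats)
    scores:    per-token importance scores (non-negative floats)
    decisions: 1 = keep, 0 = prune
    Assumption: the package token is the score-weighted average of pruned tokens.
    """
    kept = [t for t, d in zip(tokens, decisions) if d == 1]
    pruned = [(t, s) for t, s, d in zip(tokens, scores, decisions) if d == 0]
    if not pruned:
        return kept  # nothing was pruned, so no package token is needed
    total = sum(s for _, s in pruned)
    if total > 0:
        weights = [s / total for _, s in pruned]
    else:
        weights = [1.0 / len(pruned)] * len(pruned)  # fall back to a plain mean
    dim = len(tokens[0])
    package = [
        sum(w * t[i] for (t, _), w in zip(pruned, weights)) for i in range(dim)
    ]
    return kept + [package]


# Toy example: four 2-dimensional tokens, the last two are pruned.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [9.0, 8.0, 1.0, 3.0]
decisions = [1, 1, 0, 0]
packed = package_tokens(tokens, scores, decisions)
# packed holds the two kept tokens followed by the package token [3.5, 0.5]
```

The point of the sketch is that the sequence shrinks from four tokens to three, yet the pruned tokens still contribute to later blocks through the single package token instead of being dropped.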
Experimental Results
Research Questions
- RQ1: How can token pruning be made latency-aware to meet device-specific constraints while maintaining accuracy in ViTs?
- RQ2: Can a soft token pruning approach with token packaging outperform hard pruning and other pruning strategies on edge devices?
- RQ3: What is the impact of inserting token selectors at different ViT blocks on accuracy and latency?
- RQ4: How does SPViT perform on lightweight hierarchical ViTs (e.g., Swin, PiT) and on edge hardware?
Key Results
- SPViT reduces ViT computation by 31%–43% across backbones with 0.1%–0.5% accuracy loss.
- SPViT brings DeiT-T latency down to 26 ms on mobile and cuts latency by up to 40%–60% on other models, with negligible accuracy loss.
- SPViT enables real-time ViT inference on mobile devices (≤33 ms per image) for DeiT-T, and achieves substantial latency reductions on FPGA with fixed-point implementation.
- Token packaging preserves information from pruned tokens and helps maintain accuracy while increasing pruning rate.
- SPViT outperforms several state-of-the-art pruning methods in accuracy-latency trade-offs for both flat and hierarchical ViTs.
- Latency-aware deployment results show notable improvements on Samsung Galaxy S20 and Xilinx ZCU102 hardware.
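To make the link between a real-time target and a pruning rate concrete, here is a rough sketch of how a latency-sparsity lookup table could be queried. Everything in it is hypothetical: the table values are placeholders rather than measurements from the paper, the even split of the frame budget across 12 blocks is a simplifying assumption, and the squared-error form of `sparsity_loss` is an assumed stand-in for the latency-aware sparsity loss described above.

```python
# Hypothetical latency-sparsity lookup table for ONE block on ONE device:
# keep ratio -> latency in ms. Placeholder numbers, not results from the paper.
LATENCY_TABLE_MS = {1.0: 3.0, 0.9: 2.8, 0.8: 2.5, 0.7: 2.2, 0.6: 1.9, 0.5: 1.6}


def fps_to_budget_ms(fps):
    """Convert a frame-rate target into a per-image latency budget in ms."""
    return 1000.0 / fps


def largest_keep_ratio(table, per_block_budget_ms):
    """Return the highest keep ratio whose latency fits the budget, or None."""
    feasible = [ratio for ratio, ms in table.items() if ms <= per_block_budget_ms]
    return max(feasible) if feasible else None


def sparsity_loss(keep_decisions, target_ratio):
    """Squared gap between the realized keep ratio and the target keep ratio."""
    realized = sum(keep_decisions) / len(keep_decisions)
    return (realized - target_ratio) ** 2


# 30 frames per second corresponds to roughly 33.3 ms per image.
budget_ms = fps_to_budget_ms(30)
# Assumption: the budget is split evenly over 12 transformer blocks.
per_block_ms = budget_ms / 12
target = largest_keep_ratio(LATENCY_TABLE_MS, per_block_ms)
# During training, the selector's decisions would be pushed toward `target`.
loss = sparsity_loss([1, 1, 1, 1], 0.5)
```

With these placeholder numbers the per-block budget is about 2.78 ms, so a keep ratio of 0.9 (2.8 ms) just misses it and 0.8 is the largest ratio that fits. A real deployment would measure the table on the target device and would also account for the patch embedding and classifier head.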
This review was generated by AI and reviewed by a human editor.