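[Paper Explained] Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
EViT reorganizes the tokens of a Vision Transformer by identifying the attentive tokens during training and fusing the less attentive ones. This speeds up inference without adding parameters, and gives either higher efficiency or higher accuracy at the same cost.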
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA. Examples include that tokens containing semantically meaningless or distractive image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between MHSA and FFN (i.e., feed-forward network) modules, which is guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. To this end, our method EViT improves ViTs from two perspectives. First, under the same amount of input image tokens, our method reduces MHSA and FFN computation for efficient inference. For instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy is decreased by only 0.3% for ImageNet classification. Second, by maintaining the same computational cost, our method empowers ViTs to take more image tokens as input for recognition accuracy improvement, where the image tokens are from higher resolution images. An example is that we improve the recognition accuracy of DeiT-S by 1% for ImageNet classification at the same computational cost of a vanilla DeiT-S. Meanwhile, our method does not introduce more parameters to ViTs. Experiments on the standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit
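Research Motivation and Objectives
- Accelerate Vision Transformers (ViTs) by identifying token-level redundancy in multi-head self-attention (MHSA).
- Propose token reorganization during training that preserves the attentive tokens and fuses the unimportant ones.
- Show that EViT reduces inference computation (both MHSA and FFN) without adding parameters.
- Improve accuracy under the same compute budget by admitting more tokens (higher-resolution inputs).
- Examine the effect of using an oracle to guide token relevance, and compare against existing acceleration methods.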
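Proposed Method
- Compute the class token's attention to each image token, averaged over the MHSA heads.
- Keep the top-k most attentive tokens and fuse the inattentive tokens into a single fused token.
- Fuse the inattentive tokens by a weighted average that uses their attention values as weights (x_fused = sum_{i in N} a_i x_i, where N indexes the inattentive tokens).
- Incorporate token reorganization into ViT training at selected layers, with the keep rate following a cosine schedule.
- Optionally train with an oracle ViT to identify the important tokens, and initialize EViT with the oracle's weights.
- Demonstrate higher-resolution training by feeding more tokens at the same computational cost, validated with ImageNet experiments.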
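The selection and fusion steps above can be sketched in a few lines of NumPy. This is a minimal illustration for a single image, not the authors' implementation: the function names `reorganize_tokens` and `cosine_keep_rate`, the array shapes, and the rounding of k are assumptions made here. The fused token follows the formula in the list (an attention-weighted sum with no renormalization). The summary only says the keep rate uses a cosine schedule, so the direction chosen below (decaying from 1.0 to a target rate) is also an assumption.

```python
import math
import numpy as np


def reorganize_tokens(tokens, cls_attn, keep_rate):
    """Keep the most attentive image tokens and fuse the rest into one token.

    tokens:    (N, D) image token embeddings (class token excluded).
    cls_attn:  (H, N) attention from the class token to each image token, per head.
    keep_rate: fraction of image tokens to keep, in (0, 1].
    Returns the reorganized tokens and the indices of the kept tokens.
    """
    n = tokens.shape[0]
    k = max(1, math.ceil(keep_rate * n))         # rounding rule is an assumption
    scores = cls_attn.mean(axis=0)               # average over heads -> (N,)
    order = np.argsort(-scores, kind="stable")   # most attentive first
    keep_idx, drop_idx = order[:k], order[k:]
    kept = tokens[keep_idx]
    if drop_idx.size == 0:                       # nothing to fuse
        return kept, keep_idx
    # x_fused = sum_{i in N} a_i * x_i over the inattentive tokens
    fused = (scores[drop_idx, None] * tokens[drop_idx]).sum(axis=0, keepdims=True)
    return np.concatenate([kept, fused], axis=0), keep_idx


def cosine_keep_rate(step, total_steps, target_rate):
    """Hypothetical schedule: decay the keep rate from 1.0 to target_rate on a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return target_rate + (1.0 - target_rate) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Toy usage: 4 image tokens of dimension 2, 2 attention heads, keep half of them.
tokens = np.arange(8.0).reshape(4, 2)
cls_attn = np.array([[0.1, 0.4, 0.2, 0.3],
                     [0.3, 0.4, 0.0, 0.3]])
out, keep_idx = reorganize_tokens(tokens, cls_attn, keep_rate=0.5)
```

In the toy usage, the two kept tokens are followed by one fused token, so the layers after the reorganization step process 3 image tokens instead of 4. In a real ViT the class token would be carried through unchanged alongside them.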
Experimental Results
Research Questions
- RQ1: Can token reorganization during ViT training reduce inference cost while maintaining accuracy?
- RQ2: Does fusing inattentive tokens preserve more information and stabilize training compared to simple token removal?
- RQ3: How does EViT perform under fixed compute and when given higher input resolutions?
- RQ4: What is the impact of using an oracle ViT to guide token selection on accuracy and efficiency?
Key Findings
- EViT can speed up DeiT-S inference by about 50% with only ~0.3% accuracy loss on ImageNet.
- EViT achieves higher throughput at the same MACs and can maintain or improve accuracy when using higher-resolution inputs (e.g., DeiT-S gains 1% top-1 at the same compute).
- Inattentive token fusion helps preserve information and improves training stability and accuracy over token pruning alone.
- Training with an oracle further improves accuracy (e.g., DeiT-S from 79.8% to 80.7% in an oracle setup) while maintaining or reducing compute.
- Compared with DynamicViT, EViT delivers better accuracy at the same compute with fewer parameters and shows continued gains with longer training.
- EViT can be applied to different ViT variants (DeiT and LV-ViT) and yields favorable accuracy-throughput trade-offs across settings.
This explanation was generated by AI and reviewed by human editors.