QUICK REVIEW

[Paper Review] LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Ben Graham, Alaaeldin El-Nouby | arXiv (Cornell University) | Apr 2, 2021

Advanced Neural Network Applications | References: 66 | Cited by: 91

One-sentence summary

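LeViT proposes a pyramid-structured Vision Transformer whose blocks, although wider than DeiT's (C=256 versus C=192), remain competitive in runtime, helped by design choices such as a reduced MLP expansion factor. The supplementary material provides detailed block-level timings and attention-bias visualizations.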

ABSTRACT

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT

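Motivation and objectives

Push Vision Transformers toward faster inference by redesigning the block structure and adopting a pyramid structure.
Characterize the runtime of LeViT blocks against DeiT blocks at comparable resolution and compute budget.
Study how the pyramid structure and block width affect overall efficiency.
Provide ablation studies and visualizations that explain attention behavior in LeViT blocks.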

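Proposed method

Compare the DeiT-tiny and LeViT-256 block designs at 14x14 resolution and measure their runtimes side by side.
Analyze the contributions of LayerNorm, Q/K, V, QK^T, AV, the attention projection, and the MLP to total runtime.
Present ablations that remove the pyramid structure and widen or otherwise adjust the blocks, to understand where the efficiency gains come from.
Visualize the attention-bias maps to explain how different heads attend to relative pixel positions.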
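
To make the attention-bias idea concrete, here is a minimal single-head sketch in NumPy. It assumes the bias is a learned table indexed by the absolute row and column offset between two grid positions, added to the QK^T logits before the softmax; the function name, argument names, and the 1/sqrt(D) scaling are illustrative choices, not taken from the released code.

```python
import numpy as np

def attention_with_bias(q, k, v, bias_table, height, width):
    """Single-head attention over a height x width grid with a positional bias.

    q, k: (N, D) arrays and v: (N, Dv) array, where N = height * width.
    bias_table: (height, width) array; entry [dy, dx] is the value added to
    the logit of any token pair that is dy rows and dx columns apart
    (assumed form of the attention bias, one table per head).
    """
    # Row and column index of every token in row-major order.
    ys, xs = np.divmod(np.arange(height * width), width)
    dy = np.abs(ys[:, None] - ys[None, :])   # (N, N) absolute row offsets
    dx = np.abs(xs[:, None] - xs[None, :])   # (N, N) absolute column offsets
    bias = bias_table[dy, dx]                # (N, N) bias looked up per pair

    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage on a hypothetical 14x14 grid with random inputs and a random table.
rng = np.random.default_rng(0)
H = W = 14
q = rng.normal(size=(H * W, 32))
k = rng.normal(size=(H * W, 32))
v = rng.normal(size=(H * W, 64))
table = rng.normal(size=(H, W))
out, attn = attention_with_bias(q, k, v, table, H, W)
```

Plotting one row of `bias` reshaped to the grid is one way to obtain the kind of per-head map described in the last item above: a table with large values near offset (0, 0) corresponds to a head that favors neighboring pixels.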

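Experimental results

Research questions

RQ1: Does LeViT, with its pyramid, convnet-inspired design, achieve inference as fast as or faster than DeiT?
RQ2: How do the pyramid structure and block width affect the runtime components and overall efficiency?
RQ3: How do the reduced MLP expansion factor and the attention computation affect speed?
RQ4: Do attention-bias visualizations reveal head specialization and information flow in LeViT blocks?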

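Key findings

The total runtime of a LeViT-256 block is close to that of a DeiT-tiny block: under the same benchmark setup, about 2365 μs for LeViT versus 2474 μs for DeiT-tiny.
Despite its wider blocks (C=256 versus C=192), LeViT spends less time on QK^T, but more time on the subsequent AV product.
LeViT halves the MLP expansion factor from four to two, which lowers MLP runtime and offsets part of the cost of the extra width.
Attention-bias visualizations show that some heads attend to neighboring pixels, while others exhibit uniform or directional patterns across stages, indicating diverse attention strategies.
Ablation studies show how removing the pyramid structure or widening the blocks affects overall performance and FLOP count.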
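
The trade between block width and MLP expansion noted in the findings above can be checked with a short back-of-the-envelope count. The sketch below assumes a standard two-layer MLP (C to e*C and back to C) and counts multiply-accumulates per token, ignoring biases and activations; the helper name is hypothetical, and only the widths (192, 256) and expansion factors (4, 2) come from the text.

```python
def mlp_macs_per_token(channels: int, expansion: int) -> int:
    """Multiply-accumulates per token for a two-layer MLP: C -> e*C -> C."""
    hidden = channels * expansion
    return channels * hidden + hidden * channels

deit_tiny = mlp_macs_per_token(192, 4)      # 294912
levit_256 = mlp_macs_per_token(256, 2)      # 262144
levit_if_e4 = mlp_macs_per_token(256, 4)    # 524288, hypothetical variant

print(levit_256 / deit_tiny)     # about 0.89: wider block, yet a cheaper MLP
print(levit_if_e4 / deit_tiny)   # about 1.78: cost if the expansion stayed at 4
```

Under these assumptions the MLP cost scales as 2·e·C², so halving e more than compensates for the increase in width from 192 to 256. Operation counts are only a proxy; the measured runtimes reported above also depend on hardware and memory access patterns.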

This review was generated by AI and reviewed by human editors.