QUICK REVIEW

[Paper Review] FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Pavan Kumar Anasosalu Vasu, James Gabriel|arXiv (Cornell University)|Mar 24, 2023

Advanced Neural Network Applications | Cited by 45

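ONE-SENTENCE SUMMARY

FastViT introduces RepMixer, a reparameterizable token mixer for fast hybrid vision transformers, achieving a superior accuracy-latency trade-off on mobile and GPU platforms across a range of vision tasks.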

ABSTRACT

The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that - our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks -- image classification, detection, segmentation and 3D mesh regression with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at https://github.com/apple/ml-fastvit.

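MOTIVATION AND GOALS

Advance efficient vision models that balance accuracy and latency on both mobile devices and desktop GPUs.
Develop a hybrid architecture that combines the strengths of convolutions and transformers.
Reduce memory access cost by using structural reparameterization to remove skip connections.
Increase model capacity through train-time overparameterization and large-kernel convolutions with minimal effect on latency.
Demonstrate robustness and generalization across classification, detection, segmentation, and 3D hand mesh estimation.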

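PROPOSED METHOD

Introduce RepMixer, a token mixer whose skip connection is removed at inference time through structural reparameterization.
Replace dense k×k convolutions with a factorized version, a depthwise convolution followed by a pointwise convolution, and add linear train-time overparameterization.
Use large-kernel convolutions in the FFN and patch embedding layers as a substitute for self-attention in the early stages.
Use conditional positional encodings generated by a depthwise convolution.
Apply train-time overparameterization in the stem, patch embedding, and projection layers to increase capacity.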
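The core idea behind removing a skip connection by reparameterization can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single channel, a 1D signal, zero padding, and no normalization layer, and all function names are hypothetical. Under those assumptions, `x + conv(x)` equals one convolution whose centre tap has been increased by 1, so the inference-time block needs only one pass over the input.

```python
from typing import List


def depthwise_conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Single-channel 1D cross-correlation with zero padding (same output length)."""
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel size keeps the output aligned with the input"
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(x))
    ]


def train_time_mixer(x: List[float], kernel: List[float]) -> List[float]:
    """Train-time form (illustrative): depthwise conv plus a skip connection."""
    conv_out = depthwise_conv1d(x, kernel)
    return [xi + ci for xi, ci in zip(x, conv_out)]


def fold_skip_into_kernel(kernel: List[float]) -> List[float]:
    """Fold the identity branch into the kernel by adding 1 to the centre tap."""
    folded = list(kernel)
    folded[len(kernel) // 2] += 1.0
    return folded


def inference_time_mixer(x: List[float], kernel: List[float]) -> List[float]:
    """Inference-time form: a single depthwise conv with no skip connection."""
    return depthwise_conv1d(x, fold_skip_into_kernel(kernel))


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 3.5, -0.25]
    kernel = [0.1, -0.3, 0.2]
    a = train_time_mixer(x, kernel)
    b = inference_time_mixer(x, kernel)
    # Both forms agree up to floating-point rounding.
    assert all(abs(p - q) < 1e-9 for p, q in zip(a, b))
```

In a real network the same folding would be applied per channel to a 2D depthwise kernel, and any batch-normalization statistics would have to be folded into the kernel and bias first; those steps are omitted here.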

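EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a reparameterizable token mixer reduce memory access cost and latency without hurting accuracy?
RQ2: Does linear train-time overparameterization improve accuracy in a factorized convolution design?
RQ3: In a hybrid architecture, do large-kernel convolutions in the early stages provide a latency-friendly accuracy gain compared with self-attention?
RQ4: Under real-world latency constraints, how does FastViT perform on image classification, detection, segmentation, and 3D hand mesh estimation?
RQ5: Is the model more robust than competing architectures, including on out-of-distribution inputs?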

KEY FINDINGS

FastViT achieves a superior latency-accuracy trade-off on a mobile device (iPhone 12 Pro) and on a desktop GPU (RTX-2080Ti) while maintaining competitive accuracy.
FastViT-MA36 reaches 83.9% Top-1 on ImageNet-1k; at the same accuracy it is 4.9× faster than EfficientNet-B5 and 1.9× faster than ConvNeXt-B on a mobile device, and 1.6× faster than EfficientNetV2-S on GPU.
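At similar accuracy, FastViT-S12 is 26.3% faster than MobileOne-S4 on iPhone and 26.9% faster on GPU; FastViT-MA36 matches or exceeds several state-of-the-art models with fewer parameters and FLOPs.
RepMixer lowers memory access cost by removing skip connections, and the latency reduction is most pronounced at higher input resolutions (e.g., 384×384 and 1024×1024).
Train-time overparameterization in the stem, patch embedding, and projection layers improves accuracy (up to 0.9% Top-1 on ImageNet) with a moderate increase in training time.
Large-kernel convolutions in the FFN and patch embedding layers improve robustness and accuracy with a modest latency impact.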
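Train-time overparameterization can raise accuracy without an inference-time cost because parallel linear branches can be merged into one after training. The sketch below is a simplified illustration rather than the paper's code: it assumes single-channel 1D convolutions, branches with identical kernel sizes, and no normalization or nonlinearity inside the branches, and all function names are hypothetical. Under those assumptions, summing the branch outputs is the same as convolving once with the sum of the branch kernels.

```python
from typing import List


def conv1d_same(x: List[float], kernel: List[float]) -> List[float]:
    """Single-channel 1D cross-correlation with zero padding (same output length)."""
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel size keeps the output aligned with the input"
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(x))
    ]


def overparameterized_forward(
    x: List[float], branch_kernels: List[List[float]]
) -> List[float]:
    """Train-time form (illustrative): sum the outputs of parallel linear branches."""
    outputs = [conv1d_same(x, kernel) for kernel in branch_kernels]
    return [sum(values) for values in zip(*outputs)]


def merge_branches(branch_kernels: List[List[float]]) -> List[float]:
    """Merge the branches by summing their kernels tap by tap."""
    sizes = {len(kernel) for kernel in branch_kernels}
    assert len(sizes) == 1, "this sketch assumes all branches share one kernel size"
    return [sum(taps) for taps in zip(*branch_kernels)]


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 4.0, 3.0]
    branches = [[0.2, 0.1, -0.4], [0.05, 0.3, 0.0], [-0.1, 0.25, 0.15]]
    train_out = overparameterized_forward(x, branches)
    infer_out = conv1d_same(x, merge_branches(branches))
    # Three train-time branches collapse to one inference-time convolution.
    assert all(abs(p - q) < 1e-9 for p, q in zip(train_out, infer_out))
```

The merged model has the parameter count and latency of a single branch, which is consistent with the summary's point that the extra cost is paid in training time only.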

This review was generated by AI and reviewed by a human editor.