QUICK REVIEW

[Paper Review] Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Jiashi Li, Xin Xia|arXiv (Cornell University)|Jul 12, 2022

Advanced Neural Network Applications | Cited by 138

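One-Sentence Summary

Next-ViT introduces the Next Convolution Block (NCB), the Next Transformer Block (NTB), and the Next Hybrid Strategy (NHS) to deliver CNN-like latency while matching ViT accuracy, outperforming existing models on industrial deployment platforms such as TensorRT and CoreML.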

ABSTRACT

Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) can not perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerful as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far away from satisfactory. To end these, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are made public at: https://github.com/bytedance/Next-ViT

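Motivation and Objectives

Establish the need for fast, deployment-friendly vision Transformers in industrial scenarios (TensorRT/CoreML).
Design blocks that efficiently combine local information (NCB) and global information (NTB).
Propose a hybrid stacking strategy (NHS) that balances Transformer and convolution blocks within each stage.
Demonstrate a better latency/accuracy trade-off on downstream tasks than CNNs, ViTs, and CNN-Transformer hybrid models.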

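Proposed Method

Develop the Next Convolution Block (NCB), which uses Multi-Head Convolutional Attention (MHCA) as a deployment-friendly token mixer.
Develop the Next Transformer Block (NTB), which combines Efficient Multi-Head Self Attention (E-MHSA) with MHCA to capture multi-frequency signals.
Introduce the Next Hybrid Strategy (NHS), which stacks NCB and NTB within each stage as (NCB×N + NTB×1) and repeats that unit (×L) to improve performance at a fixed latency.
Use BatchNorm and ReLU in place of LayerNorm/GELU to speed up inference on hardware platforms such as TensorRT/CoreML.
Provide three Next-ViT variants (S/B/L) with specific stage configurations and channel settings (Table 3).
Train and evaluate on ImageNet-1K classification, and evaluate downstream tasks (COCO detection, ADE20K segmentation) under hardware-aware latency.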
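
The (NCB×N + NTB×1)×L stacking pattern can be sketched as a small helper that lists the block order within one stage. This is a minimal illustration only: the function name `nhs_stage_layout` and the values of N and L used below are assumptions made for the example, not the per-stage settings from the paper's Table 3, which this review does not reproduce.

```python
from typing import List


def nhs_stage_layout(n_ncb: int, repeats: int) -> List[str]:
    """Return the block order for one stage under (NCB x N + NTB x 1) x L.

    n_ncb   -- N, the number of convolution blocks before each Transformer block
    repeats -- L, how many times the (NCB x N + NTB x 1) unit is repeated
    Both names are hypothetical; they only mirror the notation in the text.
    """
    if n_ncb < 0 or repeats < 1:
        raise ValueError("n_ncb must be >= 0 and repeats must be >= 1")
    unit = ["NCB"] * n_ncb + ["NTB"]  # one unit: N conv blocks, then one NTB
    return unit * repeats             # repeat the whole unit L times


# Illustrative values only (not taken from the paper).
layout = nhs_stage_layout(n_ncb=4, repeats=2)
print(" -> ".join(layout))
```

Under this pattern each stage contains exactly L Transformer blocks no matter how large N is, so increasing depth mostly adds convolution blocks rather than attention blocks.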

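Experimental Results

Research Questions

RQ1: Can a vision Transformer be designed that infers as fast as CNNs in realistic industrial deployment while retaining ViT-level accuracy?
RQ2: Do the deployment-friendly blocks (NCB and NTB) and the hybrid strategy (NHS) improve the latency/accuracy trade-off across classification, detection, and segmentation tasks?
RQ3: Under TensorRT/CoreML constraints, how do different per-stage stacking patterns (combinations of NCB and NTB) affect throughput and task performance?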

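Key Findings

Next-ViT achieves the best latency/accuracy trade-off among comparable models on ImageNet-1K classification.
On TensorRT, under similar latency, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and by 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation.
Next-ViT achieves performance comparable to CSWin while running inference 3.6× faster.
On CoreML, under similar latency, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and by 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation.
Results are reported for the Next-ViT-S/B/L variants, together with hardware-aware latency measurements (TensorRT/CoreML).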
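
As a quick consistency check, the reported gains can be recomputed from the baseline and Next-ViT scores quoted above. The snippet below is only a sketch: the dictionary layout and the helper name `compute_gains` are assumptions for illustration, and the numbers are copied from this review rather than measured.

```python
from typing import Dict, Tuple

# (platform, metric) -> (baseline score, Next-ViT score, gain as reported)
REPORTED: Dict[Tuple[str, str], Tuple[float, float, float]] = {
    ("TensorRT", "COCO mAP vs ResNet"): (40.4, 45.9, 5.5),
    ("TensorRT", "ADE20K mIoU vs ResNet"): (38.8, 46.5, 7.7),
    ("CoreML", "COCO mAP vs EfficientFormer"): (42.6, 47.2, 4.6),
    ("CoreML", "ADE20K mIoU vs EfficientFormer"): (45.1, 48.6, 3.5),
}


def compute_gains(
    reported: Dict[Tuple[str, str], Tuple[float, float, float]]
) -> Dict[Tuple[str, str], float]:
    """Recompute each gain as (Next-ViT score - baseline score), rounded to 0.1."""
    return {
        key: round(nextvit - baseline, 1)
        for key, (baseline, nextvit, _claimed) in reported.items()
    }


gains = compute_gains(REPORTED)
for key, (_, _, claimed) in REPORTED.items():
    status = "matches" if gains[key] == claimed else "differs from"
    print(f"{key[0]:8s} {key[1]:32s} gain={gains[key]} ({status} reported {claimed})")
```

The mIoU gains are differences in percentage points (for example 46.5% minus 38.8%), not relative improvements.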

This review was generated by AI and reviewed by human editors.