QUICK REVIEW

[Paper Review] RepViT: Revisiting Mobile CNN From ViT Perspective

Ao Wang, Hui Chen|arXiv (Cornell University)|Jul 18, 2023

Robotics and Automated Systems|Cited by 27

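ONE-SENTENCE SUMMARY

RepViT shows that a pure lightweight CNN built with design choices inspired by modern ViTs can outperform lightweight ViTs on mobile devices in both accuracy and latency: on ImageNet, the RepViT-M1.0 model reaches over 80% top-1 accuracy at 1.0 ms latency on an iPhone 12.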

ABSTRACT

Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many structural connections between lightweight ViTs and lightweight CNNs. However, the notable architectural disparities in the block structure, macro, and micro designs between them have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices. Specifically, we incrementally enhance the mobile-friendliness of a standard lightweight CNN, i.e., MobileNetV3, by integrating the efficient architectural designs of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably, on ImageNet, RepViT achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Besides, when RepViT meets SAM, our RepViT-SAM can achieve nearly 10× faster inference than the advanced MobileSAM. Codes and models are available at https://github.com/THU-MIG/RepViT.

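MOTIVATION AND OBJECTIVES

Assess the limitations of current lightweight CNNs and lightweight ViTs on mobile devices.
Explore ViT-inspired design choices to modernize MobileNetV3-L into a pure CNN backbone.
Show that RepViT achieves a better latency-accuracy trade-off on ImageNet and transfers well to downstream tasks.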

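PROPOSED METHOD

Start from MobileNetV3-L and incorporate ViT-inspired design principles step by step.
Introduce the RepViT block, which separates the token mixer from the channel mixer and uses structural re-parameterization.
Apply macro-architecture adjustments: a stem with early convolutions, deeper downsampling layers, a simplified classifier, and optimized stage ratios.
Refine the micro-architecture: standardize the kernel size to 3×3 and place squeeze-and-excitation (SE) layers in a cross-block manner.
Train and evaluate all models on ImageNet-1K; measure on-device latency on an iPhone 12 using Core ML Tools; validate on COCO and ADE20K.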
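
The structural re-parameterization used in the RepViT block can be illustrated with a small sketch. It assumes a training-time token mixer with three parallel depthwise branches (a 3×3 convolution, a 1×1 convolution, and an identity shortcut) on a single channel, with stride 1, zero padding, and no BatchNorm; the exact branch layout in the authors' code may differ. The function names (`conv3x3_same`, `multi_branch`, `fuse_kernel`) and all values are hypothetical. Because every branch is linear, the 1×1 weight and the identity can be folded into the centre of the 3×3 kernel, so inference needs only one convolution.

```python
# Sketch only: single channel, stride 1, zero padding, no BatchNorm.
# Names and values are illustrative, not taken from the RepViT code base.

def conv3x3_same(image, kernel):
    """Apply a 3x3 cross-correlation with zero padding to a 2D list."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out


def multi_branch(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch (scalar k1) + identity."""
    y3 = conv3x3_same(image, k3)
    return [[y3[i][j] + k1 * image[i][j] + image[i][j]
             for j in range(len(image[0]))]
            for i in range(len(image))]


def fuse_kernel(k3, k1):
    """Inference-time form: fold the 1x1 weight and identity into the centre tap."""
    fused = [row[:] for row in k3]  # copy so the original kernel is untouched
    fused[1][1] += k1 + 1.0
    return fused


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.5, -1.0, 0.25], [0.0, 2.0, -0.5], [1.0, 0.0, -0.25]]
k1 = 0.75

y_train = multi_branch(image, k3, k1)
y_infer = conv3x3_same(image, fuse_kernel(k3, k1))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(y_train, y_infer)
               for a, b in zip(row_a, row_b))
print("largest difference between the two forms:", max_diff)
```

The two forms agree up to floating-point rounding, which is why the extra branches add no cost at inference time. In a real network, BatchNorm statistics would also be folded into the kernel and bias before merging.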

Figure 1 : Comparison of latency and accuracy between RepViT (Ours) and other lightweight models. The top-1 accuracy is tested on ImageNet-1K and the latency is measured by iPhone 12 with iOS 16. RepViT achieves high performance with low latency across various model sizes.

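EXPERIMENTAL RESULTS

Research Questions

RQ1: Can design choices borrowed from lightweight ViTs improve the accuracy and latency of pure CNNs on mobile devices?
RQ2: Which macro and micro design adjustments best bridge the efficiency gap between CNNs and ViTs on edge devices?
RQ3: How well does RepViT perform on ImageNet, and how well does it transfer to downstream tasks, compared with state-of-the-art lightweight ViTs and CNNs?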

Key Findings

Model	Type	Params (M)	GMACs	Latency (ms)	Throughput (im/s)	Epochs	Top-1 (%)
MobileNetV2x1.0	CONV	3.5	0.3	0.9	6550	300	71.8
RepViT-M0.9	CONV	5.1	0.8	0.9	4817	300/450	78.7/79.1
RepViT-M1.0	CONV	6.8	1.1	1.0	3910	300/450	80.0/80.3
RepViT-M1.5	CONV	14.0	2.3	1.5	2151	300/450	82.3/82.5
RepViT-M2.3*	CONV	22.9	4.5	2.3	1184	300/450	83.3/83.7
PVT-Small	Attention	24.5	3.8	24.4	1165	300	79.8
DeiT-S	Attention	22.5	4.5	11.8	1419	300	81.2
EfficientFormerV2-S2*	Hybrid	6.1	0.7	1.1	1153	300/450	79.0/79.7
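
As a rough check of the latency-accuracy comparison, the rows above can be filtered down to a Pareto front: the models that no other listed model beats on both latency and accuracy. The sketch below uses only the latency and 300-epoch top-1 values from this table; the `pareto_front` helper is hypothetical and the conclusion applies only to the eight models listed here.

```python
# (model, latency in ms on iPhone 12, top-1 % at 300 epochs), copied from the table.
rows = [
    ("MobileNetV2x1.0", 0.9, 71.8),
    ("RepViT-M0.9", 0.9, 78.7),
    ("RepViT-M1.0", 1.0, 80.0),
    ("RepViT-M1.5", 1.5, 82.3),
    ("RepViT-M2.3*", 2.3, 83.3),
    ("PVT-Small", 24.4, 79.8),
    ("DeiT-S", 11.8, 81.2),
    ("EfficientFormerV2-S2*", 1.1, 79.0),
]


def pareto_front(points):
    """Return names of models not dominated on (lower latency, higher accuracy)."""
    front = []
    for name, lat, acc in points:
        dominated = any(
            o_lat <= lat and o_acc >= acc and (o_lat < lat or o_acc > acc)
            for _, o_lat, o_acc in points
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(rows)
print(front)
```

Within this subset, only the four RepViT variants remain on the front; every other row is matched or beaten on latency by a RepViT model with higher accuracy.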

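Across model sizes, RepViT outperforms existing state-of-the-art lightweight ViTs and CNNs in the latency-accuracy trade-off.
RepViT-M0.9 through RepViT-M2.3 run at 0.9 ms to 2.3 ms on an iPhone 12 while reaching 78.7% to 83.7% top-1 accuracy on ImageNet.
RepViT-M1.0 reaches 80.0% top-1 accuracy (80.3% with 450 epochs) at 1.0 ms latency on an iPhone 12; RepViT-M2.3 reaches 83.7% at 2.3 ms.
On downstream tasks (COCO object detection and instance segmentation, ADE20K semantic segmentation), RepViT backbones achieve competitive AP and mIoU at lower latency than many competitors.
Structural re-parameterization and cross-block SE placement consistently improve the accuracy-latency trade-off.
RepViT demonstrates that a pure lightweight CNN incorporating ViT-inspired design principles can outperform lightweight ViTs on mobile devices.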
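
The cross-block SE placement named in the findings can be pictured as a simple per-stage pattern. The sketch below assumes that "cross-block" means alternating blocks, starting with the first block of each stage; the `se_flags` helper and the stage depths are hypothetical and chosen only for illustration, not taken from a RepViT configuration.

```python
def se_flags(stage_depths, every=2):
    """For each stage, mark which block indices would receive an SE layer."""
    return [[i % every == 0 for i in range(depth)] for depth in stage_depths]


# Hypothetical stage depths, for illustration only.
flags = se_flags([2, 2, 6, 2])
se_blocks = sum(sum(stage) for stage in flags)
total_blocks = sum(len(stage) for stage in flags)

print(flags[2])
print(se_blocks, "of", total_blocks, "blocks carry an SE layer")
```

Under these assumptions, half of the blocks skip the SE layer, which is one way to keep part of its accuracy benefit while reducing its latency cost.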

Figure 2 : We modernize MobileNetV3-L from various granularities. We mainly consider the latency on mobile devices and the top-1 accuracy on ImageNet-1K. Finally, we obtain a new family of pure lightweight CNNs, namely RepViT, which can achieve lower latency and higher performance.

This review was generated by AI and checked by a human editor.