QUICK REVIEW

[Paper Review] RepViT: Revisiting Mobile CNN From ViT Perspective

Ao Wang, Hui Chen|arXiv (Cornell University)|2023. 07. 18.

Robotics and Automated Systems | Citations: 27

One-Line Summary

RepViT shows that a pure lightweight CNN, modernized with ViT-inspired architectural choices, can outperform lightweight ViTs on mobile devices: RepViT-M1.0 exceeds 80% top-1 accuracy on ImageNet at 1 ms latency on an iPhone 12.

ABSTRACT

Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many structural connections between lightweight ViTs and lightweight CNNs. However, the notable architectural disparities in the block structure, macro, and micro designs between them have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices. Specifically, we incrementally enhance the mobile-friendliness of a standard lightweight CNN, i.e., MobileNetV3, by integrating the efficient architectural designs of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably, on ImageNet, RepViT achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Besides, when RepViT meets SAM, our RepViT-SAM can achieve nearly 10× faster inference than the advanced MobileSAM. Codes and models are available at https://github.com/THU-MIG/RepViT.

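Motivation and Objectives

Assess the limitations of current lightweight CNNs and lightweight ViTs on mobile devices.
Explore ViT-inspired architectural choices for modernizing MobileNetV3-L as a pure CNN backbone.
Demonstrate that RepViT achieves a strong latency-accuracy trade-off on ImageNet and transfers well to downstream tasks.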

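Proposed Method

Start from MobileNetV3-L and incrementally introduce ViT-inspired design principles.
Introduce the RepViT block, which separates the token mixer from the channel mixer and uses structural re-parameterization.
Modify the macro architecture: early convolutions for the stem, deeper downsampling layers, a simplified classifier, and a tuned stage ratio.
Refine the micro architecture: standardize on 3x3 kernels and place squeeze-and-excitation (SE) layers in a cross-block manner.
Train and evaluate all models on ImageNet-1K, measure on-device latency on an iPhone 12 using Core ML Tools, and validate on COCO and ADE20K.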
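
To make the structural re-parameterization step concrete, here is a minimal pure-Python sketch for a single depthwise channel. It assumes a training-time block with three parallel branches (a 3x3 convolution, a 1x1 convolution, and an identity shortcut) and omits BatchNorm for brevity; the helper names `conv2d_same`, `multi_branch`, and `merge_branches` are hypothetical and are not taken from the RepViT codebase. The point is only that the three branches can be folded into one 3x3 kernel that produces the same output.

```python
# Minimal sketch of structural re-parameterization for one depthwise channel.
# Assumption: BatchNorm is left out; branch layout and names are illustrative only.

def conv2d_same(x, k):
    """Cross-correlate a 2D list x with an odd-sized square kernel k, using zero padding."""
    h, w, r = len(x), len(x[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += x[ii][jj] * k[di + r][dj + r]
            out[i][j] = acc
    return out

def multi_branch(x, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch + identity shortcut."""
    y3 = conv2d_same(x, k3)
    y1 = conv2d_same(x, [[k1]])
    return [[y3[i][j] + y1[i][j] + x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]

def merge_branches(k3, k1):
    """Inference-time form: fold the 1x1 weight and the identity into the 3x3 centre tap."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0
    return merged

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4], [0.25, 0.0, -0.1]]
k1 = 0.75

y_train = multi_branch(x, k3, k1)
y_infer = conv2d_same(x, merge_branches(k3, k1))
max_diff = max(abs(a - b) for ra, rb in zip(y_train, y_infer) for a, b in zip(ra, rb))
print(max_diff < 1e-9)
```

Because convolution is linear, the merged kernel reproduces the multi-branch output up to floating-point rounding, so the extra branches cost nothing at inference time.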

Figure 1 : Comparison of latency and accuracy between RepViT (Ours) and other lightweight models. The top-1 accuracy is tested on ImageNet-1K and the latency is measured by iPhone 12 with iOS 16. RepViT achieves high performance with low latency across various model sizes.

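Experimental Results

Research Questions

RQ1: Can the architectural choices of lightweight ViTs improve the accuracy and latency of a pure CNN on mobile devices?
RQ2: Which macro- and micro-level design adjustments best bridge the efficiency gap between CNNs and ViTs on edge devices?
RQ3: How does RepViT compare with state-of-the-art lightweight ViTs and CNNs in ImageNet performance and transfer to downstream tasks?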

Key Results

Model	Type	Params (M)	GMACs	Latency (ms)	Throughput (im/s)	Epochs	Top-1 (%)
MobileNetV2x1.0	CONV	3.5	0.3	0.9	6550	300	71.8
RepViT-M0.9	CONV	5.1	0.8	0.9	4817	300/450	78.7/79.1
RepViT-M1.0	CONV	6.8	1.1	1.0	3910	300/450	80.0/80.3
RepViT-M1.5	CONV	14.0	2.3	1.5	2151	300/450	82.3/82.5
RepViT-M2.3*	CONV	22.9	4.5	2.3	1184	300/450	83.3/83.7
PVT-Small	Attention	24.5	3.8	24.4	1165	300	79.8
DeiT-S	Attention	22.5	4.5	11.8	1419	300	81.2
EfficientFormerV2-S2*	Hybrid	6.1	0.7	1.1	1153	300/450	79.0/79.7
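
One way to read the table is to ask which rows are not beaten on both latency and accuracy by any other row. The sketch below computes that Pareto front from the latency and 300-epoch top-1 columns, with values copied from the rows above. The `dominates` helper is a hypothetical name, and restricting the comparison to these eight rows and two columns is an assumption of this illustration, not the paper's evaluation protocol.

```python
# (name, latency in ms on iPhone 12, top-1 % at 300 epochs), copied from the table above
models = [
    ("MobileNetV2x1.0", 0.9, 71.8),
    ("RepViT-M0.9", 0.9, 78.7),
    ("RepViT-M1.0", 1.0, 80.0),
    ("RepViT-M1.5", 1.5, 82.3),
    ("RepViT-M2.3*", 2.3, 83.3),
    ("PVT-Small", 24.4, 79.8),
    ("DeiT-S", 11.8, 81.2),
    ("EfficientFormerV2-S2*", 1.1, 79.0),
]

def dominates(a, b):
    """True if a is no slower and no less accurate than b, and strictly better in one."""
    return a[1] <= b[1] and a[2] >= b[2] and (a[1] < b[1] or a[2] > b[2])

# Keep every model that no other model dominates.
pareto = [m[0] for m in models if not any(dominates(o, m) for o in models)]
print(pareto)  # ['RepViT-M0.9', 'RepViT-M1.0', 'RepViT-M1.5', 'RepViT-M2.3*']
```

Among the rows listed here, only the four RepViT variants remain on the front; every other row is matched or beaten on latency by a RepViT model with higher accuracy.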

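Across model sizes, RepViT offers a better latency-accuracy trade-off than existing state-of-the-art lightweight ViTs and CNNs.
From RepViT-M0.9 to RepViT-M2.3, the models reach strong ImageNet accuracy at low on-device latency on an iPhone 12 (from 0.9 ms for M0.9 to 2.3 ms for M2.3).
RepViT-M1.0 reaches at least 80% top-1 accuracy at 1.0 ms on an iPhone 12 (80.0% after 300 epochs, 80.3% after 450); RepViT-M2.3 reaches 83.7% at 2.3 ms after 450 epochs.
On downstream tasks (COCO object detection and instance segmentation, ADE20K semantic segmentation), RepViT backbones deliver competitive AP and mIoU at lower latency than many competitors.
Structural re-parameterization and cross-block SE placement consistently improve the accuracy-latency trade-off.
Overall, RepViT shows that a pure lightweight CNN that integrates ViT-inspired architectural principles can outperform lightweight ViTs on mobile devices.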
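
The cross-block SE placement can be pictured as a simple alternating pattern within each stage. The sketch below assumes SE is enabled on the 1st, 3rd, 5th, ... block of a stage; the `se_flags` helper and the `stage_depths` values are hypothetical and chosen only for illustration, not taken from any released RepViT configuration.

```python
def se_flags(depth):
    """One boolean per block in a stage: SE on the 1st, 3rd, 5th, ... block."""
    return [i % 2 == 0 for i in range(depth)]

# Hypothetical per-stage block counts, for illustration only.
stage_depths = [2, 2, 14, 2]

layout = [se_flags(d) for d in stage_depths]
se_blocks = sum(sum(flags) for flags in layout)
total_blocks = sum(stage_depths)
print(se_blocks, "of", total_blocks, "blocks use SE")
```

Under this pattern roughly half of the blocks carry an SE layer, which keeps part of the accuracy benefit of SE while avoiding its latency cost in the remaining blocks.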

Figure 2 : We modernize MobileNetV3-L from various granularities. We mainly consider the latency on mobile devices and the top-1 accuracy on ImageNet-1K. Finally, we obtain a new family of pure lightweight CNNs, namely RepViT, which can achieve lower latency and higher performance.

This review was generated by AI and reviewed by a human editor.