QUICK REVIEW

[Paper Review] Mobile-Former: Bridging MobileNet and Transformer

Yinpeng Chen, Xiyang Dai|arXiv (Cornell University)|2021. 08. 12.

Advanced Neural Network Applications | Citations: 36

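One-Line Summary

Mobile-Former runs MobileNet and a lightweight Transformer in parallel, connected by a two-way bridge. On ImageNet it reaches higher accuracy at similar or lower FLOPs, and in object detection it outperforms both the MobileNetV3 and DETR baselines.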

ABSTRACT

We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and transformer at global interaction. And the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformer, the transformer in Mobile-Former contains very few tokens (e.g. 6 or fewer tokens) that are randomly initialized to learn global priors, resulting in low computational cost. Combining with the proposed light-weight cross attention to model the bridge, Mobile-Former is not only computationally efficient, but also has more representation power. It outperforms MobileNetV3 at low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but saving 17% of computations. When transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in RetinaNet framework. Furthermore, we build an efficient end-to-end detector by replacing backbone, encoder and decoder in DETR with Mobile-Former, which outperforms DETR by 1.1 AP but saves 52% of computational cost and 36% of parameters.

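Motivation and Objectives

Motivate an efficient architecture that combines the local feature processing of CNNs with the global interaction of Transformers in a parallel design.
Introduce a lightweight two-way bridge that fuses local and global features with minimal computation.
Show that a Transformer with only a few tokens can deliver substantial gains in the low-FLOP regime without a large cost.
Demonstrate improvements on ImageNet classification, object detection, and an end-to-end DETR-style pipeline.
Explore ablation experiments to understand the contribution of the tokens, the token dimension, and the role of dynamic ReLU.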

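Proposed Method

Presents Mobile-Former as a parallel architecture of MobileNet blocks and a small-token Transformer (M ≤ 6, d ≤ 192), and introduces learnable global tokens.
Introduces a lightweight cross-attention bridge that enables Mobile → Former and Former → Mobile interaction while removing the query, key, and value projections on the Mobile side to save computation.
Defines the Mobile-Former block, which consists of a Mobile sub-block, a Former sub-block, and two cross-attention bridges (Mobile→Former and Former→Mobile).
Applies a spatially dependent dynamic ReLU in the Mobile branch, with parameters generated from the global tokens. This includes a refinement in the head of the end-to-end detector that uses all tokens for parameter generation.
Describes the network variants used in the experiments (Mobile-Former-26M through Mobile-Former-508M) and details the 294M-FLOP configuration, which uses 6 global tokens of dimension 192.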
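
The two-way bridge described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, a residual connection on both sides, scaling by the square root of the channel count, and randomly initialized weights. The function names, the shapes (49 positions, 32 channels), and the weight names `w_q`, `w_o`, `w_k`, `w_v` are hypothetical. The only detail taken from the review is that the Mobile side carries no projections, so local features enter the attention unprojected.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def mobile_to_former(x, z, w_q, w_o):
    """Tokens query the local features.

    x: (L, C) local features, used as keys and values WITHOUT projection.
    z: (M, d) global tokens; only the tokens are projected (w_q: d -> C).
    """
    q = z @ w_q                                      # (M, C)
    attn = softmax(q @ x.T / np.sqrt(x.shape[1]))    # (M, L)
    return z + (attn @ x) @ w_o                      # (M, d), assumed residual

def former_to_mobile(x, z, w_k, w_v):
    """Local features query the tokens.

    x is used as the query WITHOUT projection; tokens are projected
    to keys and values (w_k, w_v: d -> C).
    """
    k = z @ w_k                                      # (M, C)
    v = z @ w_v                                      # (M, C)
    attn = softmax(x @ k.T / np.sqrt(x.shape[1]))    # (L, M)
    return x + attn @ v                              # (L, C), assumed residual

rng = np.random.default_rng(0)
L, C, M, d = 49, 32, 6, 192          # hypothetical sizes; M and d follow the review
x = rng.standard_normal((L, C))      # flattened 7x7 feature map
z = rng.standard_normal((M, d))      # global tokens

z_new = mobile_to_former(x, z,
                         0.02 * rng.standard_normal((d, C)),
                         0.02 * rng.standard_normal((C, d)))
x_new = former_to_mobile(x, z_new,
                         0.02 * rng.standard_normal((d, C)),
                         0.02 * rng.standard_normal((d, C)))
print(z_new.shape, x_new.shape)
```

Because projections exist only on the token side, the projection cost in this sketch scales with the number of tokens M rather than with the number of spatial positions L.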

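Experimental Results

Research Questions

RQ1: Can a parallel MobileNet-Transformer design with a lightweight two-way bridge outperform existing CNNs and ViTs on ImageNet at low FLOP budgets?
RQ2: Is a Transformer with only a few tokens, connected to MobileNet through an efficient bridge, sufficient to model global interaction?
RQ3: How do the number of tokens and the token dimension affect the accuracy and efficiency of Mobile-Former?
RQ4: Can Mobile-Former serve as an efficient backbone for RetinaNet and DETR-style detectors, improving AP while reducing computational cost?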
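
To give a rough sense of why the token count matters for efficiency (RQ2 and RQ3), the sketch below counts multiply-accumulates for the attention score matrix alone. It compares a bridge in which M tokens attend over L spatial positions against full self-attention over the same L positions. The feature-map size (14x14) and channel width (96) are hypothetical values chosen for illustration, not figures from the paper, and the count ignores projections, the value aggregation, and the softmax.

```python
def attention_score_macs(num_queries, num_keys, channels):
    """Multiply-accumulates needed to form the query-key score matrix only."""
    return num_queries * num_keys * channels

positions = 14 * 14   # hypothetical feature-map size (L)
channels = 96         # hypothetical channel width (C)

full = attention_score_macs(positions, positions, channels)
for tokens in (1, 3, 6):
    bridge = attention_score_macs(tokens, positions, channels)
    # The ratio simplifies to L / M: the fewer the tokens, the larger the saving.
    print(f"M={tokens}: bridge={bridge} MACs, full/bridge={full / bridge:.1f}")
```

Under these assumptions the score computation shrinks by a factor of L / M, which is the intuition behind keeping the token count at 6 or fewer.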

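Key Results

Mobile-Former achieves 77.9% ImageNet top-1 accuracy at 294M FLOPs, 1.3% higher than MobileNetV3 while using 17% less computation.
In object detection, a Mobile-Former backbone improves RetinaNet AP by 8.6 points over MobileNetV3 at comparable cost.
An end-to-end detector that replaces the backbone, encoder, and decoder of DETR with Mobile-Former achieves 1.1 AP higher than DETR with 52% fewer FLOPs and 36% fewer parameters.
Across the 25M to 500M FLOP range, Mobile-Former consistently outperforms both efficient CNNs and vision transformers at low FLOP budgets.
Even a single global token gives strong performance, and gains keep growing until they saturate at 6 tokens (d=192).
Spatially dependent dynamic ReLU and the adaptation of positional embeddings contribute meaningful gains on COCO detection (the three-component ablation shows cumulative improvement).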
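
The headline ImageNet figures can be unpacked with a little arithmetic. The sketch below back-computes the MobileNetV3 baseline implied by the numbers quoted above (294M FLOPs, a 17% saving, a 1.3% accuracy gain). The helper name `implied_baseline` is hypothetical, and because the quoted percentages are rounded, the results are approximations rather than values reported in the review.

```python
def implied_baseline(cost, saving_fraction):
    """Baseline cost implied by a reduced cost and a fractional saving."""
    return cost / (1.0 - saving_fraction)

mobile_former_mflops = 294.0   # from the review
saving = 0.17                  # from the review (rounded)
top1_gain = 1.3                # from the review

baseline_mflops = implied_baseline(mobile_former_mflops, saving)
baseline_top1 = 77.9 - top1_gain

print(f"implied MobileNetV3 baseline: ~{baseline_mflops:.0f}M FLOPs, "
      f"~{baseline_top1:.1f}% top-1")
```

The same helper applies to the detection result: a 52% FLOP saving means the Mobile-Former detector uses a little under half the computation of DETR.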

This review was generated by AI and reviewed by a human editor.