QUICK REVIEW

[Paper Review] MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Sachin Mehta, Mohammad Rastegari|arXiv (Cornell University)|Oct 5, 2021

Advanced Neural Network Applications|References: 60|Citations: 142

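One-line summary

MobileViT combines the inductive biases of CNNs with the global processing of transformers to build a light-weight, mobile-friendly vision transformer. It outperforms CNNs and ViTs with similar parameter budgets, achieves strong results on ImageNet and COCO with a simple training recipe, and demonstrates mobile-ready efficiency.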

ABSTRACT

Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. Our source code is open-source and available at: https://github.com/apple/ml-cvnets

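Motivation and objectives

Motivate the design of a light-weight, mobile-ready vision transformer that retains the inductive biases of CNNs while learning global representations.
Propose the MobileViT block, which fuses local convolutions with transformer-based global processing.
Show that MobileViT achieves strong accuracy on mobile vision tasks with a modest parameter count and a simple training setup.
Show that MobileViT is a general-purpose backbone for detection and semantic segmentation on mobile vision benchmarks.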

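Proposed method

Introduce the MobileViT block, which unfolds the feature map into a patch-based representation and applies a transformer across patches to capture inter-patch relationships.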
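Project the resulting patch-wise global representation with a 1x1 convolution, concatenate it with the local CNN features, and fuse the result with a final convolution.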
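Train light-weight MobileViT networks (XXS, XS, S) with basic data augmentation and a multi-scale training sampler to improve efficiency and generalization.
Evaluate MobileViT as a backbone for SSD-based detection (SSDLite) and DeepLabv3-based segmentation to demonstrate its versatility.
Discuss multi-scale training with variable batch sizes as a way to improve training efficiency and multi-scale representations.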
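
The MobileViT block described in the first two items above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the n x n convolutions are omitted, the transformer is replaced by a single self-attention step with identity Q/K/V projections, and the names `unfold`, `fold`, `self_attention`, `mobilevit_block_sketch`, `w_in`, and `w_out` are hypothetical. It only shows how the feature map is unfolded into patches, how attention runs across patches for each pixel position, and how the result is folded back, projected, and concatenated with the input.

```python
import numpy as np

def unfold(x, ph, pw):
    """(H, W, d) -> (P, N, d), with P = ph*pw pixels per patch and N patches."""
    H, W, d = x.shape
    nh, nw = H // ph, W // pw
    x = x.reshape(nh, ph, nw, pw, d)      # split rows and columns into (patch, offset)
    x = x.transpose(1, 3, 0, 2, 4)        # (ph, pw, nh, nw, d)
    return x.reshape(ph * pw, nh * nw, d)

def fold(x, H, W, ph, pw):
    """Inverse of unfold: (P, N, d) -> (H, W, d)."""
    nh, nw = H // ph, W // pw
    d = x.shape[-1]
    x = x.reshape(ph, pw, nh, nw, d)
    x = x.transpose(2, 0, 3, 1, 4)        # (nh, ph, nw, pw, d)
    return x.reshape(H, W, d)

def self_attention(tokens):
    """Stand-in for a transformer layer (assumption): single-head attention
    with identity projections. tokens: (P, N, d); attention runs over the
    N patches independently for each pixel position p."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)   # (P, N, N)
    scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens                                    # (P, N, d)

def mobilevit_block_sketch(x, w_in, w_out, ph=2, pw=2):
    H, W, C = x.shape
    local = x @ w_in                                   # 1x1 conv, C -> d
    patches = unfold(local, ph, pw)                    # (P, N, d)
    global_feats = fold(self_attention(patches), H, W, ph, pw)
    projected = global_feats @ w_out                   # 1x1 conv, d -> C
    # Concatenate with the input; the final fusion convolution is omitted here.
    return np.concatenate([x, projected], axis=-1)     # (H, W, 2C)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))       # toy feature map: H=8, W=8, C=3
w_in = rng.standard_normal((3, 5))       # hypothetical 1x1 weights, C=3 -> d=5
w_out = rng.standard_normal((5, 3))      # hypothetical 1x1 weights, d=5 -> C=3
y = mobilevit_block_sketch(x, w_in, w_out)
print(y.shape)  # (8, 8, 6)
```

Because attention is applied across patches at a fixed pixel offset, and each pixel already carries local context from the preceding convolution, every output position can in principle draw on the whole feature map while the spatial layout is preserved by the fold step.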

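Experimental results

Research questions

RQ1: Can a light-weight vision transformer outperform state-of-the-art light-weight CNNs and ViTs on ImageNet-1k under a comparable parameter budget?
RQ2: Does integrating CNN-like local processing with transformer-based global processing improve generalization and robustness on mobile vision tasks?
RQ3: Does MobileViT work effectively as a backbone for mobile object detection and semantic segmentation?
RQ4: Does multi-scale training with variable batch sizes benefit MobileViT's efficiency and performance?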
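
RQ4 concerns multi-scale training with variable batch sizes. The idea can be sketched with a small sampler that picks a resolution per batch and scales the batch size inversely with image area, so the pixel count per batch stays roughly constant. This is a hedged sketch: the function name `multiscale_batches`, the resolution list, and the base batch size are illustrative assumptions, and the review does not state the exact scaling rule used.

```python
import random

def multiscale_batches(num_samples, resolutions, base_batch, seed=0):
    """Yield (height, width, indices) tuples.

    base_batch is the batch size at the largest resolution; smaller
    resolutions get proportionally larger batches (assumed rule).
    """
    rng = random.Random(seed)
    max_h, max_w = max(resolutions, key=lambda r: r[0] * r[1])
    budget = max_h * max_w * base_batch        # pixel budget per batch
    indices = list(range(num_samples))
    rng.shuffle(indices)
    start = 0
    while start < num_samples:
        h, w = rng.choice(resolutions)         # resolution for this batch
        batch_size = max(1, budget // (h * w)) # more images when they are smaller
        yield h, w, indices[start:start + batch_size]
        start += batch_size

# Usage: each batch would be resized to (h, w) before the forward pass.
for h, w, batch in multiscale_batches(100, [(160, 160), (224, 224), (320, 320)], base_batch=8):
    print(h, w, len(batch))
```

Keeping the pixel budget fixed means low-resolution batches hold more images, which is how a sampler of this kind can cut the number of optimizer steps per epoch while exposing the network to several input scales.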

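Key findings

MobileViT reaches 78.4% top-1 accuracy on ImageNet-1k with roughly 5–6 million parameters, outperforming MobileNetv3 and DeiT at a comparable budget (by 3.2% and 6.2%, respectively, according to the abstract).
On MS-COCO detection, MobileViT-XS/S backbones improve mAP over MobileNet-based backbones by up to 1.8–?% in SSD/SSDLite configurations, with a small parameter count.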
As a DeepLabv3 segmentation backbone, MobileViT is competitive with much heavier CNN backbones while using far fewer parameters (e.g., 79.1 mIoU with 6.4M parameters, versus 80.5 for ResNet-101).
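Compared with pure ViTs, MobileViT is more stable to optimize and generalizes better with only basic augmentation, reducing the need for training tricks.
On mobile devices, MobileViT achieves real-time inference (FPS) with fewer parameters than several ViT baselines; the 2.3M-parameter model takes about 7.28 ms per inference on an iPhone 12.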
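
To relate the latency figure in the last finding to a frame rate, the conversion below assumes one image per forward pass and ignores pre- and post-processing time (both are assumptions, not stated in the review):

```latex
\text{FPS} \approx \frac{1000\ \text{ms/s}}{7.28\ \text{ms/frame}} \approx 137
```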

This review was generated by AI and checked by a human editor.