QUICK REVIEW

[Paper Review] MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Sachin Mehta, Mohammad Rastegari|arXiv (Cornell University)|Oct 5, 2021

Advanced Neural Network Applications|Citations: 727

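One-line summary

MobileViT combines CNN-like local inductive biases with transformer-based global processing to build a light-weight, mobile-friendly vision transformer. It outperforms CNNs and ViTs with a similar number of parameters, and delivers strong results on ImageNet, object detection, and segmentation with a simple training recipe.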

ABSTRACT

Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. Our source code is open-source and available at: https://github.com/apple/ml-cvnets

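Motivation and objectives

Motivate the need for a light-weight, mobile-friendly vision transformer that combines the inductive biases of CNNs with global processing.
Propose the MobileViT block, which fuses local convolutional processing with patch-level transformer attention.
Show that MobileViT reaches competitive accuracy on classification, detection, and segmentation tasks with a small parameter count and a simple training recipe.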

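Proposed method

The MobileViT block applies an n×n convolution for local processing, followed by a 1×1 projection to dimension d. It then unfolds X into patches, applies a transformer to each sequence of patches to model inter-patch relationships, and folds the result back into a feature map at the original spatial resolution.
The transformed features are concatenated with the original features, and a subsequent fusion convolution produces the block's output.
MobileViT is trained at three network sizes (S, XS, XXS). The networks use a strided 3×3 convolution stem, MV2 (MobileNetv2) blocks for downsampling, Swish activation, basic data augmentation, and the AdamW optimizer.
A multi-scale sampler varies the input resolution and batch size during training, which improves efficiency and generalization.
The models are compared against light-weight CNNs, ViT variants, and heavy-weight CNNs using standard evaluation metrics on ImageNet-1k, MS-COCO (SSD/SSDLite), and PASCAL VOC 2012.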
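
The unfold and fold steps of the block can be pictured with a small sketch. This is a plain-Python illustration under simplifying assumptions, not the authors' implementation: the feature map is a single-channel H×W grid rather than an H×W×d tensor, the transformer itself is omitted, and the function names `unfold` and `fold` are hypothetical.

```python
def unfold(feature_map, patch_h, patch_w):
    """Rearrange an H x W grid into P sequences of N entries.

    P = patch_h * patch_w is the number of pixel positions inside a patch.
    N = (H // patch_h) * (W // patch_w) is the number of patches.
    Sequence p collects the pixel at position p from every patch, so a
    transformer run over one sequence relates the same pixel position
    across all patches.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_h or width % patch_w:
        raise ValueError("H and W must be divisible by the patch size")
    sequences = []
    for dy in range(patch_h):
        for dx in range(patch_w):
            sequences.append([
                feature_map[py + dy][px + dx]
                for py in range(0, height, patch_h)
                for px in range(0, width, patch_w)
            ])
    return sequences


def fold(sequences, height, width, patch_h, patch_w):
    """Inverse of unfold: scatter the P x N sequences back to H x W."""
    grid = [[None] * width for _ in range(height)]
    patches_per_row = width // patch_w
    for p, seq in enumerate(sequences):
        dy, dx = divmod(p, patch_w)          # position inside the patch
        for n, value in enumerate(seq):
            py, px = divmod(n, patches_per_row)  # patch row and column
            grid[py * patch_h + dy][px * patch_w + dx] = value
    return grid


# A 4 x 4 grid split into 2 x 2 patches gives P = 4 sequences of N = 4.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
sequences = unfold(grid, 2, 2)
# A transformer would process each sequence here (omitted in this sketch).
restored = fold(sequences, 4, 4, 2, 2)
assert restored == grid
```

Because `fold` exactly inverts `unfold`, the spatial position of every pixel is preserved, so the output can be concatenated with the original features as described above.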

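Experimental results

Research questions

RQ1: Can a light-weight, mobile-friendly ViT model reach CNN-like performance by combining local convolutions with global transformer processing?
RQ2: At a comparable parameter budget, does MobileViT generalize better and train more robustly than pure ViT variants and the CNN baselines?
RQ3: Can MobileViT serve as a general-purpose backbone for classification, detection, and segmentation on mobile platforms?
RQ4: How do multi-scale training and the choice of patch size affect accuracy and latency on mobile devices?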
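
The multi-scale training in RQ4 changes input resolution and batch size together. One way to picture this is a sampler that keeps the number of pixels per batch roughly constant, so smaller images get larger batches. The sketch below assumes that inverse-proportional rule; the function names, the candidate resolutions, and the base batch size of 128 are hypothetical values chosen for illustration, not settings taken from this review.

```python
import random


def batch_size_for(resolution, base_resolution=(256, 256), base_batch_size=128):
    """Scale batch size inversely with pixel count (assumed rule).

    The pixel budget per batch is fixed by the base resolution and base
    batch size; integer division rounds the result down.
    """
    height, width = resolution
    base_h, base_w = base_resolution
    return max(1, (base_h * base_w * base_batch_size) // (height * width))


def sample_schedule(resolutions, num_iterations, seed=0):
    """Pick a random resolution per iteration with its matching batch size."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(num_iterations):
        resolution = rng.choice(resolutions)
        schedule.append((resolution, batch_size_for(resolution)))
    return schedule


# Hypothetical candidate resolutions for illustration only.
candidates = [(160, 160), (192, 192), (256, 256), (320, 320)]
for resolution, batch_size in sample_schedule(candidates, num_iterations=5):
    print(resolution, batch_size)
```

With these assumed defaults, a 256×256 input keeps the base batch size of 128, while a 320×320 input drops to 81 images per batch.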

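Key findings

On ImageNet-1k, MobileViT-S reaches 78.4% top-1 accuracy with 5.6M parameters, outperforming MobileNetv3 and DeIT at a comparable parameter budget.
MobileViT-XS (~2.3M parameters) reaches 74.8% top-1 accuracy, and the XXS/XS/S variants offer a better parameter-accuracy trade-off than light-weight CNNs.
On MS-COCO object detection, MobileViT backbones outperform MobileNetv3 in mAP; the abstract reports a 5.7% gain at a similar number of parameters.
For semantic segmentation with DeepLabv3, MobileViT backbones reach 77.1% mIoU (MobileViT-XS) and 79.1% mIoU (MobileViT-S) with far fewer parameters than ResNet-101-based backbones.
MobileViT backbones generalize better and train more robustly: they work with simple augmentation and are less sensitive to L2 regularization than many ViT variants.
On mobile devices, MobileViT achieves real-time inference and, once hardware and kernel optimizations are taken into account, is faster than typical ViT backbones.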


This review was generated by AI and checked by a human editor.