QUICK REVIEW

[Paper Review] Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Chaoning Zhang, Dongshen Han|arXiv (Cornell University)|Jun 25, 2023

Advanced Neural Network Applications | Cited by 141

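One-Line Summary

This paper proposes MobileSAM, a lightweight, mobile-friendly variant of Segment Anything. It uses decoupled distillation to replace the heavyweight image encoder, achieving segmentation performance on par with the original SAM while being far smaller and faster than both the original SAM and FastSAM.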

ABSTRACT

Segment Anything Model (SAM) has attracted significant attention due to its impressive zero-shot transfer performance and high versatility for numerous vision applications (like image editing with fine-grained control). Many of such applications need to be run on resource-constraint edge devices, like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. A naive way to train such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training sources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, motivated by which we propose decoupled distillation. Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be automatically compatible with the mask decoder in the original SAM. The training can be completed on a single GPU within less than one day, and the resulting lightweight SAM is termed MobileSAM which is more than 60 times smaller yet performs on par with the original SAM. For inference speed, with a single GPU, MobileSAM runs around 10ms per image: 8ms on the image encoder and 4ms on the mask decoder. With superior performance, our MobileSAM is around 5 times faster than the concurrent FastSAM and 7 times smaller, making it more suitable for mobile applications. Moreover, we show that MobileSAM can run relatively smoothly on CPU. The code for our project is provided at https://github.com/ChaoningZhang/MobileSAM, with a demo showing that MobileSAM can run relatively smoothly on CPU.

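Motivation and Objectives

Make the case for deploying SAM on resource-constrained devices such as mobile phones.
Reduce model size and speed up inference by replacing the heavyweight image encoder.
Keep compatibility with the original SAM mask decoder without large-scale retraining.
Show that distillation-based training can yield a segmentation model that is both lightweight and accurate.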

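Proposed Method

Replace SAM's ViT-H image encoder with a lightweight encoder trained by knowledge distillation from the ViT-H teacher.
Apply decoupled distillation to transfer knowledge to the small student encoder, keeping the original mask decoder frozen or only lightly fine-tuned.
Use an MSE loss that measures agreement between image embeddings, instead of the combination of focal loss and Dice loss used in related work.
Optionally fine-tune the mask decoder, while showing that decoupled distillation alone already aligns the student encoder well with the decoder.
Evaluate MobileSAM against the original SAM and FastSAM using mIoU and inference-speed metrics.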
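
The decoupled distillation step described above can be illustrated with a toy example. The sketch below is not the authors' training code: it assumes a hypothetical frozen "teacher" that is just a fixed linear map (`TEACHER_W`) standing in for the ViT-H image encoder, and a linear "student" trained with plain SGD on an MSE loss between the two embeddings. No mask decoder appears in the loop, which is the point of the decoupled setup. All names, dimensions, and hyperparameters are illustrative assumptions.

```python
import random

# Hypothetical frozen teacher: a fixed linear map standing in for the heavy encoder.
TEACHER_W = [[0.5, -1.0, 0.25],
             [1.5, 0.0, -0.75]]


def embed(weights, x):
    """Linear 'encoder': one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two embeddings of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def distill(steps=2000, lr=0.05, seed=0):
    """Train a student to match teacher embeddings with an MSE loss only."""
    rng = random.Random(seed)
    in_dim, out_dim = len(TEACHER_W[0]), len(TEACHER_W)
    student_w = [[0.0] * in_dim for _ in range(out_dim)]

    # Toy "images": random input vectors (assumption, for illustration only).
    data = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(32)]
    # The teacher is frozen, so its embeddings are computed once and reused.
    targets = [embed(TEACHER_W, x) for x in data]

    for _ in range(steps):
        idx = rng.randrange(len(data))
        x, target = data[idx], targets[idx]
        pred = embed(student_w, x)
        for j in range(out_dim):
            # d(MSE)/d(pred[j]) = 2 * (pred[j] - target[j]) / out_dim
            grad_out = 2.0 * (pred[j] - target[j]) / out_dim
            for i in range(in_dim):
                student_w[j][i] -= lr * grad_out * x[i]

    final_loss = sum(mse(embed(student_w, x), t)
                     for x, t in zip(data, targets)) / len(data)
    return student_w, final_loss


if __name__ == "__main__":
    weights, loss = distill()
    print("final embedding MSE:", loss)
```

Because the student is trained to reproduce the teacher's embedding space, a decoder built for the teacher's embeddings can in principle be reused unchanged, which is the compatibility property the review attributes to MobileSAM.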

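Experimental Results

Research Questions

RQ1: Can a lightweight image encoder distilled from the heavy SAM encoder match the segmentation quality of the original SAM?
RQ2: Is decoupled distillation better than coupled or semi-coupled distillation for training a lightweight SAM?
RQ3: How does MobileSAM compare with FastSAM on the segment-anything task in terms of accuracy (mIoU) and efficiency (parameter count, speed)?
RQ4: Can MobileSAM run efficiently on CPU for on-device use?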

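Key Findings

MobileSAM cuts encoder parameters by roughly 100x and total parameters by roughly 60x while performing on par with the original SAM.
On a single GPU, MobileSAM processes an image in about 10 ms as reported (8 ms for the image encoder and 4 ms for the mask decoder).
MobileSAM is about 5 times faster and about 7 times smaller than FastSAM, and performs better in the segment-anything setting.
In a preliminary experiment, decoupled distillation (training the lightweight encoder directly from the heavy teacher, without jointly training the decoder) reached a higher mIoU (0.75) than coupled distillation (0.72).
MobileSAM runs relatively smoothly on CPU, which makes deployment on mobile devices feasible.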
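
For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of how mean intersection-over-union can be computed for binary masks. It is a generic definition, not the paper's evaluation script: the review does not state which masks serve as the reference, and treating two empty masks as a perfect match (IoU 1.0) is an assumption made here.

```python
def iou(pred, target):
    """IoU between two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a list of (predicted_mask, reference_mask) pairs."""
    return sum(iou(pred, target) for pred, target in pairs) / len(pairs)


if __name__ == "__main__":
    half_overlap = ([[1, 1], [0, 0]], [[1, 0], [0, 0]])  # IoU = 1 / 2
    exact_match = ([[0, 1], [0, 1]], [[0, 1], [0, 1]])   # IoU = 2 / 2
    print(mean_iou([half_overlap, exact_match]))
```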


This review was generated by AI and checked by a human editor.