[Paper Review] MobileSAMv2: Faster Segment Anything to Everything
MobileSAMv2 speeds up SegEvery by using object-aware box prompts from YOLOv8. It cuts the number of prompts while maintaining competitive segmentation quality, and it is compatible with distilled image encoders.
Segment anything model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which utilizes a certain point to predict the mask for a single object of interest, and segment everything (SegEvery), which predicts the masks for all objects on the image. What makes SegAny slow for SAM is its heavyweight image encoder, which has been addressed by MobileSAM via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder because it needs to first generate numerous masks with redundant grid-search prompts and then perform filtering to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks with only valid prompts, which can be obtained through object discovery. Our proposed approach not only helps reduce the total time on the mask decoder by at least 16 times but also achieves superior performance. Specifically, our approach yields an average performance boost of 3.6% (42.5% vs. 38.9%) for zero-shot object proposal on the LVIS dataset with the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmenting things. This project targeting faster SegEvery than the original SAM is termed MobileSAMv2 to differentiate from MobileSAM which targets faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project: https://github.com/ChaoningZhang/MobileSAM
Research Motivation and Objectives
- Motivate faster SegEvery (segment everything) beyond the original SAM by reducing prompt redundancy.
- Propose an object-aware prompt sampling strategy using bounding boxes from an open-world detector.
- Demonstrate compatibility of object-aware prompts with distilled image encoders in MobileSAM.
- Show that the approach preserves or improves SegEvery performance while achieving higher efficiency.
Proposed Method
- Replace grid-search point prompts with object-aware box prompts sourced from an open-world detector (YOLOv8).
- Train YOLOv8 on open-world data (a subset of SA-1B) to produce bounding boxes and masks, then filter the detections with non-maximum suppression (NMS).
- Use the box prompts directly (or their centers) as prompts for the SAM mask decoder, avoiding multi-mask ambiguity.
- Prompt-guided mask decoding is performed in batches with the SAM decoder, but now uses far fewer prompts (max 320 boxes).
- Compare efficiency by separating prompt encoding (pre-sampling) and mask decoding (post-filtering), highlighting reduced time from fewer prompts.
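The sampling step in the list above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, and the IoU threshold of 0.7 are assumptions made for the example. Only the cap of 320 prompts comes from the review. The detector itself is left out, so the boxes and scores are supplied by hand.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_box_prompts(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_thresh: float = 0.7,   # assumed value, not taken from the paper
    max_prompts: int = 320,    # cap quoted in the review
) -> List[Box]:
    """Greedy NMS over detector boxes, keeping at most max_prompts boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) >= max_prompts:
            break
        # Keep a box only if it does not heavily overlap a higher-scoring one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]


def box_center(box: Box) -> Tuple[float, float]:
    """Center point of a box, usable as a point prompt instead of the box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


# Hand-made detections: the second box nearly duplicates the first.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 50, 260, 120)]
scores = [0.9, 0.8, 0.7]
prompts = sample_box_prompts(boxes, scores)
print(prompts)                 # [(10, 10, 110, 110), (200, 50, 260, 120)]
print(box_center(prompts[0]))  # (60.0, 60.0)
```

The surviving boxes would then be passed to the mask decoder as one batch, so decoder cost scales with the number of detected objects and not with the grid resolution.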
Experimental Results
Research Questions
- RQ1: Can object-aware box prompts accelerate SegEvery without sacrificing mask quality?
- RQ2: How does prompting with boxes compare to grid-search prompts in terms of AR@K on LVIS?
- RQ3: Is the object-aware prompting approach compatible with distilled image encoders in MobileSAM?
- RQ4: What is the trade-off between prompt quantity and segmentation accuracy for SegEvery?
Key Findings
| Sampling strategy | Prompt Encoding | Mask Decoding | Total |
|---|---|---|---|
| Grid-search sampling (32×32 prompts) | 16 ms | 1600 ms | 1616 ms |
| Grid-search sampling (64×64 prompts) | 64 ms | 6400 ms | 6464 ms |
| Object-aware sampling (max 320 prompts) | 47 ms | 50 ms | 97 ms |
| MobileSAMv2 (max 320 boxes) | 59 ms | 18 ms | 77 ms |
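The speedup implied by the table can be checked with simple arithmetic. The snippet below is only a worked calculation on the first three rows. The dictionary keys and helper names are made up for the example, and the timings are copied from the table as given, not re-measured.

```python
# (prompt encoding ms, mask decoding ms), copied from the table above.
timings_ms = {
    "grid 32x32": (16, 1600),
    "grid 64x64": (64, 6400),
    "object-aware": (47, 50),
}


def total_ms(name: str) -> int:
    """Prompt encoding plus mask decoding time for one sampling strategy."""
    return sum(timings_ms[name])


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster the candidate is than the baseline, end to end."""
    return total_ms(baseline) / total_ms(candidate)


print(round(speedup("grid 32x32", "object-aware"), 1))  # 16.7
print(round(speedup("grid 64x64", "object-aware"), 1))  # 66.6

# Prompt counts: a 32x32 grid has 1024 points, against at most 320 boxes.
print(32 * 32, 64 * 64)  # 1024 4096
```

The 32×32 grid is the smaller of the two baselines, so its ratio of about 16.7 is the one that matches the "at least 16x" figure.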
- Significant speedup: object-aware prompt sampling reduces total time of the mask decoder by at least 16x compared with grid-search sampling.
- Average mask AR@K on LVIS improves by 3.6 percentage points (42.5% with MobileSAMv2 vs 38.9% with baseline SAM).
- With 320 box prompts, MobileSAMv2 achieves 59.3% mask AR@1000 (vs 59.2% for the 64×64 grid) while using far fewer prompts.
- Box prompts yield higher-quality masks with less ambiguity than point prompts, reducing the need for heavy post-filtering.
- Using distilled image encoders (EfficientViT-L2) causes a modest performance drop but substantial speed gains (56.3% overall vs 59.2% with ViT-H).
- MobileSAMv2 is compatible with MobileSAM’s distilled encoders, enabling a unified framework for efficient SegAny and SegEvery.
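For readers unfamiliar with the AR@K metric cited in these findings, the sketch below shows the general idea for a single image, following the usual COCO-style convention of averaging recall over IoU thresholds from 0.50 to 0.95. It is a simplified illustration, not the LVIS evaluation code. Representing masks as sets of `(row, col)` pixels and the function names are assumptions made for brevity, and the real benchmark aggregates over the whole dataset.

```python
from typing import List, Set, Tuple

Mask = Set[Tuple[int, int]]  # pixels of a mask, an assumed toy representation


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_recall_at_k(gt_masks: List[Mask], proposals: List[Mask], k: int) -> float:
    """Single-image AR@K; proposals must be sorted by descending confidence."""
    if not gt_masks:
        return 0.0
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]  # 0.50, 0.55, ..., 0.95
    top_k = proposals[:k]
    # Best IoU that any of the top-K proposals reaches for each ground-truth mask.
    best = [max((mask_iou(g, p) for p in top_k), default=0.0) for g in gt_masks]
    # Recall at one threshold = fraction of ground-truth masks that are matched.
    recalls = [sum(b >= t for b in best) / len(gt_masks) for t in thresholds]
    return sum(recalls) / len(recalls)


# Two ground-truth objects of 4 pixels each.
gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6), (6, 5), (6, 6)}]
# First proposal matches object 1 exactly; second covers 3 of 4 pixels of object 2.
props = [set(gt[0]), {(5, 5), (5, 6), (6, 5)}]

print(average_recall_at_k(gt, props, k=1))  # 0.5
print(average_recall_at_k(gt, props, k=2))  # 0.8
```

Raising K can only add proposals, so AR@K never decreases with K. This is why AR@1000 figures such as those quoted above are higher than the averaged number.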
This review was generated by AI and checked by a human editor.