QUICK REVIEW

[Paper Review] MobileSAMv2: Faster Segment Anything to Everything

Chaoning Zhang, Dongshen Han|arXiv (Cornell University)|Dec 15, 2023

Advanced Neural Network Applications | Cited by 10

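One-Sentence Summary

MobileSAMv2 accelerates SegEvery by using object-aware box prompts from YOLOv8, which reduces the number of prompts while maintaining competitive segmentation quality, and it is compatible with distilled image encoders.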

ABSTRACT

Segment anything model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which utilizes a certain point to predict the mask for a single object of interest, and segment everything (SegEvery), which predicts the masks for all objects on the image. What makes SegAny slow for SAM is its heavyweight image encoder, which has been addressed by MobileSAM via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder because it needs to first generate numerous masks with redundant grid-search prompts and then perform filtering to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks with only valid prompts, which can be obtained through object discovery. Our proposed approach not only helps reduce the total time on the mask decoder by at least 16 times but also achieves superior performance. Specifically, our approach yields an average performance boost of 3.6% (42.5% vs. 38.9%) for zero-shot object proposal on the LVIS dataset with the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmenting things. This project targeting faster SegEvery than the original SAM is termed MobileSAMv2 to differentiate from MobileSAM which targets faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project: https://github.com/ChaoningZhang/MobileSAM.

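Research Motivation and Goals

Make SegEvery (segmenting everything) faster than the original SAM by reducing prompt redundancy.
Propose an object-aware prompt sampling strategy that uses bounding boxes from an open-world detector.
Demonstrate that object-aware prompts are compatible with the distilled image encoders in MobileSAM.
Show that the method maintains or improves SegEvery performance while being more efficient.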

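Proposed Method

Replace grid-search point prompts with object-aware box prompts from an open-world detector (YOLOv8).
Train YOLOv8 on open-world data (a subset of SA-1B) to produce bounding boxes and masks, and filter the detections with NMS.
Use the box prompts (or their center points) directly as prompts for the SAM mask decoder, avoiding multi-mask ambiguity.
Run prompt-guided mask decoding with the SAM decoder in batches, now with far fewer prompts (at most 320 boxes).
Compare efficiency by timing prompt encoding (which includes pre-sampling) separately from mask decoding (which includes post-filtering), highlighting the time saved by using fewer prompts.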
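
The prompt-selection steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`box_iou`, `object_aware_prompts`, `box_center`) are hypothetical, the greedy NMS is a generic version, and the IoU threshold of 0.7 is an assumed value that the summary does not specify. Only the cap of 320 prompts comes from the text.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def object_aware_prompts(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_thresh: float = 0.7,   # assumed value, not taken from the paper
    max_prompts: int = 320,    # cap mentioned in the summary
) -> List[Box]:
    """Greedy NMS over detector boxes, then cap the number of prompts."""
    # Visit boxes from highest to lowest detector confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
        if len(kept) == max_prompts:
            break
    return [boxes[i] for i in kept]


def box_center(box: Box) -> Tuple[float, float]:
    """Center point of a box, usable as a point prompt instead of the box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


if __name__ == "__main__":
    # Hypothetical detections: the second box nearly duplicates the first.
    boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 160)]
    scores = [0.9, 0.8, 0.7]
    prompts = object_aware_prompts(boxes, scores)
    print(prompts)
    print([box_center(b) for b in prompts])
```

The surviving boxes (or their centers) would then be passed to the mask decoder in a batch; that step depends on the SAM code base and is left out here.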

Experimental Results

Research Questions

RQ1: Can object-aware box prompts accelerate SegEvery without sacrificing mask quality?
RQ2: How does prompting with boxes compare to grid-search prompts in terms of AR@K on LVIS?
RQ3: Is the object-aware prompting approach compatible with distilled image encoders in MobileSAM?
RQ4: What is the trade-off between prompt quantity and segmentation accuracy for SegEvery?

Key Findings

Sampling strategy	Prompt Encoding	Mask Decoding	Total
Grid-search sampling (32×32 prompts)	16 ms	1600 ms	1616 ms
Grid-search sampling (64×64 prompts)	64 ms	6400 ms	6464 ms
Object-aware sampling (max 320 prompts)	47 ms	50 ms	97 ms
MobileSAMv2 (max 320 boxes)	59 ms	18 ms	77 ms
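
As a quick check on the table above, the short script below recomputes each total and the ratio of the grid-search totals to the object-aware total. The numbers are copied from the table; the variable names are illustrative only.

```python
# Timings in milliseconds, copied from the table: (prompt encoding, mask decoding).
timings_ms = {
    "grid 32x32": (16, 1600),
    "grid 64x64": (64, 6400),
    "object-aware (max 320 prompts)": (47, 50),
    "MobileSAMv2 (max 320 boxes)": (59, 18),
}

# Total time per strategy is encoding plus decoding.
totals = {name: enc + dec for name, (enc, dec) in timings_ms.items()}

# Ratio of each grid-search total to the object-aware total.
baseline = totals["object-aware (max 320 prompts)"]
speedups = {
    name: totals[name] / baseline for name in ("grid 32x32", "grid 64x64")
}

if __name__ == "__main__":
    for name, total in totals.items():
        print(f"{name}: {total} ms")
    for name, ratio in speedups.items():
        print(f"{name} vs object-aware: {ratio:.1f}x slower")
```

The 32×32 grid comes out at roughly 16.7 times the object-aware total, which matches the "at least 16x" figure quoted in the abstract.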

Significant speedup: object-aware prompt sampling cuts the total mask-decoder time by at least 16x compared with grid-search sampling (97 ms vs 1616 ms for the 32×32 grid).
Average LVIS AR@K improves by 3.6 percentage points (42.5% with MobileSAMv2 vs 38.9% with baseline SAM).
With 320 box prompts, MobileSAMv2 achieves 59.3% mask AR@1000 (vs 59.2% for the 64×64 grid) while using far fewer prompts.
Box prompts yield higher-quality masks with less ambiguity than point prompts, reducing the need for heavy post-filtering.
Using distilled image encoders (EfficientViT-L2) causes a modest performance drop but substantial speed gains (56.3% overall vs 59.2% with ViT-H).
MobileSAMv2 is compatible with MobileSAM’s distilled encoders, enabling a unified framework for efficient SegAny and SegEvery.
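
For readers unfamiliar with the AR@K metric used in these findings, here is a simplified sketch of average recall over the top K proposals for a single image. It assumes the common COCO/LVIS-style convention of averaging recall over IoU thresholds from 0.50 to 0.95 in steps of 0.05; the official LVIS evaluation toolkit handles many details (per-category handling, dataset-level aggregation) that are omitted here. Masks are represented as sets of pixel indices purely for illustration, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def mask_iou(a: Set[int], b: Set[int]) -> float:
    """IoU of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_recall_at_k(
    gt_masks: Sequence[Set[int]],
    proposals: Sequence[Set[int]],  # must be sorted by descending confidence
    k: int,
    thresholds: Optional[List[float]] = None,
) -> float:
    """Recall of ground-truth masks by the top-k proposals, averaged over IoU thresholds."""
    if not gt_masks:
        return 0.0
    if thresholds is None:
        # Assumed convention: 0.50, 0.55, ..., 0.95.
        thresholds = [t / 100 for t in range(50, 100, 5)]
    top_k = proposals[:k]
    # Best IoU achieved for each ground-truth mask by any top-k proposal.
    best = [max((mask_iou(g, p) for p in top_k), default=0.0) for g in gt_masks]
    # Fraction of ground-truth masks recalled at each threshold.
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Toy example: two ground-truth masks, one matched exactly and one missed.
    gt = [{1, 2, 3, 4}, {10, 11, 12}]
    props = [{1, 2, 3, 4}, {20, 21}]
    print(average_recall_at_k(gt, props, k=2))
```

Under this definition, a method can raise AR@K either by proposing masks that overlap more ground-truth objects or by fitting their boundaries more tightly, since tighter masks pass the stricter IoU thresholds.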


This review was generated by AI and reviewed by human editors.