QUICK REVIEW

[Paper Review] EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Yunyang Xiong, Bala Varadarajan|arXiv (Cornell University)|Dec 1, 2023

Advanced Neural Network Applications | Cited by 15

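One-Sentence Summary

EfficientSAM introduces SAMI, a masked image pretraining method that trains lightweight ViT encoders to reconstruct features from the SAM image encoder, yielding efficient SAM variants that stay competitive on segmentation tasks with substantially less computation and far fewer parameters.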

ABSTRACT

Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, the huge computation cost of SAM model has limited its applications to wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibits decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs, and finetune the models on SA-1B for segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything task such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.

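Research Motivation and Objectives

Reduce the compute and memory cost of the Segment Anything Model (SAM) without sacrificing segmentation performance.
Propose SAMI, a SAM-leveraged masked image pretraining framework that uses SAM features as the reconstruction target for lightweight encoders.
Show that SAMI-pretrained backbones generalize well across image classification, object detection, semantic segmentation, and segment-anything tasks.
Show that EfficientSAM (lightweight encoder + SAM decoder) achieves a favorable quality-efficiency trade-off on zero-shot and promptable segmentation tasks.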

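Proposed Method

Adapt Masked Autoencoder (MAE) pretraining so that the reconstruction target is the latent features of the SAM ViT-H encoder.
Use a cross-attention decoder whose queries come from the masked tokens and whose keys and values come from both the encoder outputs and the masked features.
Merge the decoder outputs for the masked tokens with the encoder outputs to form the MAE output, then apply a linear projection head to align it with the SAM features.
Pretrain on ImageNet-1K (224x224) with a 75% mask ratio for 400 epochs, using an MSE loss to minimize the reconstruction error between the SAM features and the MAE output.
Fine-tune the SAMI-pretrained lightweight encoders (e.g., ViT-Tiny/Small) on SA-1B with SAM's default decoder for the segment-anything task.
After pretraining, discard the MAE decoder and use the SAMI-pretrained encoder as the image backbone for downstream tasks (classification, detection, segmentation).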
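The pretraining steps listed above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the function names (`random_mask`, `cross_attention`, `merge_tokens`, `mse_loss`) are hypothetical, random vectors stand in for the SAM and lightweight-encoder features, the mask-token embeddings are zero vectors, the 196-token grid assumes 16x16 patches on a 224x224 input, and the linear projection head is left out because the toy features already share one dimension.

```python
import math
import random


def random_mask(num_tokens, mask_ratio=0.75, seed=0):
    """Randomly split token positions into (kept, masked) index lists."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w / norm * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def merge_tokens(kept, kept_feats, masked, masked_feats):
    """Put encoder outputs and decoder outputs back in original token order."""
    merged = [None] * (len(kept) + len(masked))
    for idx, feat in zip(kept, kept_feats):
        merged[idx] = feat
    for idx, feat in zip(masked, masked_feats):
        merged[idx] = feat
    return merged


def mse_loss(pred, target):
    """Mean squared error averaged over every token and feature dimension."""
    total = sum((p - t) ** 2 for pt, tt in zip(pred, target) for p, t in zip(pt, tt))
    return total / (len(pred) * len(pred[0]))


# Toy run with random stand-ins (assumptions, not real model outputs).
NUM_TOKENS, DIM = 196, 4
rng = random.Random(1)
teacher_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]  # stand-in for SAM features
kept, masked = random_mask(NUM_TOKENS)                                   # 75% of positions are masked
kept_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in kept]       # stand-in for lightweight encoder output
mask_queries = [[0.0] * DIM for _ in masked]                             # stand-in for mask-token embeddings
context = kept_feats + mask_queries                                      # keys/values: encoder outputs + masked tokens
decoded = cross_attention(mask_queries, context, context)
full_output = merge_tokens(kept, kept_feats, masked, decoded)
loss = mse_loss(full_output, teacher_feats)
print(len(kept), len(masked), loss > 0)
```

In a real setup the stand-in features would come from a frozen SAM ViT-H encoder and a trainable lightweight ViT, and the loss would be backpropagated through the lightweight encoder, the decoder, and the projection head.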

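Experimental Results

Research Questions

RQ1: Does SAMI pretraining improve the representation quality of lightweight ViT encoders compared with standard MAE and other pretraining baselines?
RQ2: Do SAMI-pretrained backbones generalize well across image classification, object detection, semantic segmentation, and segment-anything tasks?
RQ3: How does EfficientSAM (lightweight encoder + SAM decoder) perform on zero-shot and interactive segmentation compared with SAM, MobileSAM, and FastSAM?
RQ4: How do the reconstruction target, the loss, the mask ratio, and the fine-tuning steps affect downstream performance?
RQ5: In practical deployment, does EfficientSAM offer a favorable trade-off among model size, speed, and segmentation quality?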

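Key Findings

SAMI improves ImageNet-1K top-1 accuracy over MAE and several other baselines on ViT-Tiny/Small/Base (e.g., SAMI-B reaches 84.8% versus 83.6% for MAE-B).
On COCO object detection and instance segmentation, SAMI backbones beat their MAE counterparts in box AP and mask AP (e.g., SAMI-B 52.5/46.5 versus MAE-B 51.6/45.9).
On ADE20K semantic segmentation, SAMI backbones achieve higher mIoU than MAE backbones (e.g., SAMI-B 51.8 versus MAE-B 49.3).
EfficientSAM-Ti and EfficientSAM-S are competitive on zero-shot instance segmentation on COCO/LVIS; EfficientSAM-S reaches 60.1 AP on COCO and 62.3 AP on LVIS in the 1-2-3 click setting, and zero-shot single-point evaluation reaches up to 76.9 mIoU.
EfficientSAM-S (9.8M parameters) comes close to SAM on zero-shot instance segmentation, with AP only about 2 points lower, and EfficientSAM-Ti outperforms MobileSAM and FastSAM under several prompt settings.
Ablations show that the MSE reconstruction loss outperforms a cosine loss, that a high mask ratio (about 75%) is beneficial, and that using SAM features as anchors helps reconstruct the masked tokens.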
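To make the loss ablation concrete, here is a minimal sketch of the two reconstruction losses being compared. The exact cosine formulation used in the paper is not given in this review, so `cosine_loss` below (mean of one minus cosine similarity per token) is an assumption, and both function names are hypothetical. The toy example shows only one known difference between the two: cosine loss ignores feature magnitude, while MSE penalizes it. Whether that explains the ablation result is not something this sketch establishes.

```python
import math


def mse_loss(pred, target):
    """Mean squared error averaged over all tokens and feature dimensions."""
    total = sum((p - t) ** 2 for pt, tt in zip(pred, target) for p, t in zip(pt, tt))
    return total / (len(pred) * len(pred[0]))


def cosine_loss(pred, target, eps=1e-12):
    """Mean of (1 - cosine similarity) across tokens (assumed formulation)."""
    losses = []
    for pt, tt in zip(pred, target):
        dot = sum(p * t for p, t in zip(pt, tt))
        norm = math.sqrt(sum(p * p for p in pt)) * math.sqrt(sum(t * t for t in tt))
        losses.append(1.0 - dot / max(norm, eps))
    return sum(losses) / len(losses)


# Prediction has the right direction but twice the magnitude of the target.
target = [[1.0, 2.0], [3.0, 4.0]]
scaled = [[2.0 * x for x in tok] for tok in target]

mse_value = mse_loss(scaled, target)     # 7.5: the magnitude error is penalized
cos_value = cosine_loss(scaled, target)  # close to 0: the direction already matches
print(mse_value, cos_value)
```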

This review was generated by AI and reviewed by a human editor.