QUICK REVIEW

[Paper Review] Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu|arXiv (Cornell University)|Jan 25, 2024

Multimodal Machine Learning Applications|Cited by 88

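One-Sentence Summary

This paper introduces Segment Anything (SAM), a promptable segmentation model trained on SA-1B (1B masks over 11M images) that achieves strong zero-shot performance and supports interactive, real-time mask generation across diverse tasks.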

ABSTRACT

We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP and Recognize Anything. Additionally, incorporating Stable-Diffusion allows for controllable image editing, while the integration of OSX facilitates promptable 3D human motion analysis. Grounded SAM also shows superior performance on open-vocabulary benchmarks, achieving 48.7 mean AP on SegInW (Segmentation in the wild) zero-shot benchmark with the combination of Grounding DINO-Base and SAM-Huge models.
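
The pipeline the abstract describes (a text prompt goes to an open-set detector, and the resulting boxes become prompts for a segmenter) can be sketched as follows. This is a minimal sketch, not the authors' code: `grounded_segment`, `Detection`, `Segment`, the `box_threshold` default, and the toy `fake_detect` / `fake_segment` stand-ins are hypothetical names introduced for illustration. A real setup would place Grounding DINO and SAM behind the same two callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


@dataclass
class Detection:
    label: str
    box: Box
    score: float


@dataclass
class Segment:
    label: str
    box: Box
    score: float
    mask: object  # whatever the segmenter returns for this box


def grounded_segment(
    image,
    text_prompt: str,
    detect: Callable[[object, str], List[Detection]],
    segment: Callable[[object, Box], object],
    box_threshold: float = 0.3,  # assumed value, not taken from the paper
) -> List[Segment]:
    """Detect boxes for a text prompt, then segment inside each kept box."""
    detections = detect(image, text_prompt)
    kept = [d for d in detections if d.score >= box_threshold]
    return [Segment(d.label, d.box, d.score, segment(image, d.box)) for d in kept]


# Toy stand-ins so the sketch runs without any model weights (hypothetical).
def fake_detect(image, text_prompt):
    return [
        Detection(text_prompt, (1, 1, 3, 3), 0.9),
        Detection(text_prompt, (0, 0, 1, 1), 0.1),  # dropped by the threshold
    ]


def fake_segment(image, box):
    x0, y0, x1, y1 = (int(v) for v in box)
    height, width = len(image), len(image[0])
    # Fill the box with ones; a real segmenter would return an object-shaped mask.
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


image = [[0] * 4 for _ in range(4)]
results = grounded_segment(image, "dog", fake_detect, fake_segment)
```

Passing the detector and segmenter as callables mirrors the "assembling" idea in the title: either model can be swapped without touching the glue code.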

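Research Motivation and Objectives

Define a promptable segmentation task that enables zero-shot generalization across segmentation tasks.
Develop a lightweight, flexible model (SAM) that supports a variety of prompts and generates masks in real time.
Build a data engine that automatically constructs a large, diverse segmentation dataset (SA-1B).
Evaluate SAM's zero-shot transfer across multiple downstream tasks and data distributions.
Address responsible-AI considerations and bias in both the data and model performance.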

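Proposed Method

Propose a promptable segmentation task that returns a valid mask for any given prompt, so that prompting serves both pre-training and downstream use.
Design SAM with three components: a pre-trained image encoder, a flexible prompt encoder, and a fast mask decoder.
Make SAM ambiguity-aware by letting each prompt yield multiple masks, each with an associated confidence score.
Train SAM on a mix of sparse and dense prompts with a loss that combines focal loss and dice loss, adding multiple rounds of simulated prompts to reflect interactive use.
Build a data engine with assisted-manual, semi-automatic, and fully automatic stages that collects masks with the model in the loop.
Generate SA-1B automatically by applying the final, ambiguity-aware SAM to a 32x32 prompt grid on each of the 11M images, followed by a mask refinement step.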
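
Two pieces of the method above are easy to make concrete: the combined focal and dice loss, and the 32x32 grid of point prompts. The following is a minimal pure-Python sketch over flattened masks, not the training code. The function names, the `gamma` and `alpha` defaults, the `smooth` term, and the `focal_weight` / `dice_weight` values are assumptions for illustration; this summary does not state the weighting actually used.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over per-pixel foreground probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        p_t = p if t == 1 else 1.0 - p   # probability assigned to the true class
        a_t = alpha if t == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """One minus the soft Dice coefficient between prediction and target."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def mask_loss(probs, targets, focal_weight=1.0, dice_weight=1.0):
    """Weighted sum of focal and dice loss (weights are placeholder assumptions)."""
    return focal_weight * focal_loss(probs, targets) + dice_weight * dice_loss(probs, targets)


def point_grid(width, height, n_per_side=32):
    """Regular n x n grid of (x, y) point prompts centred in equal-sized cells."""
    xs = [(i + 0.5) * width / n_per_side for i in range(n_per_side)]
    ys = [(j + 0.5) * height / n_per_side for j in range(n_per_side)]
    return [(x, y) for y in ys for x in xs]


target = [1, 0, 1, 0]
good = mask_loss([0.9, 0.1, 0.8, 0.2], target)
bad = mask_loss([0.2, 0.8, 0.3, 0.9], target)
grid = point_grid(64, 64)
```

Focal loss down-weights pixels that are already classified well, while dice loss scores overlap at the level of the whole mask, so the two terms complement each other.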

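Experimental Results

Research Questions

RQ1: What task enables zero-shot generalization in segmentation?
RQ2: What model architecture supports promptable segmentation with real-time performance and ambiguity handling?
RQ3: What scale and diversity of data are needed to train a robust promptable segmentation model?
RQ4: Can a promptable segmentation model transfer effectively to downstream tasks through prompting?
RQ5: How does SAM perform across different datasets and distributions in the zero-shot setting?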

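Key Findings

SAM produces high-quality masks from a single foreground point, often close to the ground truth.
SAM shows strong zero-shot transfer on 23 segmentation datasets, often matching or outperforming specialized baselines.
SA-1B contains more than 1.1B masks on 11M images, far exceeding earlier datasets in both scale and diversity.
The data engine and its fully automatic stage scale mask generation without sacrificing quality (in sampled examples, the masks have high IoU against professional annotations).
Ambiguity-aware prompting yields multiple valid masks with confidence scores, which improves handling of ambiguous prompts.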
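
The findings above lean on two simple operations: measuring mask quality as IoU against a reference annotation, and resolving an ambiguous prompt by choosing among several candidate masks using their confidence scores. A minimal sketch on binary masks stored as nested lists follows. `mask_iou` and `pick_mask` are hypothetical helper names, and returning 0.0 for two empty masks is a convention chosen here rather than something the paper specifies.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += a & b
            union += a | b
    return intersection / union if union else 0.0


def pick_mask(candidates):
    """Return the (mask, confidence) pair with the highest confidence."""
    return max(candidates, key=lambda pair: pair[1])


# A point on a shirt could mean the shirt or the whole person: two valid masks.
whole = [[1, 1], [1, 1]]
part = [[1, 0], [0, 0]]
overlap = mask_iou(whole, part)
best_mask, best_score = pick_mask([(part, 0.6), (whole, 0.9)])
```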


This review was generated by AI and reviewed by a human editor.