QUICK REVIEW

[Paper Review] Segment Everything Everywhere All at Once

Xueyan Zou, Jianwei Yang | arXiv (Cornell University) | Apr 13, 2023

Multimodal Machine Learning Applications | Cited by 151

One-Sentence Summary

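SEEM is a promptable, interactive model that unifies several segmentation tasks (generic, referring, interactive, and video) behind a single universal interface, using a joint visual-semantic prompt space and memory prompts for iterative refinement.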

ABSTRACT

In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig.1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of two prompt types required for various segmentation tasks; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from decoder to image features; and iv) Semantic-awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. Notably, our single SEEM model achieves competitive performance across interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with minimum 1/100 supervision. Furthermore, SEEM showcases a remarkable capacity for generalization to novel prompts or their combinations, rendering it a readily universal image segmentation interface.

Motivation and Objectives

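Motivate the need for a universal segmentation interface that handles diverse prompts and tasks.
Propose a prompting scheme that encodes spatial queries, text, and memory history into a shared visual-semantic space.
Develop SEEM, a model built around a lightweight decoder that supports zero-shot prompt composition, interactivity, and open-set semantics.
Demonstrate that SEEM performs competitively on panoptic, instance, semantic, referring, interactive, and video segmentation tasks.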

Proposed Method

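Encode all prompt types (points, boxes, scribbles, masks, text, and referred regions) into a joint visual-semantic space via a visual sampler and a text encoder.
Use a Transformer-based encoder-decoder (SEEM-Decoder) that applies cross-attention between queries and multimodal prompts to produce mask and class embeddings.
Introduce memory prompts that carry segmentation history through mask-guided cross-attention, enabling interactive refinement.
Achieve compositional prompting by matching visual and text prompts against the output embeddings, which allows zero-shot combinations of prompts.
Train with a linear combination of the panoptic, referring, and interactive segmentation losses to learn unified prompts and outputs.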
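The mask-guided cross-attention and prompt-matching steps listed above can be sketched in a few lines. What follows is a toy, pure-Python illustration under stated assumptions, not SEEM's actual implementation: the function names (`masked_cross_attention`, `match_prompt`), the use of cosine similarity for matching, and the fallback to full attention when a mask is empty are all hypothetical choices made for this sketch.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(queries, keys, values, allowed):
    """Scaled dot-product attention restricted to allowed positions.

    queries: list of Q vectors of size d.
    keys, values: lists of N vectors (one per image position).
    allowed[i][j]: True when query i may attend to position j, for
    example because position j lies inside the previous-round mask.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q, row in zip(queries, allowed):
        # Assumption of this sketch: an empty mask falls back to
        # attending over every position, so the softmax stays defined.
        if not any(row):
            row = [True] * len(keys)
        scores = [dot(q, k) / scale if ok else float("-inf")
                  for k, ok in zip(keys, row)]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(dim)])
    return outputs


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def match_prompt(prompt_embedding, output_embeddings):
    """Index of the output embedding most similar to the prompt."""
    sims = [cosine(prompt_embedding, e) for e in output_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)
```

In this sketch, a memory prompt would act as a query whose `allowed` row is derived from the mask predicted in the previous round, so each refinement step only reads image features inside that mask. `match_prompt` then selects the output whose embedding lies closest to a visual or text prompt in the shared space.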

Experimental Results

Research Questions

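RQ1: Can SEEM serve as a single model for open-vocabulary generic segmentation, referring segmentation, and interactive segmentation?
RQ2: At inference time, does the joint visual-semantic prompt space enable effective composition of prompt types such as text, visual, and memory prompts?
RQ3: How do memory prompts affect the efficiency and accuracy of interactive segmentation over multiple rounds of interaction?
RQ4: How does SEEM perform on panoptic, instance, semantic, and video object segmentation relative to specialized models?
RQ5: How well does SEEM generalize to novel prompts or prompt combinations in zero-shot settings?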

Key Findings

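With limited supervision, SEEM performs competitively on panoptic, instance, semantic, referring, interactive, and video segmentation across 9 datasets.
Adding visual prompts and composing prompts yields a marked gain in referring segmentation accuracy, especially when prompts are combined.
Memory prompts enable history-aware mask refinement through lightweight decoding, improving interaction efficiency.
SEEM demonstrates zero-shot video object segmentation without any video-specific training, including interactive VOS on the DAVIS dataset.
SEEM outperforms several generalist or promptable baselines on interactive segmentation and shows strong open-vocabulary and cross-domain generalization.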

This review was generated by AI and reviewed by a human editor.