QUICK REVIEW

[Paper Review] Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models

Jielu Zhang, Zhongliang Zhou|arXiv (Cornell University)|Apr 20, 2023

Advanced Image and Video Retrieval Techniques|Cited by 31

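One-Sentence Summary

Text2Seg proposes a training-free pipeline that combines several visual foundation models (SAM, Grounding DINO, CLIP) so that text prompts guide SAM in semantic segmentation of remote sensing imagery across multiple datasets. It demonstrates qualitative improvements and discusses limitations arising from domain shift and class definitions.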

ABSTRACT

Remote sensing imagery has attracted significant attention in recent years due to its instrumental role in global environmental monitoring, land usage monitoring, and more. As image databases grow each year, performing automatic segmentation with deep learning models has gradually become the standard approach for processing the data. Despite the improved performance of current models, certain limitations remain unresolved. Firstly, training deep learning models for segmentation requires per-pixel annotations. Given the large size of datasets, only a small portion is fully annotated and ready for training. Additionally, the high intra-dataset variance in remote sensing data limits the transfer learning ability of such models. Although recently proposed generic segmentation models like SAM have shown promising results in zero-shot instance-level segmentation, adapting them to semantic segmentation is a non-trivial task. To tackle these challenges, we propose a novel method named Text2Seg for remote sensing semantic segmentation. Text2Seg overcomes the dependency on extensive annotations by employing an automatic prompt generation process using different visual foundation models (VFMs), which are trained to understand semantic information in various ways. This approach not only reduces the need for fully annotated datasets but also enhances the model's ability to generalize across diverse datasets. Evaluations on four widely adopted remote sensing datasets demonstrate that Text2Seg significantly improves zero-shot prediction performance compared to the vanilla SAM model, with relative improvements ranging from 31% to 225%. Our code is available at https://github.com/Douglas2Code/Text2Seg.
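
To make the reported range concrete: a relative improvement measures the gain against the baseline score, so a 225% improvement means the score rose to 3.25 times the baseline. The short sketch below only illustrates that arithmetic. The baseline value of 0.20 and both function names are invented for illustration and are not results or code from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def score_after(baseline: float, relative_gain_percent: float) -> float:
    """Invert the formula: the score implied by a given relative gain."""
    return baseline * (1.0 + relative_gain_percent / 100.0)


# Hypothetical baseline score, chosen only to illustrate the arithmetic.
baseline = 0.20
low = score_after(baseline, 31.0)    # roughly 0.262
high = score_after(baseline, 225.0)  # roughly 0.65
```

Read this way, the lower end of the range is a modest gain over vanilla SAM, while the upper end is more than a threefold score.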

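Motivation and Objectives

Motivate and explore how visual foundation models can be repurposed for remote sensing semantic segmentation with minimal task-specific fine-tuning.
Propose a prompt-engineering pipeline in which multiple foundation models (FMs) guide SAM in a text-guided setting.
Evaluate the pipeline on several remote sensing datasets to assess robustness and generalization across sensors, regions, and resolutions.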

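Proposed Method

Describe and integrate the visual foundation models (SAM, Grounding DINO, CLIP, CLIP Surgery) into a three-tier pipeline.
Use pre-SAM prompts (points and bounding boxes) from Grounding DINO and CLIP Surgery to constrain SAM's segmentation.
Use post-SAM filtering with CLIP to screen the masks SAM produces by their semantic similarity to the text prompt.
Test combinations of model inputs on different datasets: Grounding DINO+SAM, CLIPS+SAM, SAM+CLIP, Grounding DINO+CLIPS+SAM, and so on (CLIPS denotes CLIP Surgery).
Investigate SAM's generic segmentation with grid point prompts as a baseline, to assess segmentation boundaries in remote sensing.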
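
The flow described in this list (pre-SAM prompts, then SAM, then post-SAM filtering) can be sketched as a small composition of functions. This is a minimal illustration under stated assumptions, not the authors' implementation: every function and parameter name below is hypothetical, the four models are passed in as stand-in callables, and masks are represented as plain sets of pixel coordinates so the sketch runs without any model or library installed.

```python
from typing import Callable, List, Set, Tuple

Point = Tuple[int, int]            # (x, y) pixel coordinate
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max)
Mask = Set[Point]                  # pixels belonging to one segment


def grid_points(width: int, height: int, per_side: int) -> List[Point]:
    """Evenly spaced point prompts: the text-free baseline for SAM."""
    xs = [int((i + 0.5) * width / per_side) for i in range(per_side)]
    ys = [int((j + 0.5) * height / per_side) for j in range(per_side)]
    return [(x, y) for y in ys for x in xs]


def text_to_mask(
    image: object,
    text: str,
    detect_boxes: Callable[[object, str], List[Box]],     # stand-in for Grounding DINO
    locate_points: Callable[[object, str], List[Point]],  # stand-in for CLIP Surgery
    segment: Callable[[object, List[Box], List[Point]], List[Mask]],  # stand-in for SAM
    keep: Callable[[object, Mask, str], bool],            # stand-in for the CLIP filter
) -> Mask:
    """Pre-SAM prompts -> SAM -> post-SAM filter -> union into one class mask."""
    boxes = detect_boxes(image, text)
    points = locate_points(image, text)
    candidates = segment(image, boxes, points)
    result: Mask = set()
    for mask in candidates:
        if keep(image, mask, text):
            result |= mask
    return result


# Toy stand-ins so the sketch runs end to end; real models would replace them.
def fake_boxes(image: object, text: str) -> List[Box]:
    return [(0, 0, 2, 2)]


def fake_points(image: object, text: str) -> List[Point]:
    return [(3, 3)]


def fake_segment(image: object, boxes: List[Box], points: List[Point]) -> List[Mask]:
    box_masks = [{(x, y) for x in range(x0, x1) for y in range(y0, y1)}
                 for (x0, y0, x1, y1) in boxes]
    point_masks = [{p} for p in points]
    return box_masks + point_masks


def fake_keep(image: object, mask: Mask, text: str) -> bool:
    return len(mask) > 1  # pretend single-pixel masks fail the similarity check


building_mask = text_to_mask(None, "building", fake_boxes, fake_points,
                             fake_segment, fake_keep)
baseline_prompts = grid_points(width=4, height=4, per_side=2)
```

Passing the models in as callables mirrors the way the combinations in the list differ only in which prompt sources and filters are switched on: dropping a component amounts to supplying a stand-in that returns an empty list or always returns `True`.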

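Experimental Results

Research Questions

RQ1: Can multiple visual foundation models be combined effectively to guide SAM toward semantic segmentation in remote sensing without task-specific fine-tuning?
RQ2: Which combinations of pre-SAM and post-SAM prompts yield the most accurate semantic segmentation across different remote sensing datasets?
RQ3: How do domain characteristics of remote sensing data (such as color channels, resolution, and sensor) affect the performance of the text-guided FM pipeline?
RQ4: What are the limitations and failure modes of current FMs (SAM, Grounding DINO, CLIP) when applied to high-resolution remote sensing imagery?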

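Key Findings

In UAV and urban scenes, Grounding DINO + SAM tends to produce accurate but conservative segmentations.
Combining Grounding DINO, CLIP Surgery, SAM, and CLIP usually yields the most complete segmentation across datasets.
Performance varies by dataset and class: buildings, roads, and water are usually easier to segment than barren land, forest, or background classes.
The Vaihingen and Potsdam datasets respond markedly differently on tree segmentation because of sensor characteristics (such as near-infrared effects).
The pipeline shows promise in qualitative results but is limited on more abstract classes and on domain-specific color channels.
CLIP-based post-processing can filter SAM's results, but it may introduce errors depending on the text prompt and image characteristics.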
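
The last finding can be illustrated with a toy version of similarity-based mask filtering. The sketch assumes that each SAM mask crop and the text prompt have already been embedded into a shared vector space, as CLIP does, and that masks are kept when their cosine similarity to the text clears a threshold. The embeddings, mask names, and thresholds are invented values, and whether Text2Seg uses a fixed threshold or another selection rule is not stated in this review.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def filter_masks(
    mask_embeddings: Dict[str, Sequence[float]],
    text_embedding: Sequence[float],
    threshold: float,
) -> List[str]:
    """Keep the masks whose embedding is similar enough to the text prompt."""
    return [
        name
        for name, embedding in mask_embeddings.items()
        if cosine_similarity(embedding, text_embedding) >= threshold
    ]


# Invented 3-dimensional embeddings; real CLIP embeddings are far larger.
text_building = [1.0, 0.0, 0.0]
masks = {
    "roof": [0.9, 0.1, 0.0],
    "parking_lot": [0.6, 0.6, 0.0],
    "tree": [0.0, 0.2, 0.9],
}

strict = filter_masks(masks, text_building, threshold=0.9)  # keeps only "roof"
loose = filter_masks(masks, text_building, threshold=0.5)   # also keeps "parking_lot"
```

The two calls show the trade-off: a strict threshold may discard valid masks whose appearance differs from what the prompt evokes, while a loose one admits look-alike regions, which is one way such filtering can introduce errors.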

This review was generated by AI and reviewed by a human editor.