QUICK REVIEW

[Paper Review] GeoSAM: Fine-tuning SAM with Multi-Modal Prompts for Mobility Infrastructure Segmentation

Rafi Ibn Sultan, Chengyin Li|arXiv (Cornell University)|Nov 19, 2023

Advanced Neural Network Applications | Cited by 9

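One-Sentence Summary

GeoSAM fine-tunes SAM to segment mobility infrastructure in geographical imagery, using dense prompts from zero-shot SAM and sparse prompts from an in-domain CNN, and outperforms both zero-shot SAM and Tile2Net.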

ABSTRACT

In geographical image segmentation, performance is often constrained by the limited availability of training data and a lack of generalizability, particularly for segmenting mobility infrastructure such as roads, sidewalks, and crosswalks. Vision foundation models like the Segment Anything Model (SAM), pre-trained on millions of natural images, have demonstrated impressive zero-shot segmentation performance, providing a potential solution. However, SAM struggles with geographical images, such as aerial and satellite imagery, due to its training being confined to natural images and the narrow features and textures of these objects blending into their surroundings. To address these challenges, we propose Geographical SAM (GeoSAM), a SAM-based framework that fine-tunes SAM using automatically generated multi-modal prompts. Specifically, GeoSAM integrates point prompts from a pre-trained task-specific model as primary visual guidance, and text prompts generated by a large language model as secondary semantic guidance, enabling the model to better capture both spatial structure and contextual meaning. GeoSAM outperforms existing approaches for mobility infrastructure segmentation in both familiar and completely unseen regions by at least 5% in mIoU, representing a significant leap in leveraging foundation models to segment mobility infrastructure, including both road and pedestrian infrastructure in geographical images. The source code can be found in this GitHub Repository: https://github.com/rafiibnsultan/GeoSAM.

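Motivation and Objectives

Extend the Segment Anything Model (SAM) to geographical imagery for mobility infrastructure segmentation (road and pedestrian infrastructure).
Develop a fine-tuning pipeline built on parameter-efficient fine-tuning (PEFT) that uses sparse and dense prompts.
Generate prompts automatically from a domain-specific CNN encoder and from zero-shot SAM to improve segmentation of aerial imagery.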

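Proposed Method

Use SAM with a frozen encoder; fine-tune only the decoder via PEFT.
Automatically generate sparse prompts for the road and pedestrian classes from Tile2Net-based pseudo labels.
Generate dense prompts from zero-shot SAM by converting image feature embeddings into a form SAM can consume as dense prompts.
Train with the Dice Focal loss, which combines Dice loss and Focal loss to handle class imbalance.
End-to-end inference uses sparse prompts from the CNN encoder, dense prompts from zero-shot SAM, and the fine-tuned decoder.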
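The Dice Focal loss mentioned above can be illustrated with a minimal pure-Python sketch for a single binary class. This is not the authors' implementation: the function names, the equal weighting of the two terms, and the `gamma` and `alpha` defaults are assumptions chosen for illustration, since the summary does not state them.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2 * overlap / (sum of predictions + sum of targets)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
        p_t = p if t == 1 else 1.0 - p             # probability of the true class
        alpha_t = alpha if t == 1 else 1.0 - alpha
        losses.append(-alpha_t * (1.0 - p_t) ** gamma * math.log(p_t))
    return sum(losses) / len(losses)

def dice_focal_loss(probs, targets, lambda_dice=1.0, lambda_focal=1.0):
    """Weighted sum of the two terms (weights are assumed, not from the paper)."""
    return (lambda_dice * dice_loss(probs, targets)
            + lambda_focal * focal_loss(probs, targets))
```

The Dice term rewards region overlap regardless of how few foreground pixels exist, while the focal term down-weights pixels that are already classified confidently; that combination is why the pairing is commonly used for thin, sparse classes such as sidewalks.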

Figure 1: Training GeoSAM, an automated mobility infrastructure segmentation pipeline. In Prompts Generation (orange arrows), the model generates the sparse and dense prompts with the help of a secondary CNN-based geographical image encoder. Sparse prompts are generated automatically from the output

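Experimental Results

Research Questions

RQ1: Can SAM be adapted to multi-class mobility infrastructure segmentation in geographical imagery without full retraining?
RQ2: Does combining sparse prompts from a domain CNN with dense prompts from zero-shot SAM improve segmentation accuracy for road and pedestrian infrastructure?
RQ3: How effective is PEFT-based fine-tuning of the SAM decoder for this geospatial task?
RQ4: How well does GeoSAM generalize to unseen cities (for example, training on Washington DC and testing on Cambridge, MA)?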

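Key Findings

GeoSAM outperforms Tile2Net by 17% in mIoU and 21% in mAP on the Washington DC test set.
GeoSAM substantially outperforms zero-shot SAM in mIoU and mAP for both road and pedestrian infrastructure.
GeoSAM outperforms CNN- and ViT-based baselines (such as UNet++ and Swin UNETR) by a wide margin in mIoU and mAP for both the road and pedestrian classes.
Performance drops slightly on the held-out generalization city, Cambridge, MA, because of data distribution shift, but GeoSAM still outperforms the competing baselines overall.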
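For readers unfamiliar with the mIoU metric used in these comparisons, the following is a minimal sketch of how mean intersection-over-union is typically computed from per-pixel class labels. The function name `mean_iou` and the flat-list input format are assumptions for illustration; the paper's evaluation code may differ in details such as how absent classes are handled.

```python
def mean_iou(pred, target, class_ids):
    """Average the per-class IoU over classes that occur in pred or target.

    pred and target are flat sequences of per-pixel class ids.
    """
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Hypothetical labels: 0 = background, 1 = road, 2 = pedestrian infrastructure
pred = [0, 1, 1, 2]
target = [0, 1, 2, 2]
score = mean_iou(pred, target, class_ids=[0, 1, 2])
```

In this toy case the background is matched exactly while the road and pedestrian classes each overlap on one of two pixels, so the score averages one perfect IoU with two partial ones.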

Figure 2: Sparse prompts generated based on segmentation maps created by the pre-trained CNN image encoder. Here, the foreground class is the sidewalk/crosswalk; blue and red circles represent foreground and background clicks, respectively.
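The click prompts described in Figure 2 can be sketched as follows. This is only an illustration under stated assumptions: the helper name `sample_click_prompts`, the uniform random sampling strategy, and the number of clicks are hypothetical, and the source does not specify how GeoSAM selects click locations. The label convention (1 for a foreground click, 0 for a background click) follows SAM's point-prompt format.

```python
import random

def sample_click_prompts(mask, n_fg=3, n_bg=3, seed=0):
    """Sample (x, y) click prompts from a binary mask given as a list of rows.

    Returns coordinates plus labels: 1 = foreground click, 0 = background click.
    """
    rng = random.Random(seed)
    fg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v == 1]
    bg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v == 0]
    fg_pts = rng.sample(fg, min(n_fg, len(fg)))  # never request more than exist
    bg_pts = rng.sample(bg, min(n_bg, len(bg)))
    coords = fg_pts + bg_pts
    labels = [1] * len(fg_pts) + [0] * len(bg_pts)
    return coords, labels

# Toy pseudo-label mask: 1 marks pixels predicted as sidewalk/crosswalk
pseudo_mask = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
coords, labels = sample_click_prompts(pseudo_mask, n_fg=2, n_bg=2)
```

Sampling from a pseudo-label mask rather than from ground truth is what lets the prompts be produced automatically at inference time, at the cost of inheriting any errors in the CNN's segmentation map.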


This review was generated by AI and reviewed by a human editor.