QUICK REVIEW

[Paper Review] GeoSAM: Fine-tuning SAM with Multi-Modal Prompts for Mobility Infrastructure Segmentation

Rafi Ibn Sultan, Chengyin Li | arXiv (Cornell University) | November 19, 2023

Advanced Neural Network Applications | Citations: 9

One-line summary

GeoSAM fine-tunes SAM to segment road and pedestrian infrastructure in geographical imagery, using dense prompts from zero-shot SAM and sparse prompts from a domain-specific CNN, and it outperforms both zero-shot SAM and Tile2Net.

ABSTRACT

In geographical image segmentation, performance is often constrained by the limited availability of training data and a lack of generalizability, particularly for segmenting mobility infrastructure such as roads, sidewalks, and crosswalks. Vision foundation models like the Segment Anything Model (SAM), pre-trained on millions of natural images, have demonstrated impressive zero-shot segmentation performance, providing a potential solution. However, SAM struggles with geographical images, such as aerial and satellite imagery, due to its training being confined to natural images and the narrow features and textures of these objects blending into their surroundings. To address these challenges, we propose Geographical SAM (GeoSAM), a SAM-based framework that fine-tunes SAM using automatically generated multi-modal prompts. Specifically, GeoSAM integrates point prompts from a pre-trained task-specific model as primary visual guidance, and text prompts generated by a large language model as secondary semantic guidance, enabling the model to better capture both spatial structure and contextual meaning. GeoSAM outperforms existing approaches for mobility infrastructure segmentation in both familiar and completely unseen regions by at least 5% in mIoU, representing a significant leap in leveraging foundation models to segment mobility infrastructure, including both road and pedestrian infrastructure in geographical images. The source code can be found in this GitHub Repository: https://github.com/rafiibnsultan/GeoSAM.

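Research motivation and goals

Extend the Segment Anything Model (SAM) to geographical imagery for mobility infrastructure segmentation (road and pedestrian infrastructure).
Develop a fine-tuning pipeline driven by sparse and dense prompts, using parameter-efficient fine-tuning (PEFT).
Generate prompts automatically from a domain-specific CNN encoder and from zero-shot SAM to improve segmentation performance on aerial imagery.

Proposed method

Use SAM with a frozen encoder; fine-tune only the decoder via PEFT.
Automatically generate sparse prompts from Tile2Net-based pseudo-labels for the road and pedestrian classes.
Generate dense prompts by converting zero-shot SAM's image feature embeddings into SAM-compatible dense prompts.
Handle class imbalance with a Dice Focal loss, which combines Dice loss and Focal loss.
End-to-end inference uses sparse prompts from the CNN encoder, dense prompts from zero-shot SAM, and the fine-tuned decoder.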
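
The combined Dice Focal loss mentioned above can be illustrated with a minimal pure-Python sketch for the binary case. The equal weighting of the two terms and the focal parameters gamma = 2.0 and alpha = 0.25 are assumptions chosen for illustration (common defaults), not values taken from the paper, and all function names are hypothetical.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flat lists of probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss; gamma and alpha are assumed defaults."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)           # clamp to keep log() finite
        p_t = p if t == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if t == 1 else 1.0 - alpha
        # Easy pixels (p_t near 1) are down-weighted by (1 - p_t) ** gamma.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

def dice_focal_loss(probs, targets, lambda_dice=1.0, lambda_focal=1.0):
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    return (lambda_dice * dice_loss(probs, targets)
            + lambda_focal * focal_loss(probs, targets))
```

The Dice term scores overlap relative to region size, so thin structures such as sidewalks are not swamped by background pixels, while the focal term concentrates the per-pixel loss on hard examples.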

Figure 1: Training GeoSAM, an automated mobility infrastructure segmentation pipeline. In Prompts Generation (orange arrows), the model generates the sparse and dense prompts with the help of a secondary CNN-based geographical image encoder. Sparse prompts are generated automatically from the output

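Experimental results

Research questions

RQ1: Can SAM be adapted to multi-class mobility infrastructure segmentation in geographical imagery without full retraining?

Key findings

GeoSAM outperforms Tile2Net on the Washington DC test set by 17% in mIoU and 21% in mAP.
GeoSAM substantially outperforms zero-shot SAM in both mIoU and mAP for road and pedestrian infrastructure.
GeoSAM outperforms CNN- and ViT-based benchmarks (e.g., UNet++, Swin UNETR) by a wide margin in mIoU and mAP on both classes.
Performance drops somewhat in the generalization city (Cambridge, MA) because of data shift, but on average GeoSAM still outperforms the competing benchmarks.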
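
For readers unfamiliar with the mIoU metric reported in these findings, here is a minimal sketch of how it is commonly computed: per-class intersection over union, averaged across classes. The function name and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration; the paper's exact evaluation protocol is not described in this review.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over flat lists of integer class labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with 3 classes (e.g. background, road, pedestrian infrastructure).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # (1/3 + 2/3 + 1/2) / 3 = 0.5
```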

Figure 2: Sparse prompts generated based on segmentation maps created by the pre-trained CNN image encoder. Here, the foreground class is the sidewalk/crosswalk, blue and red circles represent foreground and background clicks respectively.
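
The click generation in Figure 2 can be sketched as sampling pixel coordinates from a binary pseudo-label mask. This is a minimal illustration assuming uniform random sampling and a fixed number of clicks per class; the review does not state the paper's actual sampling strategy, and the function name and parameters are hypothetical. The output uses (x, y) coordinates with label 1 for foreground and 0 for background, which matches the point-prompt convention SAM expects.

```python
import random

def sample_click_prompts(mask, n_fg=3, n_bg=3, seed=0):
    """Sample foreground/background clicks from a binary pseudo-label mask.

    mask: list of rows, 1 = foreground (e.g. sidewalk/crosswalk), 0 = background.
    Returns (points, labels): points are (x, y); labels are 1 (fg) or 0 (bg).
    """
    rng = random.Random(seed)
    fg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v == 1]
    bg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v == 0]
    # Uniform sampling is an assumption; cap the count at the available pixels.
    fg_pts = rng.sample(fg, min(n_fg, len(fg)))
    bg_pts = rng.sample(bg, min(n_bg, len(bg)))
    points = fg_pts + bg_pts
    labels = [1] * len(fg_pts) + [0] * len(bg_pts)
    return points, labels

# Toy 3x4 pseudo-label mask with a two-pixel-wide vertical strip of foreground.
toy_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
points, labels = sample_click_prompts(toy_mask, n_fg=2, n_bg=2)
```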

This review was generated by AI and reviewed by a human editor.