QUICK REVIEW

[Paper Review] Enabling Training-Free Text-Based Remote Sensing Segmentation

Jose Sosa, Danila Rukhovich|arXiv (Cornell University)|Feb 19, 2026

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

The paper introduces two training-free pipelines that combine pretrained vision-language models with SAM to perform open-vocabulary, referring, and reasoning-based remote sensing segmentation, achieving state-of-the-art zero-shot results and offering lightweight LoRA-tuned improvements for complex prompts.

ABSTRACT

Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM's grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree-rs-segmentation.

Research Motivation and Objectives

Determine how far text-based remote sensing segmentation can go without task-specific training by leveraging only pretrained foundation models.
Propose two pipelines that integrate contrastive and generative VLMs with SAM to cover OVSS, referring, and reasoning segmentation in a training-free setting.
Evaluate zero-shot and lightweight LoRA-tuned performance across a broad set of remote sensing benchmarks.
Show that a fully training-free contrastive VLM + SAM approach attains state-of-the-art OVSS results, and that a LoRA-tuned generative VLM + SAM pipeline achieves SOTA in referring and reasoning segmentation.

Proposed Method

Contrastive VLMs (e.g., CLIP) act as mask selectors over SAM’s grid-based proposals to achieve fully training-free open-vocabulary semantic segmentation (OVSS).
Generative VLMs (e.g., GPT-5, Qwen-VL) generate spatial prompts (clicks) for SAM to perform referring and reasoning-based segmentation; the VLM can be used zero-shot or LoRA-fine-tuned, with SAM kept frozen in both cases.
In the contrastive pipeline, zero-shot inference uses CLIP + SAM with no training at all; in the generative pipeline, GPT-5 is used zero-shot, and for improved performance a Qwen-VL model is LoRA-tuned to output click prompts while SAM stays frozen.
Training data for the generative VLM prompts is synthesized by converting ground-truth masks into click sequences via an iterative, interactive segmentation-inspired process.
For the generative VLM pipeline, a textual prompting scheme expresses positive/negative clicks to SAM, enabling flexible segmentation under complex prompts.
An ablation over SAM model scale and point-grid density finds that a 29x29 grid gives the best trade-off between accuracy and computation.
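
The contrastive pipeline in the list above can be sketched roughly as follows: lay a 29x29 grid of point prompts over the image, let SAM propose a mask per point, then label each mask with the class whose text embedding is most similar. This is only an illustrative sketch in plain Python. The function names (`grid_points`, `select_labels`) and the tiny hand-made 3-dimensional embeddings are assumptions for illustration; the review does not describe how the paper embeds each mask region with CLIP, so real CLIP and SAM calls are deliberately left out.

```python
import math

def grid_points(width, height, n=29):
    """Return n*n (x, y) point prompts placed at the centres of an n-by-n grid."""
    xs = [(i + 0.5) * width / n for i in range(n)]
    ys = [(j + 0.5) * height / n for j in range(n)]
    return [(x, y) for y in ys for x in xs]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_labels(mask_embeddings, text_embeddings, class_names):
    """Assign each mask proposal the class whose text embedding scores highest."""
    labels = []
    for emb in mask_embeddings:
        scores = [cosine(emb, t) for t in text_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append((class_names[best], scores[best]))
    return labels

# Toy stand-ins: in a real pipeline these vectors would come from CLIP's
# text encoder (one per class prompt) and image encoder (one per mask region).
points = grid_points(1024, 1024)          # 841 point prompts for SAM
class_names = ["building", "road", "water"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]
labels = select_labels(mask_embeddings, text_embeddings, class_names)
for name, score in labels:
    print(f"{name}: {score:.3f}")
```

A 29x29 grid means 841 point prompts per image, which is why grid density matters for the accuracy/computation trade-off mentioned in the ablation.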

Figure 2 : Inference schemes of our segmentation approaches with (a) contrastive and (b) generative VLMs.
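
For the generative pipeline, the review states that training targets are synthesized by turning ground-truth masks into click sequences through an iterative, interactive-segmentation-style process, and that clicks are expressed as text. The sketch below shows one plausible way to do this, under loud assumptions: the rule for placing the next click (the pixel nearest the centroid of the larger error region), the `toy_predictor` that stands in for SAM, and the `<pos>(x,y)` / `<neg>(x,y)` text format are all hypothetical and are not taken from the paper. The cap of six clicks mirrors the six training-time clicks mentioned in the findings.

```python
def next_click(gt, pred):
    """Pick one corrective click from the disagreement between gt and pred.

    Returns (x, y, is_positive), or None when the two masks already agree.
    """
    h, w = len(gt), len(gt[0])
    missed = [(x, y) for y in range(h) for x in range(w) if gt[y][x] and not pred[y][x]]
    spurious = [(x, y) for y in range(h) for x in range(w) if pred[y][x] and not gt[y][x]]
    if not missed and not spurious:
        return None
    # Correct the larger error region first: positive click on missed
    # foreground, negative click on wrongly predicted background.
    region, positive = (missed, True) if len(missed) >= len(spurious) else (spurious, False)
    cx = sum(x for x, _ in region) / len(region)
    cy = sum(y for _, y in region) / len(region)
    x, y = min(region, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return x, y, positive

def toy_predictor(clicks, height, width):
    """Hypothetical stand-in for SAM: a positive click fills a 3x3 patch,
    a negative click clears a single pixel. Clicks are applied in order."""
    pred = [[0] * width for _ in range(height)]
    for x, y, positive in clicks:
        if positive:
            for yy in range(max(0, y - 1), min(height, y + 2)):
                for xx in range(max(0, x - 1), min(width, x + 2)):
                    pred[yy][xx] = 1
        else:
            pred[y][x] = 0
    return pred

def synthesize_clicks(gt, max_clicks=6):
    """Simulate an annotator correcting the prediction one click at a time."""
    height, width = len(gt), len(gt[0])
    clicks = []
    pred = [[0] * width for _ in range(height)]
    for _ in range(max_clicks):
        click = next_click(gt, pred)
        if click is None:
            break
        clicks.append(click)
        pred = toy_predictor(clicks, height, width)
    return clicks

def clicks_to_text(clicks):
    """Serialize clicks into a text target a VLM could be trained to emit."""
    return " ".join(f"<{'pos' if p else 'neg'}>({x},{y})" for x, y, p in clicks)

# Toy 6x6 ground-truth mask with a 4x2 object.
gt = [[1 if 1 <= y <= 2 and 1 <= x <= 4 else 0 for x in range(6)] for y in range(6)]
clicks = synthesize_clicks(gt)
prompt = clicks_to_text(clicks)
print(prompt)
```

By construction, every positive click lands inside the ground-truth mask and every negative click lands outside it, which is the property a text target of this kind needs regardless of the exact placement rule.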

Experimental Results

Research Questions

RQ1: How far can text-based remote sensing segmentation be achieved without any additional trainable components beyond existing foundation models?
RQ2: Can a contrastive VLM + SAM pipeline achieve state-of-the-art zero-shot open-vocabulary segmentation on remote sensing data?
RQ3: Can a generative VLM + SAM pipeline handle referring and reasoning-based segmentation, and does lightweight LoRA fine-tuning improve performance while keeping SAM frozen?
RQ4: What are practical design choices (SAM scale, grid density, number of clicks) that maximize performance across diverse RS datasets?

Key Findings

Method	OEM	LoveDA	iSAID	Potsdam	Vaihingen	UAVid	UDD5	VDD	Avg.
SegEarth-OV [33]	40.3	36.9	21.7	48.5	40.0	42.5	50.6	45.3	39.2
Oracle	64.4	50.0	36.2	74.3	61.2	59.7	56.5	62.9	58.2
CLIP [50]	12.0	12.4	7.5	15.6	10.8	10.9	9.5	14.2	11.4
MaskCLIP [87]	25.1	27.8	14.5	33.9	29.9	28.6	32.4	32.9	27.2
SCLIP [64]	29.3	30.4	16.1	39.6	35.9	31.4	38.7	37.9	31.1
GEM [7]	33.9	31.6	17.7	39.1	36.4	33.4	41.2	39.5	32.3
ClearCLIP [29]	31.0	32.4	18.2	42.0	36.2	36.2	41.8	39.3	33.4
Ours	34.2	38.2	21.9	50.2	40.6	44.3	53.8	46.8	41.3
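
As a quick check on how the table reads, the per-dataset comparison between "Ours" and SegEarth-OV can be recomputed directly from the rows above. The snippet is a small illustrative sketch; the variable names are arbitrary and the numbers are copied from the table as listed.

```python
datasets = ["OEM", "LoveDA", "iSAID", "Potsdam", "Vaihingen", "UAVid", "UDD5", "VDD"]
segearth_ov = [40.3, 36.9, 21.7, 48.5, 40.0, 42.5, 50.6, 45.3]
ours = [34.2, 38.2, 21.9, 50.2, 40.6, 44.3, 53.8, 46.8]

# Datasets where "Ours" scores higher / lower than SegEarth-OV.
wins = [d for d, a, b in zip(datasets, ours, segearth_ov) if a > b]
losses = [d for d, a, b in zip(datasets, ours, segearth_ov) if a < b]

# Unweighted mean over the eight datasets for the "Ours" row.
mean_ours = sum(ours) / len(ours)

print("wins:", wins)
print("losses:", losses)
print(f"mean (Ours): {mean_ours:.2f}")
```

On these numbers, "Ours" is ahead on seven of the eight datasets and behind only on OEM.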

The contrastive VLM-based pipeline achieves state-of-the-art zero-shot OVSS: on the eight benchmarks in the table above, it has the highest average score among the non-oracle methods (41.3) and outperforms SegEarth-OV on every dataset except OEM.
On 9 single-class datasets, the contrastive method delivers competitive zero-shot performance and surpasses SegEarth-OV on several building/road/flood tasks.
The generative VLM-based pipeline, in zero-shot form, provides reasonable performance for referring and reasoning tasks, and LoRA fine-tuning with SAM frozen yields state-of-the-art results on RRSIS-D (referring) and EarthReason (reasoning).
Ablations show that larger SAM variants (SAM-Large) and a 29x29 grid give the best performance, and that using six training-time clicks for the generative VLMs significantly improves results.
Compared to task-specific training, the proposed training-free approach achieves strong generalisation across diverse RS modalities and geographies.

Figure 3 : Qualitative results of the training-free contrastive VLM pipeline on multi-class (first and second rows) and single-class (third row) OVSS tasks using remote sensing datasets.

This review was generated by AI and checked by a human editor.