Skip to main content
QUICK REVIEW

[论文解读] RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Fan Liu, Delong Chen|arXiv (Cornell University)|Jun 19, 2023
Advanced Computational Techniques and Applications被引用 17
一句话总结

RemoteCLIP 是一个面向遥感的视觉-语言基础模型,通过 Box-to-Caption 和 Mask-to-Box 转换来扩展数据规模,以对齐图像-文本表示,从而在不同数据集上实现零-shot 和检索任务。

ABSTRACT

General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12 $ imes$ larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $ extit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP

研究动机与目标

  • 解决遥感领域用于视觉-语言模型的预训练数据稀缺问题。
  • 学习具备丰富语义的鲁棒视觉特征,并获得与之对齐的文本嵌入,以用于下游任务。
  • 通过统一的图像-标题数据格式,在遥感领域实现零-shot、少样本和基于检索的应用。
  • 演示数据扩展技术,以利用异构注释进行大规模预训练。

提出的方法

  • 使用 CLIP 风格的对比预训练,包含图像编码器和文本编码器,以优化 InfoNCE 损失。
  • 通过 Box-to-Caption (B2C) 生成和 Mask-to-Box (M2B) 转换来统一异构遥感注释,以创建图像-文本对。
  • 通过整合 DET-10、SEG-4、RET-3 和 UAV 图像,将预训练数据规模扩展到原有开源数据集的 12 倍。
  • 在 16 个下游数据集上,评估 RemoteCLIP 在 zero-shot、线性探针、k-NN、少样本以及图像-文本检索任务上的表现。
  • 支持多种骨干网络规模(ResNet-50、ViT-Base-32、ViT-Large-14),以展示模型与数据规模的影响。
  • 提供一个新的 RemoteCount 基准,用于遥感中的目标计数。
Figure 1 : Averaged mean recall on three remote sensing image-text retrieval benchmarks: RSITMD, RSICD, and UCM ( RET-3 ). Key findings: (1) Zero-shot retrieval of large CLIP models ( e.g., ViT-G-14) outperforms all previous models specifically designed for remote sensing retrieval, except for the m
Figure 1 : Averaged mean recall on three remote sensing image-text retrieval benchmarks: RSITMD, RSICD, and UCM ( RET-3 ). Key findings: (1) Zero-shot retrieval of large CLIP models ( e.g., ViT-G-14) outperforms all previous models specifically designed for remote sensing retrieval, except for the m

实验结果

研究问题

  • RQ1在大规模统一的遥感图像-标题数据上训练的视觉-语言基础模型,是否能在多样化的下游任务中超越领域特定基线?
  • RQ2通过注释统一(B2C 和 M2B)的数据扩展,是否能实现更强的语义对齐和更好的零-shot/少样本性能?
  • RQ3不同的骨干规模如何影响 RemoteCLIP 在遥感中的分类和检索任务的性能?
  • RQ4在遥感领域对同领域数据进行持续预训练对基于 CLIP 的模型是否有益?
  • RQ5单一模型是否能够在多种遥感模态(卫星和无人机)和多种任务(计数、检索、分类)上达到出色表现?

主要发现

  • RemoteCLIP 的零-shot 分类性能在 12 个下游数据集上,平均准确率最高比 CLIP 基线提升至 6.39%。
  • 持续预训练(CLIP-CP)显著提升检索基准,在 RSITMD、RSICD 和 UCM 上创造了新的最先进水平。
  • 数据扩展到 RET-3/SEG-4/DET-10+ 数据集的 12 倍,相比较小尺度的持续预训练带来显著收益。
  • RemoteCLIP 在 16 个遥感数据集上,跨越从 ResNet-50 到 ViT-Large-14 的任务和模型规模,表现优于基线基础模型。
  • RSITMD 和 RSICD 的提升分别为平均召回率提高 9.14% 和 8.92%,相对于相关基线。
  • 该模型在零-shot、线性探针、k-NN、少样本分类以及图像-文本检索等任务上展现出多样性能力。
Figure 2 : Overview of the RemoteCLIP pipeline. Step 1 : RemoteCLIP is trained on a diverse collection of remote sensing datasets, covering 10 object detection datasets ( DET-10 , 6 of them are satellite imaginary datasets and 4 of them are UAV datasets), 4 remote sensing semantic segmentation datas
Figure 2 : Overview of the RemoteCLIP pipeline. Step 1 : RemoteCLIP is trained on a diverse collection of remote sensing datasets, covering 10 object detection datasets ( DET-10 , 6 of them are satellite imaginary datasets and 4 of them are UAV datasets), 4 remote sensing semantic segmentation datas

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。