QUICK REVIEW

[Paper Review] Vision-Language Models in Remote Sensing: Current Progress and Future Trends

Xiang Li, Congcong Wen|arXiv (Cornell University)|May 9, 2023

Multimodal Machine Learning Applications|Cited by 8

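One-Sentence Summary

A comprehensive review of vision-language models (VLMs) in remote sensing that surveys current progress on remote sensing tasks and proposes future research directions for bridging visual and semantic understanding.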

ABSTRACT

The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in the field of large language models for Artificial General Intelligence (AGI). These models provide intelligent solutions close to human thinking, enabling us to use general artificial intelligence to solve problems in various applications. However, in remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in remote sensing primarily focuses on visual understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-language models excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond visual recognition of RS images, model semantic relationships, and generate natural language descriptions of the image. This makes them better suited for tasks requiring visual and textual understanding, such as image captioning, and visual question answering. This paper provides a comprehensive review of the research on vision-language models in remote sensing, summarizing the latest progress, highlighting challenges, and identifying potential research opportunities.

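Research Motivation and Objectives

Review the evolution from vision-only models to vision-language models in remote sensing.
Summarize VLM applications in remote sensing tasks such as image captioning, text-based image generation, text-based image retrieval, visual question answering (VQA), scene classification, semantic segmentation, and object detection.
Discuss foundation models and pre-training strategies for remote sensing data.
Identify challenges and propose future research directions for remote sensing VLMs.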

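Proposed Approach

Categorize VLM architectures into fusion-encoder and dual-encoder paradigms and describe their interaction mechanisms.
Explain foundation model concepts and pre-training strategies relevant to remote sensing, covering both supervised and self-supervised methods.
Summarize representative remote-sensing-specific VLM methods and their task applications from the existing literature.
Highlight the role of large language models and vision transformers in shaping remote sensing VLMs.
Synthesize the challenges and opportunities for the future development of remote sensing VLMs.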
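
The fusion-encoder versus dual-encoder distinction can be illustrated with a toy sketch. This is not code from the paper: the function names, the two-dimensional embeddings, and the single cross-attention step are all assumptions chosen for illustration. A dual encoder pools each modality independently and compares the results only at the end, while a fusion encoder lets text tokens attend to image patches before scoring.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def mean_pool(vectors):
    # Average a list of equal-length vectors into a single vector.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dual_encoder_score(image_patches, text_tokens):
    # Each modality is pooled on its own; the only cross-modal
    # interaction is the final cosine similarity.
    return cosine(mean_pool(image_patches), mean_pool(text_tokens))

def cross_attend(queries, keys_values):
    # Scaled dot-product attention: every query token takes a weighted
    # average over the tokens of the other modality.
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append([
            sum(w * v[i] for w, v in zip(weights, keys_values))
            for i in range(len(q))
        ])
    return attended

def fusion_encoder_score(image_patches, text_tokens):
    # Text tokens attend to image patches first, so token-level
    # interaction happens before pooling and scoring.
    attended = cross_attend(text_tokens, image_patches)
    return cosine(mean_pool(attended), mean_pool(text_tokens))

# Hypothetical toy embeddings: two image patches and one text token.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0]]
dual = dual_encoder_score(patches, tokens)
fusion = fusion_encoder_score(patches, tokens)
print(f"dual={dual:.3f} fusion={fusion:.3f}")
```

In this toy setup the dual-encoder image vector does not depend on the text, so it could be computed once and cached for retrieval, whereas the fusion score must be recomputed for every image-text pair. That is one plausible reading of the efficiency trade-off between the two paradigms.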

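Experimental Results

Research Questions

RQ1: What is the state-of-the-art performance of current vision-language models on key remote sensing (RS) tasks?
RQ2: How do fusion-encoder and dual-encoder VLM architectures compare in RS applications?
RQ3: Which foundation model strategies (supervised vs. self-supervised) are most effective on RS data?
RQ4: What are the main limitations hindering the deployment of RS VLMs, and what future directions are proposed?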

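Key Findings

Vision-language models can reason about objects and their relationships in remote sensing images, rather than only performing simple object recognition.
The remote sensing tasks covered include image captioning, text-based image generation, text-based image retrieval, VQA, scene classification, semantic segmentation, and object detection.
Remote sensing foundation models increasingly adopt techniques such as self-supervised learning and masked image modeling to exploit unlabeled data.
Fusion-encoder and dual-encoder VLM architectures each involve trade-offs between interaction modeling and efficiency.
Several remote-sensing-specific datasets and benchmarks have driven progress, and foundation models such as RingMo, CLIP-style methods, and BLIP-2 are regarded as representative work.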
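
Masked image modeling, mentioned above as a way to exploit unlabeled data, can be sketched generically as follows. This is an assumption-laden illustration rather than the procedure of RingMo or any other specific model: the patch size, the 50% mask ratio, and the "fill with the mean of visible patches" stand-in predictor are all hypothetical. The key idea it shows is that the reconstruction loss is computed only on the patches that were hidden.

```python
import random

def patchify(image, patch):
    # Split an H x W image (list of rows) into flattened, non-overlapping
    # patch x patch blocks; H and W are assumed divisible by patch.
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            patches.append([
                image[r + i][c + j]
                for i in range(patch)
                for j in range(patch)
            ])
    return patches

def random_mask(num_patches, mask_ratio, rng):
    # Choose which patches to hide; True means the patch is masked.
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def masked_mse(pred, target, mask):
    # Mean squared error over the masked patches only.
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += len(t)
    return total / count if count else 0.0

# Toy 4 x 4 single-band "image" with pixel values 0..15.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
mask = random_mask(len(patches), mask_ratio=0.5, rng=random.Random(0))

# Stand-in for a learned model: fill every masked patch with the
# per-position mean of the visible patches.
visible = [p for p, m in zip(patches, mask) if not m]
fill = [sum(col) / len(visible) for col in zip(*visible)]
pred = [fill if m else p for p, m in zip(patches, mask)]

loss = masked_mse(pred, patches, mask)
print(f"masked patches: {sum(mask)} of {len(patches)}, loss={loss:.2f}")
```

In a real pre-training setup the stand-in predictor would be replaced by an encoder-decoder network trained to minimize this loss, which requires no labels because the targets are the original pixels.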

This review was generated by AI and checked by a human editor.