QUICK REVIEW

[Paper Review] Large Language Models for Captioning and Retrieving Remote Sensing Images

João Daniel Silva, João Avelar Magalhães|arXiv (Cornell University)|Feb 9, 2024

Multimodal Machine Learning Applications | Cited by 14

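ONE-SENTENCE SUMMARY

RS-CapRet pairs a frozen large language model (LLM) with a vision encoder adapted to remote sensing and simple linear projections to caption remote sensing images and perform text-image retrieval, achieving state-of-the-art or competitive results on several RS datasets.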

ABSTRACT

Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. In connection to remote sensing imagery, these tasks can help non-expert users in extracting relevant Earth observation information for a variety of applications. Still, despite some previous efforts, the development and application of vision and language models to the remote sensing domain have been hindered by the relatively small size of the available datasets and models used in previous studies. In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval. We specifically propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge together the image encoder and language decoder, we propose training simple linear layers with examples from combining different remote sensing image captioning datasets, keeping the other parameters frozen. RS-CapRet can then generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving SOTA or competitive performance with existing methods. Qualitative results illustrate that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images, retrieve them based on different types of queries, and also show the ability to process interleaved sequences of images and text in a dialogue manner.

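MOTIVATION AND OBJECTIVES

Apply vision-and-language (V&L) models to the remote sensing domain to democratize access to Earth observation information.
Develop a simple, memory-efficient V&L model for remote sensing by freezing the LLM and the vision encoder and training only lightweight projection layers.
Support both image captioning and text-image retrieval within a single framework.
Show that LLMs can describe RS images and handle interleaved image and text inputs in an interactive, dialogue-style manner.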

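PROPOSED METHOD

Use a frozen large language model (LLM) to generate captions for remote sensing images.
Adapt a CLIP-based vision encoder to remote sensing data through fine-tuning, and use it to produce image embeddings.
Learn simple linear projection layers that map image embeddings into the LLM input space and into a shared retrieval space.
Introduce a special [RET] token and enable text-image retrieval through contrastive learning between image embeddings and the [RET] token embedding.
Train jointly on the captioning and contrastive retrieval objectives with the weighted loss L = λ_c L_c + λ_r (L_t2i + L_i2t), where L_c is the captioning loss and L_t2i and L_i2t are the text-to-image and image-to-text contrastive losses.
Keep most parameters frozen during training and update only the newly added linear layers and the [RET] token embedding, which reduces memory use and training cost.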
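
The joint objective can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation. It assumes that the two retrieval terms are symmetric InfoNCE-style losses over cosine similarities (text-to-image and image-to-text). The helper names (`linear`, `info_nce`, `joint_loss`), the temperature of 0.07, the default weights of 1.0, and the toy vectors are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def linear(weights, vec):
    """Bias-free linear projection; `weights` is a list of output rows."""
    return [dot(row, vec) for row in weights]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)

def joint_loss(caption_loss, ret_embs, img_embs, lambda_c=1.0, lambda_r=1.0):
    """L = lambda_c * L_c + lambda_r * (L_t2i + L_i2t); weights are assumed values."""
    t = [normalize(e) for e in ret_embs]
    v = [normalize(e) for e in img_embs]
    l_t2i = info_nce(t, v)  # each [RET] embedding should pick its own image
    l_i2t = info_nce(v, t)  # each image should pick its own [RET] embedding
    return lambda_c * caption_loss + lambda_r * (l_t2i + l_i2t)

# Toy data: two images with 3-d encoder features projected to a 2-d retrieval space.
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # hypothetical projection weights
img_embs = [linear(W_img, f) for f in image_feats]
ret_embs = [[2.0, 0.1], [0.1, 3.0]]           # hypothetical [RET] embeddings
print(joint_loss(caption_loss=2.0, ret_embs=ret_embs, img_embs=img_embs))
```

In a real training loop only `W_img`, the projection into the LLM input space, and the [RET] embedding would receive gradients, while the LLM and the vision encoder would stay frozen.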

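EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a frozen LLM combined with a remote-sensing-specific vision encoder generate accurate captions for RS images?
RQ2: Can a simple projection bridge between image embeddings and the LLM input effectively support cross-modal retrieval on RS data?
RQ3: Does fine-tuning the vision encoder on the Cap-4 data outperform zero-shot or other baselines and improve captioning and retrieval performance?
RQ4: Can a single RS-CapRet model achieve competitive performance across multiple RS captioning datasets (NWPU-Captions, RSICD, Sydney-Captions, UCM-Captions)?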

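Key Findings

RS-CapRet achieves competitive or state-of-the-art results on RS captioning and retrieval benchmarks across multiple datasets.
Fine-tuning the vision encoder on the Cap-4 data improves retrieval over zero-shot CLIP variants.
The method supports interleaved image-text dialogue, indicating that the model can describe content and reason over sequences of images and text.
A retrieval token [RET] trained with contrastive learning enables effective text-to-image and image-to-text retrieval by aligning image and [RET] embeddings in a shared space.
Using a CLIP-based backbone (CLIP-Cap-4) with LLamaV2 as the language model yields strong performance on several RS captioning datasets.
Training keeps the LLM and the vision encoder frozen and updates only the lightweight projection layers and the [RET] token embedding, reducing memory and compute cost.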
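
Once image and [RET] embeddings share a space, retrieval reduces to nearest-neighbour ranking. The sketch below illustrates that idea with cosine similarity and scores it with Recall@K, a common retrieval metric. This summary does not state which metrics the paper reports, so the metric choice, the function names, and the toy embeddings are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def rank_images(query_emb, image_embs):
    """Return image indices sorted from most to least similar to the query."""
    scores = [cosine(query_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(query_embs, image_embs, targets, k):
    """Fraction of queries whose target image appears among the top-k results."""
    hits = 0
    for query, target in zip(query_embs, targets):
        if target in rank_images(query, image_embs)[:k]:
            hits += 1
    return hits / len(query_embs)

# Hypothetical 2-d embeddings: three images and three text queries ([RET] outputs).
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
targets = [0, 1, 2]  # index of the correct image for each query

print(rank_images(queries[0], images))
print(recall_at_k(queries, images, targets, k=1))
print(recall_at_k(queries, images, targets, k=2))
```

Image-to-text retrieval follows the same pattern with the roles of the query and candidate embeddings swapped.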

This review was generated by AI and checked by a human editor.