QUICK REVIEW

[Paper Review] SpectralGPT: Spectral Remote Sensing Foundation Model

Danfeng Hong, Bing Zhang|arXiv (Cornell University)|Nov 13, 2023

Remote-Sensing Image Classification|Cited by 60

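One-Sentence Summary

SpectralGPT is a foundation model for spectral remote sensing, built on a 3D generative pretrained transformer, trained on more than one million Sentinel-2 images, and evaluated on scene classification, semantic segmentation, and change detection. It achieves state-of-the-art results and supports progressive training across different image sizes and datasets.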

ABSTRACT

The foundation model has recently garnered significant attention due to its potential to revolutionize the field of visual representation learning in a self-supervised manner. While most foundation models are tailored to effectively process RGB images for various visual tasks, there is a noticeable gap in research focused on spectral data, which offers valuable information for scene understanding, especially in remote sensing (RS) applications. To fill this gap, we created for the first time a universal RS foundation model, named SpectralGPT, which is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT). Compared to existing foundation models, SpectralGPT 1) accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS big data; 2) leverages 3D token generation for spatial-spectral coupling; 3) captures spectrally sequential patterns via multi-target reconstruction; 4) trains on one million spectral RS images, yielding models with over 600 million parameters. Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS big data applications within the field of geoscience across four downstream tasks: single/multi-label scene classification, semantic segmentation, and change detection.

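Research Motivation and Goals

Fill the gap in foundation models designed for spectral remote sensing data.
Develop a 3D-masked, transformer-based pretraining framework that captures spatial-spectral coupling and spectrally sequential patterns.
Enable progressive pretraining across diverse remote sensing datasets and image sizes for robust generalization.
Demonstrate improvements over the state of the art on single-label and multi-label classification, semantic segmentation, and change detection.
Introduce SegMunich, a new urban semantic segmentation benchmark dataset for remote sensing tasks.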

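Proposed Method

Introduce SpectralGPT, a foundation model based on a 3D masked autoencoder and tailored to spectral remote sensing data within an MAE-style framework.
Apply 3D tensor masking with a 90% mask ratio to the H×W×D data cube in order to model spatial-spectral tokens.
Use an encoder to learn spatial-spectral representations from the visible tokens, and a lightweight decoder for multi-target reconstruction (token-to-token and spectral-to-spectral).
Pretrain progressively on a large Sentinel-2-based corpus (more than 1M images), moving across datasets that differ in size, resolution, time series, and region.
Use two learnable positional embeddings (spatial and spectral) and a ViT-based backbone with 8×8×3 tokenization; train with AdamW and cosine annealing for 200 epochs on fMoW-S2 and then 100 epochs on BigEarthNet-S2.
Fine-tune and evaluate the pretrained SpectralGPT and SpectralGPT+ on downstream tasks: single-label classification on EuroSAT (accuracy), multi-label classification on BigEarthNet-S2 (macro/micro mAP), semantic segmentation (OA and mIoU), and change detection (precision/recall/F1).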
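
The 8×8×3 tokenization and 90% masking described above can be illustrated with a small standard-library sketch. It only counts tokens and picks which ones to hide; it is not the authors' implementation. The 96×96 image with 12 bands, the uniform random sampling, and the function names `token_grid`, `token_origins`, and `random_mask` are all assumptions made for illustration.

```python
import random


def token_grid(height, width, bands, patch=(8, 8, 3)):
    """Number of tokens along each axis when the cube is cut into 3D patches."""
    ph, pw, pd = patch
    if height % ph or width % pw or bands % pd:
        raise ValueError("cube dimensions must be divisible by the patch size")
    return height // ph, width // pw, bands // pd


def token_origins(grid, patch=(8, 8, 3)):
    """(row, col, band) offset of the first voxel of every token, in raster order."""
    gh, gw, gd = grid
    ph, pw, pd = patch
    return [(i * ph, j * pw, k * pd)
            for i in range(gh) for j in range(gw) for k in range(gd)]


def random_mask(num_tokens, mask_ratio=0.9, seed=0):
    """Split token indices into (visible, masked) by uniform random sampling."""
    num_masked = round(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    random.Random(seed).shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical input: a 96x96 image with 12 spectral bands (an assumption).
grid = token_grid(96, 96, 12)
origins = token_origins(grid)
visible, masked = random_mask(len(origins))
print(grid, len(origins), len(visible), len(masked))
# prints: (12, 12, 4) 576 58 518
```

Under these assumptions the encoder would see only 58 of the 576 spatial-spectral tokens, which is why the decoder can stay lightweight: most of the cube has to be reconstructed from a small visible subset.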

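Experimental Results

Research Questions

RQ1: Can a 3D-masked generative pretraining framework capture spatial-spectral coupling and spectrally sequential information in spectral remote sensing data?
RQ2: Does progressive pretraining on different spectral remote sensing datasets improve downstream performance and generalization?
RQ3: How does SpectralGPT compare with RGB-oriented foundation models and earlier spectral pretraining methods on remote sensing benchmarks?
RQ4: How do model size (Base/Large/Huge) and the masking strategy affect downstream remote sensing tasks?
RQ5: Can the new SegMunich benchmark advance semantic segmentation research in urban scenes?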

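Key Findings

SpectralGPT/Base reaches 99.15% accuracy on EuroSAT when pretrained on fMoW-S2; pretraining on fMoW-S2 plus BigEarthNet raises accuracy to 99.21%.
For single-label classification on EuroSAT, SpectralGPT outperforms the ResNet50, SeCo, ViT, and SatMAE baselines under comparable settings.
On BigEarthNet-S2, SpectralGPT variants achieve higher macro/micro mAP than the ViT/ImageNet-22k and SatMAE baselines; SpectralGPT+ reaches 88.22% macro mAP and 87.50% micro mAP (as reported).
The model uses 3D masking with a 90% mask ratio and multi-target reconstruction, which improves learning of spatial-spectral and spectrally sequential patterns.
Progressive pretraining lets the model handle input images of different sizes, resolutions, time series, and regions, leading to better generalization on remote sensing data.
SegMunich, a new benchmark dataset with 13 classes, was curated for semantic segmentation of urban scenes and is used to evaluate downstream remote sensing performance.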
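
The macro and micro mAP figures reported above average over classes in different ways. The sketch below shows one common, non-interpolated definition of average precision on toy data. The review does not describe the paper's actual evaluation code, so the AP definition, the function names, and the three-sample example are assumptions.

```python
def average_precision(labels, scores):
    """AP as the mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def macro_map(y_true, y_score):
    """Compute AP per class, then average: every class counts equally."""
    num_classes = len(y_true[0])
    aps = []
    for c in range(num_classes):
        labels = [row[c] for row in y_true]
        scores = [row[c] for row in y_score]
        aps.append(average_precision(labels, scores))
    return sum(aps) / num_classes


def micro_map(y_true, y_score):
    """Pool all (sample, class) pairs into one ranking: frequent classes weigh more."""
    labels = [value for row in y_true for value in row]
    scores = [value for row in y_score for value in row]
    return average_precision(labels, scores)


# Toy multi-label data: 3 samples, 2 classes (illustrative values only).
y_true = [[1, 0], [0, 1], [1, 1]]
y_score = [[0.9, 0.6], [0.1, 0.8], [0.4, 0.3]]
macro = macro_map(y_true, y_score)
micro = micro_map(y_true, y_score)
print(round(macro, 4), round(micro, 4))
# prints: 0.9167 0.8875
```

The two numbers differ even on this tiny example because the pooled ranking mixes scores from both classes, which is why benchmarks such as BigEarthNet-S2 often report both.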

This review was generated by AI and reviewed by human editors.