QUICK REVIEW

[Paper Review] DiffusionSat: A Generative Foundation Model for Satellite Imagery

Samar Khanna, Patrick Liu|arXiv (Cornell University)|Dec 6, 2023

Computational and Text Analysis Methods | Cited by 31

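One-Sentence Summary

DiffusionSat is the first large-scale latent diffusion generative model for satellite imagery. It is conditioned on text and metadata to support single-image generation, and it adds 3D conditioning for super-resolution, temporal generation, and in-painting.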

ABSTRACT

Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale generative foundation model for satellite imagery. The project website can be found here: https://samar-khanna.github.io/DiffusionSat/

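Motivation and Objectives

Motivate the need for diffusion-based generative models tailored to the spectral, temporal, and metadata characteristics of satellite imagery.
Present DiffusionSat, a latent diffusion model trained on publicly available high-resolution satellite datasets that uses metadata as a conditioning signal.
Develop a 3D conditioning extension (ControlNet-like) to enable tasks such as multi-spectral super-resolution, temporal generation, and in-painting.
Demonstrate that DiffusionSat achieves state-of-the-art results on satellite image generation and related inverse problems.
Provide a public pre-training dataset and a training protocol that can be adapted to diverse geospatial tasks.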

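Proposed Method

Use a latent diffusion framework (downsample with a VAE encoder, diffuse in latent space, upsample with the decoder), with weights initialized from Stable Diffusion.
Encode numeric satellite metadata with sinusoidal projections and a per-item MLP, then aggregate the results with the timestep embedding into the final conditioning vector.
When captions are available, condition the denoising network on CLIP-style text captions; otherwise rely on metadata and timestep conditioning.
Introduce a ControlNet-inspired 3D conditioning mechanism to handle temporal generation over image sequences, including temporal attention and 3D zero-convolutions between the SD blocks.
Train on satellite imagery from multiple datasets (fMoW, Satlas, SpaceNet) together with the associated metadata (latitude/longitude, timestamp, ground sampling distance (GSD), cloud cover, and so on).
Apply the model to single-image generation and to downstream conditional tasks (super-resolution, temporal prediction, and in-painting), showing improved metrics relative to baselines.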
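The metadata-encoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding width, the ReLU activation, the random weights, the normalized metadata values, and the choice to aggregate by summation are all assumptions made for illustration, and the names `sinusoidal_embedding`, `TinyMLP`, and `conditioning_vector` are hypothetical.

```python
import math
import random


def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Project one scalar onto `dim` sin/cos features at geometrically spaced frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


class TinyMLP:
    """Two-layer MLP with a ReLU in between (activation and sizes are assumptions)."""

    def __init__(self, dim, rng):
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


def conditioning_vector(metadata, timestep, mlps, timestep_mlp, dim=8):
    """Embed each metadata item with its own MLP and add it to the timestep embedding."""
    cond = timestep_mlp(sinusoidal_embedding(timestep, dim))
    for name, value in metadata.items():
        item = mlps[name](sinusoidal_embedding(value, dim))
        cond = [c + e for c, e in zip(cond, item)]  # aggregation by summation (assumed)
    return cond


rng = random.Random(0)
DIM = 8
# Hypothetical, already-normalized metadata values for one image.
metadata = {"lat": 0.71, "lon": 0.16, "gsd": 0.3, "cloud_cover": 0.05}
mlps = {name: TinyMLP(DIM, rng) for name in metadata}
timestep_mlp = TinyMLP(DIM, rng)
cond = conditioning_vector(metadata, 500, mlps, timestep_mlp, DIM)
print(len(cond))
```

Giving each metadata field its own MLP lets fields with very different meanings (a coordinate versus a cloud-cover fraction) be mapped into a shared space before they are combined with the timestep embedding.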

Figure 1: Conditioning on freely available metadata and using large, publicly available satellite imagery datasets shows DiffusionSat is a powerful generative foundation model for remote sensing data.

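Experimental Results

Research Questions

RQ1: Can a diffusion-based foundation model be trained effectively on satellite imagery, using metadata as the conditioning signal, to achieve high-quality single-image generation?
RQ2: Can a 3D conditioning framework reliably support multi-task generation on remote sensing data, including super-resolution, temporal prediction, and in-painting?
RQ3: Compared with text-only conditioning, how does metadata-aware conditioning affect generation quality and controllability for satellite imagery?
RQ4: Do pretrained latent diffusion weights adapted to satellite data outperform models trained from scratch on downstream inverse problems?
RQ5: How well does the model generalize across diverse satellite datasets such as fMoW, Satlas, and SpaceNet, which differ in GSD and spectral bands?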

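Key Findings

DiffusionSat achieves strong visual and perceptual quality in single-image satellite generation, outperforming baselines on FID, IS, and CLIP score.
Encoding numeric metadata with sinusoidal embeddings and per-item MLPs yields better generation quality than caption-only conditioning.
The 3D conditioning approach achieves state-of-the-art or competitive performance on downstream tasks, including multi-spectral super-resolution, temporal generation, and in-painting.
On temporal prediction and in-painting benchmarks, DiffusionSat achieves better LPIPS than baselines such as STSR and MCVD while remaining competitive on SSIM/PSNR, and it performs well across multiple datasets.
Pre-training on large public satellite datasets while freezing most of the Stable Diffusion weights, training only the denoising network and the metadata encoders, speeds up convergence and takes advantage of the existing weights.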
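Of the metrics named in these findings, PSNR has a simple closed form, so a short sketch may help readers interpret the numbers. It uses the standard definition, 10 * log10(MAX^2 / MSE). The flat pixel lists and the 8-bit value range are assumptions for illustration, and this summary does not state how the paper itself computed the metric (per channel, per image, or value range).

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values: each estimate is off by 10 intensity levels.
reference = [100.0, 100.0]
estimate = [110.0, 90.0]
print(round(psnr(reference, estimate), 2))
```

Higher PSNR means lower pixel-wise error. Because it only measures pixel-wise error, it is usually reported alongside perceptual metrics such as LPIPS, as in the findings above.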

Figure 2: DiffusionSat flexibly extends to a variety of conditional generation tasks. We design a 3D version of a ControlNet (Zhang & Agrawala, 2023) which can accept a sequence of images. Like regular ControlNets, our 3D ControlNet keeps a trainable copy of SD weights for the downsampling and midd
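The zero-convolution idea behind ControlNet-style conditioning can be shown with a deliberately tiny sketch. This is a scalar stand-in, not the paper's 3D layers: a real zero-convolution is a convolution whose weights and bias start at zero, whereas here a single scalar weight and bias (the hypothetical `zero_w` and `zero_b`) play that role, and the feature values are made up.

```python
def pointwise_conv(features, weight, bias):
    """Scalar stand-in for a 1x1 convolution: the same weight and bias per element."""
    return [weight * f + bias for f in features]


def controlled_block(frozen_out, control_out, zero_w=0.0, zero_b=0.0):
    """Add the control branch to the frozen branch through a zero-initialized layer."""
    residual = pointwise_conv(control_out, zero_w, zero_b)
    return [f + r for f, r in zip(frozen_out, residual)]


# Hypothetical activations from the frozen SD block and the trainable copy.
frozen_out = [0.2, -1.0, 0.7]
control_out = [5.0, 3.0, -2.0]

at_init = controlled_block(frozen_out, control_out)
after_training = controlled_block(frozen_out, control_out, zero_w=0.1)
print(at_init == frozen_out)
```

Because the connecting layer starts at zero, the combined model initially reproduces the frozen model's output exactly, and the control branch only begins to influence generation as training moves the weights away from zero.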


This review was generated by AI and reviewed by human editors.