QUICK REVIEW

[Paper Review] Foundation Models for Generalist Geospatial Artificial Intelligence

Johannes Jakubik, Sujit Roy|arXiv (Cornell University)|Oct 28, 2023

Flood Risk Assessment and Management|Citations: 11

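One-sentence summary

Introduces Prithvi, a 100M-parameter geospatial foundation model pre-trained on more than 1TB of Harmonized Landsat-Sentinel-2 (HLS) data, and shows that it fine-tunes successfully and data-efficiently for cloud gap imputation and for flood, wildfire scar, and crop segmentation, with the model released as open source on Hugging Face.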

ABSTRACT

Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.

Motivation and objectives

Motivate the use of foundation models in geoscience to handle unlabeled, multi-sensor remote sensing data at scale.
Develop a scalable framework for pre-training and fine-tuning geospatial foundation models directly on large multispectral time-series data.
Demonstrate Prithvi’s ability to adapt to diverse downstream tasks with limited labeled data and assess data efficiency.
Provide open-source access to model weights, architectures, and inference tooling to accelerate the Earth sciences community.

Proposed method

Propose a distributed, scalable framework linking data discovery, preprocessing, pretraining, and inference for geospatial data.
Pre-train Prithvi (100M parameters) as a masked autoencoder (MAE) with a ViT backbone on six HLS bands.
Extend MAE with 3D spatiotemporal embeddings (3D positional and 3D patch embeddings) to handle multi-temporal, multi-spectral inputs.
Use Zarr-based data loading for efficient streaming and reduced I/O bottlenecks during pretraining.
Fine-tune the pretrained encoder for downstream tasks using mmsegmentation, attaching task-specific decoder heads and loss functions.
Evaluate data loading efficiency and compare fine-tuning strategies (full model, decoder-only, no-pretraining baselines).
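
To make the 3D patch embedding and masking steps above concrete, here is a minimal NumPy sketch, not the authors' implementation. It splits a (bands, time, height, width) array into spatiotemporal patches and hides a random subset, as an MAE would before encoding. The patch size (1, 16, 16), the 75% mask ratio, the 3-timestep 224x224 input, and the function names patchify_3d and random_mask are assumptions for illustration; only the six bands come from the review.

```python
import numpy as np

def patchify_3d(x, patch=(1, 16, 16)):
    """Split a (C, T, H, W) array into flattened 3D patches (one row per token)."""
    c, t, h, w = x.shape
    pt, ph, pw = patch
    assert t % pt == 0 and h % ph == 0 and w % pw == 0, "dims must divide evenly"
    nt, nh, nw = t // pt, h // ph, w // pw
    # (C, nt, pt, nh, ph, nw, pw) -> (nt, nh, nw, C, pt, ph, pw)
    x = x.reshape(c, nt, pt, nh, ph, nw, pw).transpose(1, 3, 5, 0, 2, 4, 6)
    return x.reshape(nt * nh * nw, c * pt * ph * pw)

def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted indices of visible and masked tokens."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    masked = np.sort(perm[:num_masked])
    visible = np.sort(perm[num_masked:])
    return visible, masked

rng = np.random.default_rng(0)
# Assumed input: 6 bands, 3 timesteps, 224x224 pixels of synthetic data
x = rng.random((6, 3, 224, 224), dtype=np.float32)

tokens = patchify_3d(x)                       # one token per spatiotemporal patch
visible, masked = random_mask(len(tokens), 0.75, rng)

encoder_input = tokens[visible]               # what the encoder would see
reconstruction_target = tokens[masked]        # what the decoder would be trained to predict
```

In a real model each token would additionally pass through a learned linear projection and receive a 3D positional embedding; the sketch stops at the tokenization and masking step.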

Experimental results

Research questions

RQ1: What factors are key in designing and evaluating foundation models in geoscience?
RQ2: How can we efficiently pre-train a foundation model on repetitive, noisy remote sensing data while removing noise and redundancies?
RQ3: Can foundation models leverage diverse training features to generalize across geoscience domains with significantly less labeled data?

Key findings

batch/GPU	workers	prefetch	epoch avg time (s)
GeoTiff 64 GPUs	16	1	384
GeoTiff 8 GPUs	128	8	690
Zarr 8 GPUs	128	2	381
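
As a quick reading aid for the table above, the following sketch compares the reported average epoch times against the GeoTiff run on 8 GPUs. It uses only the row labels and the last column; the dictionary keys and variable names are illustrative, and the ratios are plain arithmetic on the three reported numbers, not additional results.

```python
# Average epoch times in seconds, copied from the table above.
epoch_time_s = {
    "GeoTiff, 64 GPUs": 384,
    "GeoTiff, 8 GPUs": 690,
    "Zarr, 8 GPUs": 381,
}

# Use the GeoTiff run on 8 GPUs as the reference point.
baseline = epoch_time_s["GeoTiff, 8 GPUs"]
speedup = {name: baseline / seconds for name, seconds in epoch_time_s.items()}

for name, ratio in speedup.items():
    print(f"{name}: {epoch_time_s[name]} s, {ratio:.2f}x vs GeoTiff on 8 GPUs")
```

On these numbers, Zarr on 8 GPUs is roughly 1.8 times faster per epoch than GeoTiff on the same GPU count, and about on par with GeoTiff on 64 GPUs.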

Pre-trained Prithvi accelerates fine-tuning relative to randomly initialized weights.
Prithvi outperforms a conditional GAN in multi-temporal cloud imputation by up to 5 percentage points (or 5.7%) in the structural similarity index.
Stratified sampling over geospatial statistics yields diverse pretraining data, reducing bias from overrepresented landscapes.
Zarr-based data loading substantially shortens epoch times: on 8 GPUs the average epoch takes 381 s with Zarr versus 690 s with GeoTiff, which helps when scaling to large GPU clusters.
Pretraining with MAE on 3D spatiotemporal patches effectively handles multi-temporal, multispectral satellite imagery.
Open-source release of Prithvi-100M weights and frameworks on Hugging Face supports reproducibility and community collaboration.
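
The stratified sampling finding above can be illustrated with a small sketch. This is one generic way to stratify, assumed for illustration and not the paper's procedure: tiles are binned by a single per-tile statistic and at most a fixed number of tiles is drawn from each bin, so an overrepresented landscape cannot dominate the sample. The statistic, the bin count, the per-bin quota, and the function name stratified_sample are all hypothetical.

```python
import numpy as np

def stratified_sample(values, num_bins, per_bin, rng):
    """Pick up to `per_bin` tile indices from each equal-width bin of `values`."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), num_bins + 1)
    # Digitizing against the interior edges yields bin ids in 0..num_bins-1.
    bin_ids = np.digitize(values, edges[1:-1])
    chosen = []
    for b in range(num_bins):
        members = np.flatnonzero(bin_ids == b)
        if members.size == 0:
            continue  # nothing to draw from an empty bin
        take = min(per_bin, members.size)
        chosen.append(rng.choice(members, size=take, replace=False))
    return np.sort(np.concatenate(chosen))

rng = np.random.default_rng(0)
# Hypothetical per-tile statistic: 900 tiles from a common landscape
# (indices 0..899) and 100 tiles from a rare one (indices 900..999).
stat = np.concatenate([rng.normal(10.0, 1.0, 900), rng.normal(30.0, 1.0, 100)])

picked = stratified_sample(stat, num_bins=8, per_bin=50, rng=rng)
rare_share = (picked >= 900).mean()  # fraction of the sample from the rare landscape
```

In this toy setup the rare landscape makes up 10% of the raw pool but a larger share of the stratified sample, which is the bias-reduction effect the finding describes.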

This review was generated by AI and checked by a human editor.