QUICK REVIEW

[Paper Review] TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis

Zhengpeng Feng, Clement Atzberger | arXiv (Cornell University) | Jun 25, 2025

Image Retrieval and Classification Techniques | 30 citations

TL;DR

TESSERA learns 128-dimensional, pixel-wise representations from global 10m Sentinel-1 and Sentinel-2 time series using self-supervised learning, enabling strong performance on diverse downstream EO tasks with pre-computed global maps.

ABSTRACT

Satellite Earth-observation (EO) time series in the optical and microwave ranges of the electromagnetic spectrum are often irregular due to orbital patterns and cloud obstruction. Compositing addresses these issues but loses information with respect to vegetation phenology, which is critical for many downstream tasks. Instead, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient embeddings. During model training, TESSERA uses Barlow Twins and sparse random temporal sampling to enforce invariance to the selection of valid observations. We employ two key regularizers: global shuffling to decorrelate spatial neighborhoods and mix-based regulation to improve invariance under extreme sparsity. We find that for diverse classification, segmentation, and regression tasks, TESSERA embeddings deliver state-of-the-art accuracy with high label efficiency, often requiring only a small task head and minimal computation. To democratize access, adhere to FAIR principles, and simplify use, we release global, annual, 10m, pixel-wise int8 embeddings together with open weights/code and lightweight adaptation heads, thus providing practical tooling for large-scale retrieval and inference at planetary scale. The model training/inference code, downstream task code, and pre-generated embeddings can be accessed at https://github.com/ucam-eo

Motivation & Objective

Motivate the need for high-resolution, temporally rich representations in Earth observation amid data gaps and labeling scarcity.
Propose a self-supervised, dual-encoder foundation model to fuse optical and SAR time series.
Generate global 10m, annual representations (2017–2024) and enable downstream tasks with fixed embeddings.
Demonstrate state-of-the-art performance across crop classification, canopy height estimation, burned area detection, biomass estimation, and carbon-market indices.
Provide open-source access and a model-as-data approach to lower barriers for practitioners.

Proposed method

Process unlabeled Sentinel-1 SAR and Sentinel-2 MSI time series per 10m pixel into modality-specific d-pixels (timesteps by channels).
Use two parallel Transformer encoders (one for SAR VV/VH, one for MSI spectra) with DOY-based temporal encodings and an attention-pooling layer to produce 128-dimensional per-modality representations.
Fuse modality embeddings with an MLP to form a 128-dimensional fused representation per pixel.
Expand fused representations to 16,384 dimensions with a large projector network.
Train with a modified Barlow Twins loss (L_BT + L_MIX) on cross-correlation of projected features, using two augmented views via sparse temporal sampling of annual observations.
During inference, freeze encoders to generate annual 10m representations for 2017–2024 and produce global representation maps.
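
The two training ingredients named above, sparse temporal sampling and a Barlow Twins objective on projected features, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `valid_mask` input, and the off-diagonal weight of 5e-3 (the default from the original Barlow Twins paper) are assumptions, and the additional L_MIX term is omitted because this review does not give its form.

```python
import numpy as np


def sample_sparse_view(series, valid_mask, n_steps, rng):
    """Draw one augmented view by picking n_steps valid observations at random.

    series:     (T, C) array, one pixel's time series (timesteps by channels).
    valid_mask: (T,) boolean array, True where the observation is usable
                (hypothetical input, e.g. a cloud-free flag).
    Returns the sampled (n_steps, C) view and the chosen time indices.
    """
    valid_idx = np.flatnonzero(valid_mask)
    if valid_idx.size == 0:
        raise ValueError("pixel has no valid observations")
    # Sample with replacement only when there are too few valid timesteps.
    replace = valid_idx.size < n_steps
    chosen = np.sort(rng.choice(valid_idx, size=n_steps, replace=replace))
    return series[chosen], chosen


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-9):
    """Standard Barlow Twins loss on two batches of projected features.

    z_a, z_b: (N, D) projector outputs for the two views of the same N pixels.
    The loss pulls the D x D cross-correlation matrix toward the identity.
    """
    n = z_a.shape[0]
    # Standardise every feature dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)           # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy-reduction term
    return on_diag + lambda_offdiag * off_diag
```

In this sketch, two calls to `sample_sparse_view` on the same pixel give the two views; each view would pass through the encoders and the projector before `barlow_twins_loss` compares the resulting batches.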

Experimental results

Research questions

RQ1: Can self-supervised, multi-modal temporal embeddings from Sentinel-1 and Sentinel-2 outperform traditional feature engineering and existing foundation models across diverse EO tasks?
RQ2: Do global 10m, annual representations generalize to crop classification, canopy height, burned area, and biomass estimation, especially under low-label regimes?
RQ3: How well do the learned representations capture temporal dynamics and disturbances (e.g., fires) without explicit preprocessing?
RQ4: Does the open-source, precomputed representation map approach facilitate broader adoption and reproducibility in EO research?

Key findings

TESSERA representations yield state-of-the-art performance on downstream tasks compared to traditional baselines and other foundation models.
In crop type classification on the Austrian INVEKOS dataset, TESSERA with a simple MLP outperforms Random Forest and PRESTO embeddings across data regimes, including one-shot learning.
Canopy height estimation in tropical Danum Valley shows TESSERA achieving R^2 = 0.66, RMSE = 8.88 m, and bias = -0.62 m, outperforming global and regional CHM products.
Burned area analysis demonstrates that TESSERA embeddings separate burned vs. unburned areas and differentiate fire timing and severity in UMAP projections.
Across multiple tasks, TESSERA remains robust under limited labeled data, often surpassing or matching bespoke models.
The model supports a “Model-as-Data” paradigm with precomputed 10m representations, reducing preprocessing needs for end users.
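
For readers who want to compute the same kind of figures as the canopy height result (R^2, RMSE, bias), a minimal sketch follows. It assumes the common definitions: R^2 as the coefficient of determination, and bias as the mean of predicted minus reference values. The paper may define these differently (for example R^2 as a squared correlation), and the function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, bias) for paired reference and predicted values.

    Assumed definitions (not confirmed against the paper):
      r2   = 1 - SS_res / SS_tot   (coefficient of determination)
      rmse = sqrt(mean((pred - true)^2))
      bias = mean(pred - true), so negative means underestimation
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / n
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("reference values are constant; r2 is undefined")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    bias = sum(residuals) / n
    return r2, rmse, bias
```

Under these definitions, a bias of -0.62 m would mean the predicted canopy heights are on average 0.62 m lower than the reference heights.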

This review was created by AI and reviewed by human editors.