QUICK REVIEW

[Paper Review] GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery

Han Jinzhen, JinByeong Lee | arXiv (Cornell University) | Feb 10, 2026

Remote-Sensing Image Classification | Cited by 0

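ONE-SENTENCE SUMMARY

GeoFormer jointly predicts scene-level building height and footprint on a 100 m grid from Sentinel-1/2 imagery and open DEM data, generalises well across cities, and ships with publicly released code and models.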

ABSTRACT

Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet remain scarce due to reliance on proprietary sensors or poor cross-city generalisation. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving 7.5% and 15.3% over the strongest CNN baseline, while maintaining under 3.5 m BH RMSE in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.

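MOTIVATION AND OBJECTIVES

Address the need for scalable, global, 100 m resolution 3D urban data built from open data sources.
Develop a model that jointly estimates BH (building height) and BF (building footprint) from Sentinel-1/2 and DEM inputs.
Enforce spatially independent train/test splits so that cross-city generalisation is evaluated robustly.
Demonstrate the effectiveness of a Swin Transformer-based multi-task architecture for urban morphology mapping.
Release open-source code, weights, and global products for broad reuse.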

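PROPOSED METHOD

GeoFormer is a Swin Transformer-based multi-task model that jointly predicts BH and BF at 100 m resolution.
Multi-source inputs (Sentinel-1 SAR, Sentinel-2 optical, DEM) are fused into an 8-channel tensor.
A 3×3, 5×5, or 9×9 context window around the central 100 m grid cell is used to learn contextual features.
The centre token is extracted from the Swin output and passed to two task-specific heads (ReLU for BH regression; sigmoid for BF).
Training uses an uncertainty-weighted multi-task loss combined with an adaptive Huber loss.
A spatially aware data-splitting strategy (GeoSplit) enforces strict train/test independence and prevents information leakage.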
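
The last two steps above (the loss and the spatial split) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The loss assumes the common learned log-variance form of uncertainty weighting, exp(-s)·L + s per task, and uses a fixed Huber threshold where the paper uses an adaptive one whose exact form is not given here. The split assumes square blocks in projected coordinates. All function names, `delta` values, `block_size`, and `test_fraction` are hypothetical.

```python
import numpy as np


def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear in the tails."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * abs_r**2, delta * (abs_r - 0.5 * delta))


def multitask_loss(bh_raw, bf_raw, bh_true, bf_true, log_var_bh=0.0, log_var_bf=0.0):
    """Uncertainty-weighted sum of two Huber losses (assumed form, fixed deltas)."""
    bh_pred = np.maximum(bh_raw, 0.0)          # ReLU head: heights are non-negative
    bf_pred = 1.0 / (1.0 + np.exp(-bf_raw))    # sigmoid head: footprint in (0, 1)
    loss_bh = huber(bh_pred - bh_true, delta=1.0).mean()
    loss_bf = huber(bf_pred - bf_true, delta=0.1).mean()
    # Each task loss is scaled by exp(-log_var) and regularised by log_var,
    # so a task the model is unsure about is down-weighted but not ignored.
    return (np.exp(-log_var_bh) * loss_bh + log_var_bh
            + np.exp(-log_var_bf) * loss_bf + log_var_bf)


def geo_block_split(x, y, block_size=5000.0, test_fraction=0.2, seed=0):
    """Assign whole spatial blocks to the test set; returns a boolean test mask."""
    bx = np.floor(np.asarray(x) / block_size).astype(np.int64)
    by = np.floor(np.asarray(y) / block_size).astype(np.int64)
    block_ids = np.stack([bx, by], axis=1)
    unique_blocks, inverse = np.unique(block_ids, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)                # one block index per sample
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * len(unique_blocks))))
    test_blocks = rng.choice(len(unique_blocks), size=n_test, replace=False)
    # A sample is a test sample only if its entire block was drawn for testing.
    return np.isin(inverse, test_blocks)
```

Because samples are assigned block by block, neighbouring grid cells from the same block never end up on opposite sides of the split, which is the property a spatially independent evaluation needs. In a training framework the two log-variance terms would be learnable parameters rather than fixed arguments.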

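EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a Swin Transformer-based multi-task model jointly predict building height and footprint at 100 m grid resolution using only Sentinel imagery and open DEM data?
RQ2: How does multi-source data fusion (SAR, optical, DEM) affect BH and BF estimation accuracy compared with single-modality baselines?
RQ3: How does receptive-field size affect BH/BF accuracy and generalisation across different urban morphologies?
RQ4: Can the model generalise well across cities, across continents, and in post-disaster scenarios without relying on proprietary data or vector inputs?
RQ5: What role does the DEM play in height estimation versus footprint estimation?

Key Findings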

Building height (BH; RMSE, MAE, ME, and NMAD in metres)
Model	RMSE	MAE	ME	NMAD	CC	R^2
UNet-MTL	3.45	1.64	-0.35	1.32	0.78	0.60
GeoFormer 3×3	3.35	1.60	-0.35	1.31	0.80	0.63
GeoFormer 5×5	3.19	1.53	-0.16	1.26	0.81	0.66
GeoFormer 9×9	3.37	1.58	-0.36	1.26	0.80	0.62

Building footprint (BF; unitless)
Model	RMSE	MAE	ME	NMAD	CC	R^2
UNet-MTL	0.06	0.03	0.00	0.03	0.86	0.74
GeoFormer 3×3	0.05	0.03	-0.01	0.03	0.89	0.79
GeoFormer 5×5	0.05	0.03	0.00	0.03	0.90	0.80
GeoFormer 9×9	0.05	0.03	0.00	0.03	0.89	0.79
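
For readers unfamiliar with the column names, the sketch below computes the six metrics from paired predictions and reference values. It assumes the conventional definitions: errors taken as prediction minus reference, NMAD as 1.4826 times the median absolute deviation of the errors, CC as the Pearson correlation, and R^2 as the coefficient of determination. The review does not state the paper's exact formulas, so these are assumptions, and `regression_metrics` is a hypothetical helper name.

```python
import numpy as np


def regression_metrics(pred, true):
    """Return RMSE, MAE, ME, NMAD, CC, and R^2 under conventional definitions."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    err = pred - true                                   # assumed sign convention
    ss_res = np.sum(err**2)                             # residual sum of squares
    ss_tot = np.sum((true - true.mean())**2)            # total sum of squares
    return {
        "RMSE": float(np.sqrt(np.mean(err**2))),
        "MAE": float(np.mean(np.abs(err))),
        "ME": float(np.mean(err)),                      # signed bias
        "NMAD": float(1.4826 * np.median(np.abs(err - np.median(err)))),
        "CC": float(np.corrcoef(pred, true)[0, 1]),     # Pearson correlation
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Toy usage with made-up heights in metres (not data from the paper).
metrics = regression_metrics([1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0])
```

Under these definitions a negative ME, as in most BH rows, would indicate that heights are underestimated on average, while NMAD is less sensitive to outliers than RMSE.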

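Across 54 cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.050, improving on the strongest CNN baseline by 7.5% for BH and 15.3% for BF.
Among the context sizes tested, the 5×5 receptive field gives the best overall BH/BF accuracy and generalisation.
Ablation studies show that the DEM is indispensable for height estimation, that optical data outperform SAR for height retrieval, and that multi-source fusion yields the best overall accuracy.
Cross-city, cross-continent, and post-disaster evaluations indicate that the 100 m GeoFormer approach generalises robustly.
Beyond a certain point, added model capacity leads to overfitting and poorer generalisation; too much context causes over-smoothing.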
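
The reported improvements can be checked against the RMSE values in the table with a short calculation. This sketch assumes the UNet-MTL rows are the "strongest CNN baseline" and the GeoFormer 5×5 rows are the reported model. The BH figure reproduces from the table, while the BF figure does not reproduce exactly from the two-decimal values, presumably because the paper computes it from unrounded RMSEs.

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# BH: UNet-MTL 3.45 m vs GeoFormer 5x5 3.19 m (table values).
bh_gain = relative_improvement(3.45, 3.19)          # about 7.5, matches the text

# BF: only two-decimal values are available in the table (0.06 vs 0.05),
# so this gives about 16.7 rather than the reported 15.3.
bf_gain_rounded = relative_improvement(0.06, 0.05)
```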


This review was generated by AI and checked by a human editor.