QUICK REVIEW

[Paper Review] UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction

Xixuan Hao, Wei Chen | arXiv (Cornell University) | Mar 25, 2024

Land Use and Ecosystem Services | Cited by 5

One-sentence summary

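UrbanVLP introduces a multi-granularity vision-language pretraining model that fuses macro-level satellite data with micro-level street-view information, and adds automatic text generation and calibration to improve the interpretability of urban indicator prediction.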

ABSTRACT

Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced details at micro levels, such as architectural details at a place. Secondly, the text generated by the precursor work UrbanCLIP, which fully utilizes the extensive knowledge of LLMs, frequently exhibits issues such as hallucination and homogenization, resulting in a lack of reliable quality. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, providing a robust guarantee for producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six socioeconomic indicator prediction tasks underscore its superior performance.

Motivation and objectives

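Identify and address the bias that arises when urban indicator prediction relies solely on macro-level satellite data.
Develop a vision-language pretraining framework that integrates macro (satellite) and micro (street-view) data to obtain rich urban representations.
Introduce automatic text generation and calibration mechanisms to improve the interpretability of predictions.
Provide a scalable, interpretable baseline and a benchmark spanning multiple urban indicator tasks.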

Proposed method

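Two-stage framework: (i) automatic text generation for street-view images using ShareGPT4V with prompt design, calibrated with PerceptionScore; (ii) multi-granularity cross-modal alignment using dual-branch contrastive learning (global image-text alignment at the satellite level and fine-grained token-level alignment at the street-view level), with location embeddings added.
ViT-based encoders are used for both the satellite and street-view streams, alongside a separate text encoder and a GeoCLIP-inspired location encoder that injects geographic coordinates.
A global contrastive loss LCG, with image-to-text and text-to-image terms, aligns satellite-level representations; a fine-grained token-level similarity and contrastive loss (LCL) aligns street-view tokens with text tokens.
Satellite features, aggregated street-view features, and location information are fused into a region representation; with the encoders frozen, a lightweight MLP is trained for downstream urban indicator prediction.
The second stage performs linear probing on the frozen features, predicting urban indicators via Y = MLP(e_sa, e_sv, e_t).
PerceptionScore (CLIPScore plus CycleScore) is introduced to assess text quality without references; CycleScore uses text-to-image generation (SDXL) followed by a segmentation-based MAE to check visual-semantic consistency.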
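
The two alignment objectives described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the temperature of 0.07, and the "best match per image token, then average" rule for token-level similarity are all assumptions chosen for illustration, since this review does not specify them.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def global_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text plus text-to-image contrastive loss.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    The temperature value is an assumption, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def token_level_similarity(image_tokens, text_tokens):
    """Assumed fine-grained score: each image token takes its best-matching
    text token (cosine similarity), and the maxima are averaged."""
    imgs = [l2_normalize(v) for v in image_tokens]
    txts = [l2_normalize(v) for v in text_tokens]
    return sum(max(dot(i, t) for t in txts) for i in imgs) / len(imgs)

# Toy usage: correctly paired embeddings should give a lower loss
# than the same embeddings with the pairing shuffled.
aligned = global_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = global_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In this sketch the global loss operates on one embedding per satellite image and caption, while the token-level score compares sets of patch and word embeddings, which mirrors the macro versus micro split described in the method.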

Figure 1. Single-granularity vs. Multi-granularity modeling.

Experimental results

Research questions

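RQ1: Does UrbanVLP outperform the baselines and generalize across urban indicator tasks?
RQ2: How do the satellite (macro) and street-view (micro) branches and the location encoding contribute to performance?
RQ3: How do automatic text generation and calibration affect text quality and downstream prediction?
RQ4: How practical is UrbanVLP to deploy in the real world (for example, through a web platform)?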

Key findings

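UrbanVLP outperforms the baselines on the reported tasks, with an average $R^{2}$ improvement of 3.55%.
Multi-granularity cross-modal alignment draws on both macro-level satellite and micro-level street-view information to strengthen region representations.
Automatic text generation with perception-based calibration (PerceptionScore) yields higher-quality descriptions that align with image content.
Across experiments on six downstream indicators, UrbanVLP shows stronger predictive ability than several baselines, including UrbanCLIP variants and PG-SimCLR.
The authors validate its practicality through a deployed web platform, showing end-to-end applicability.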
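
For readers unfamiliar with the metric behind these findings, the sketch below computes the coefficient of determination from its standard definition. The helper `mean_relative_gain` is hypothetical: this review does not say whether the 3.55% figure is a relative or an absolute gain, so the relative reading shown here is only one possible interpretation.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot

def mean_relative_gain(model_scores, baseline_scores):
    """Hypothetical helper: average relative improvement, in percent,
    of per-task model scores over per-task baseline scores."""
    gains = [(m - b) / b * 100.0 for m, b in zip(model_scores, baseline_scores)]
    return sum(gains) / len(gains)
```

A perfect prediction gives an $R^{2}$ of 1, and always predicting the mean of the targets gives 0, so higher values indicate that more of the variance in the indicator is explained.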

Figure 2. $R^{2}$ performance on the Beijing and Shenzhen datasets.

This review was generated by AI and reviewed by a human editor.