QUICK REVIEW

[Paper Review] Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining

Yuxuan Li, Yuming Chen|arXiv (Cornell University)|Mar 2, 2026

Multimodal Machine Learning Applications | Cited by 0

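One-Sentence Summary

BabelRS introduces language-pivoted pretraining that combines Concept-Shared Instruction Aligning with Layerwise Visual-Semantic Annealing. By decoupling modality alignment from detection fine-tuning, it stabilizes training and reports state-of-the-art results on RGB, SAR, and infrared imagery.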

ABSTRACT

Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.

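Motivation and Objectives

Motivation: late alignment makes heterogeneous multi-modal remote sensing (RS) detection unstable; decoupling alignment from task learning is intended to reduce that instability.
Propose BabelRS, which aligns modalities to a shared linguistic concept space through instruction-following pretraining.
Bridge semantic alignment and dense detection with a layerwise, multi-scale visual-semantic annealing mechanism.
Enable modality-agnostic fine-tuning after pretraining with a simple joint detection objective.
Propose a metric, Harmonic Modality mAP (H-mAP), to assess how balanced performance is across modalities.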

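Proposed Method

Concept-Shared Instruction Aligning (CSIA) uses a pretrained large language model as a semantic pivot, mapping RGB, SAR, and infrared images to shared linguistic concepts through an instruction-following objective.
Layerwise Visual-Semantic Annealing (LVSA) integrates multi-scale ViT features into the language-aligned space layer by layer, to resolve the granularity mismatch with dense detection.
Pretraining runs on diverse multi-modal RS datasets without requiring spatially aligned image pairs.
Fine-tuning uses a simple joint detection objective with a shared backbone and modality-specific heads, and no additional alignment loss.
Harmonic Modality mAP (H-mAP) is defined as the harmonic mean of the per-modality mAP values, which penalizes weak performance on any single modality.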
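
The H-mAP definition above can be illustrated with a minimal sketch. It assumes only what the summary states, namely that H-mAP is the harmonic mean of per-modality mAP values. The function name `harmonic_modality_map`, the dictionary layout, and the example scores are hypothetical and are not taken from the paper or its code release.

```python
from statistics import mean


def harmonic_modality_map(per_modality_map):
    """Harmonic mean of per-modality mAP values (hypothetical helper).

    per_modality_map: dict mapping a modality name to its mAP in [0, 1].
    """
    values = list(per_modality_map.values())
    if not values:
        raise ValueError("at least one modality is required")
    if any(v < 0 for v in values):
        raise ValueError("mAP values must be non-negative")
    # A zero on any modality drives the harmonic mean to zero.
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Illustrative numbers only; these are NOT results reported in the paper.
balanced = {"rgb": 0.6, "sar": 0.6, "infrared": 0.6}
skewed = {"rgb": 0.9, "sar": 0.6, "infrared": 0.3}

for name, scores in (("balanced", balanced), ("skewed", skewed)):
    arithmetic = mean(scores.values())
    h_map = harmonic_modality_map(scores)
    print(f"{name}: arithmetic mean = {arithmetic:.4f}, H-mAP = {h_map:.4f}")
```

Both example detectors have the same arithmetic mean, but the skewed one receives a lower H-mAP because the weak infrared score dominates the harmonic mean. This is the penalizing behavior the metric is meant to provide.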

Figure 1 : Conceptual comparison between (a) late alignment and (b) early, language-pivoted alignment paradigms for heterogeneous multi-modal remote sensing detection. Late alignment (a) entangles modality alignment with task optimization during fine-tuning, leading to gradient conflicts and unstabl

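Experimental Results

Research Questions

RQ1: Can language-pivoted pretraining achieve cross-modal alignment across heterogeneous RS modalities without spatially paired data?
RQ2: Does early semantic alignment improve optimization stability and generalization compared with late-alignment methods?
RQ3: Does LVSA provide effective multi-scale guidance for dense object detection across modalities?
RQ4: After language-pivoted pretraining, is simple joint fine-tuning sufficient for multi-modal RS detection?
RQ5: Is H-mAP a robust metric for evaluating how balanced performance is across modalities?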

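Key Findings

Under Automatic Mixed Precision (AMP), BabelRS optimizes stably during fine-tuning, whereas several late-alignment baselines become unstable.
Compared with prior pretraining strategies, BabelRS performs better on RGB, SAR, and infrared on the SOI-Det benchmark.
With a shared projection head, LVSA fuses features more effectively than simple intermediate-layer merging strategies.
BabelRS shows notable gains in the SAR and infrared domains, where generic pretraining tends to underperform.
The proposed H-mAP metric reflects cross-modal reliability better than global mAP.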

Figure 2 : Automatic Mixed Precision fine-tuning stability on SOI-Det dataset. Many existing models experience gradient explosion before completion, whereas BabelRS remains stable throughout fine-tuning.

This review was generated by AI and reviewed by a human editor.