QUICK REVIEW

[Paper Review] Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining

Yuxuan Li, Yuming Chen|arXiv (Cornell University)|Mar 2, 2026

Multimodal Machine Learning Applications | Citations: 0

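One-Sentence Summary

tldr: BabelRS introduces language-pivoted pretraining built on Concept-Shared Instruction Aligning and Layerwise Visual-Semantic Annealing, decoupling modality alignment from detection to achieve stable training and state-of-the-art results on RGB, SAR, and infrared imagery.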

ABSTRACT

Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.

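Motivation and Objectives

Motivated by the instability of late alignment in heterogeneous multi-modal remote sensing (RS) detection, the work reduces that instability by decoupling modality alignment from task learning.
Proposes BabelRS, which uses instruction-following pretraining to align modalities to a shared space of linguistic concepts.
Bridges semantic alignment and dense detection with a layerwise visual-semantic annealing mechanism.
Enables modality-agnostic fine-tuning with a simple joint detection objective after pretraining.
Proposes a metric, the harmonic modality mAP (H-mAP), for evaluating balanced cross-modal performance.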

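Proposed Method

Concept-Shared Instruction Aligning (CSIA) uses a pretrained large language model as a semantic pivot and maps RGB, SAR, and infrared inputs to shared linguistic concepts through an instruction-following objective.
Layerwise Visual-Semantic Annealing (LVSA) progressively integrates multi-scale ViT features, layer by layer, into the language-aligned space to address the granularity mismatch with dense detection.
Pretraining does not require spatially aligned image pairs and is carried out on separate, unpaired multi-modal RS datasets.
Fine-tuning uses a simple joint detection objective with a shared backbone and modality-specific heads, with no additional alignment loss.
The harmonic modality mAP (H-mAP) is defined as the harmonic mean of the modality-specific mAPs, which penalizes a performance drop in any single modality.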
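
To make the H-mAP definition concrete, the following is a minimal sketch that computes the harmonic mean of per-modality mAP values and compares it with a plain arithmetic mean. The function name `harmonic_modality_map`, the dictionary keys, and the scores are illustrative assumptions, not taken from the paper or its code; the review only states that H-mAP is the harmonic mean of the modality-specific mAPs.

```python
from typing import Mapping


def harmonic_modality_map(per_modality_map: Mapping[str, float]) -> float:
    """Harmonic mean of per-modality mAP values (hypothetical helper)."""
    if not per_modality_map:
        raise ValueError("at least one modality is required")
    values = list(per_modality_map.values())
    if any(v < 0 for v in values):
        raise ValueError("mAP values must be non-negative")
    # A harmonic mean collapses to zero as soon as any term is zero.
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Made-up scores for illustration only; these are not results from the paper.
balanced = {"rgb": 0.60, "sar": 0.60, "infrared": 0.60}
skewed = {"rgb": 0.90, "sar": 0.60, "infrared": 0.30}

for name, scores in (("balanced", balanced), ("skewed", skewed)):
    arithmetic = sum(scores.values()) / len(scores)
    h_map = harmonic_modality_map(scores)
    print(f"{name}: arithmetic={arithmetic:.3f} h_map={h_map:.3f}")
```

Both made-up score sets have the same arithmetic mean (0.600), but the skewed set's harmonic mean drops to about 0.491 because the weak infrared score dominates the sum of reciprocals. This is the sense in which a harmonic-mean metric penalizes imbalance across modalities.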

Figure 1 : Conceptual comparison between (a) late alignment and (b) early, language-pivoted alignment paradigms for heterogeneous multi-modal remote sensing detection. Late alignment (a) entangles modality alignment with task optimization during fine-tuning, leading to gradient conflicts and unstabl

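Experimental Results

Research Questions

RQ1: Can language-pivoted pretraining enable cross-modal alignment across heterogeneous RS modalities that are not spatially paired?
RQ2: Does early semantic alignment improve optimization stability and generalization compared with late alignment methods?
RQ3: Does Layerwise Visual-Semantic Annealing provide effective multi-scale guidance for dense object detection across modalities?
RQ4: Is simple joint fine-tuning after language-pivoted pretraining sufficient for multi-modal RS detection?
RQ5: Is H-mAP a robust metric for evaluating cross-modal reliability?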

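Key Findings

BabelRS optimizes stably during fine-tuning under Automatic Mixed Precision (AMP), unlike several late alignment baselines.
Compared with conventional pretraining strategies, BabelRS performs better on each of the RGB, SAR, and infrared modalities of the SOI-Det benchmark.
LVSA-based feature fusion with a shared projector outperforms simple intermediate-layer integration strategies.
BabelRS shows strong gains in the SAR and infrared domains, where generic pretraining often underperforms.
The proposed H-mAP reflects cross-modal reliability better than global mAP.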

Figure 2 : Automatic Mixed Precision fine-tuning stability on SOI-Det dataset. Many existing models experience gradient explosion before completion, whereas BabelRS remains stable throughout fine-tuning.

This review was generated by AI and checked by a human editor.