QUICK REVIEW

[Paper Review] Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining

Yuxuan Li, Yuming Chen | arXiv (Cornell University) | 2026-03-02

Multimodal Machine Learning Applications | Citations: 0

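One-Line Summary

BabelRS decouples modality alignment from detection through language-pivoted pretraining, which combines Concept-Shared Instruction Aligning and Layerwise Visual-Semantic Annealing, and achieves stable training and state-of-the-art results across RGB, SAR, and infrared imagery.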

ABSTRACT

Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.

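Motivation and Objectives

Identifies the training instability of late alignment in heterogeneous multi-modal remote sensing (RS) detection as the motivation, and reduces it by decoupling modality alignment from task learning.
Proposes BabelRS, which aligns modalities through instruction-following training.
Bridges semantic alignment and dense detection with a layerwise, multi-scale visual-semantic annealing mechanism.
Enables modality-agnostic fine-tuning with a simple joint detection objective after pretraining.
Proposes a metric, Harmonic Modality mAP, for evaluating balanced cross-modal performance.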

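Proposed Method

Concept-Shared Instruction Aligning (CSIA) uses a pretrained large language model as a semantic pivot, applying an instruction-following objective to map RGB, SAR, and infrared images onto shared linguistic concepts.
Layerwise Visual-Semantic Annealing (LVSA) progressively fuses multi-scale ViT features into the language-aligned space to resolve the granularity mismatch with dense detection.
Pretraining is performed across multi-modal RS datasets without requiring spatially aligned image pairs.
Fine-tuning uses a simple joint detection objective with a shared backbone and modality-specific heads; no additional alignment loss is required.
Harmonic Modality mAP (H-mAP) is the harmonic mean of the per-modality mAPs, so weak performance on any single modality is penalized.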
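The fine-tuning setup described above (one shared backbone, one head per modality, a single joint objective with no extra alignment term) can be illustrated with a toy sketch. This is not the authors' implementation: `shared_backbone`, `make_head`, `HEADS`, and `joint_detection_loss` are hypothetical names, the backbone is a ReLU placeholder, each head is a scalar weight, and squared error stands in for a real detection loss.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[str, Sequence[float], float]  # (modality, input, target)


def shared_backbone(x: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the pretrained, modality-aligned encoder."""
    return [max(0.0, v) for v in x]  # ReLU placeholder, same weights for all modalities


def make_head(weight: float) -> Callable[[Vector], float]:
    """Build a toy modality-specific head (a single scalar weight)."""
    def head(features: Vector) -> float:
        return weight * sum(features)
    return head


# One head per modality; the weights are arbitrary illustrative values.
HEADS: Dict[str, Callable[[Vector], float]] = {
    "rgb": make_head(0.5),
    "sar": make_head(1.0),
    "infrared": make_head(2.0),
}


def joint_detection_loss(batch: Sequence[Sample]) -> float:
    """Average a per-sample loss over a mixed-modality batch.

    Every sample passes through the same backbone and is routed to its own
    head. Only the task loss is summed: there is no separate alignment term.
    """
    if not batch:
        raise ValueError("batch must not be empty")
    total = 0.0
    for modality, x, target in batch:
        prediction = HEADS[modality](shared_backbone(x))
        total += (prediction - target) ** 2  # squared error as a stand-in loss
    return total / len(batch)


if __name__ == "__main__":
    mixed_batch = [("rgb", [1.0, -1.0], 0.5), ("sar", [2.0], 2.0)]
    print(joint_detection_loss(mixed_batch))
```

The point of the sketch is structural: samples from different sensors share one batch and one encoder, and the only modality-specific component is the head chosen at routing time.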

Figure 1 : Conceptual comparison between (a) late alignment and (b) early, language-pivoted alignment paradigms for heterogeneous multi-modal remote sensing detection. Late alignment (a) entangles modality alignment with task optimization during fine-tuning, leading to gradient conflicts and unstabl

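Experimental Results

Research Questions

RQ1: Can language-pivoted pretraining enable cross-modal alignment across heterogeneous RS modalities without spatially paired data?
RQ2: Does early semantic alignment improve optimization stability and generalization compared with late-alignment methods?
RQ3: Does Layerwise Visual-Semantic Annealing provide sufficient multi-scale guidance across modalities?
RQ4: Is simple joint fine-tuning after language-pivoted pretraining sufficient for multi-modal RS detection?
RQ5: Is H-mAP a robust metric for evaluating balanced cross-modal performance?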

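Key Results

BabelRS optimizes stably during fine-tuning under Automatic Mixed Precision (AMP) and outperforms several late-alignment baselines.
Compared with prior pretraining strategies, BabelRS delivers better performance on RGB, SAR, and infrared alike on the SOI-Det benchmark.
LVSA-based feature fusion with a shared projector outperforms simple intermediate-layer merging strategies.
BabelRS shows strong gains in the SAR and infrared domains, where generic pretraining often underperforms.
The proposed H-mAP metric reflects cross-modal reliability better than global mAP.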
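To make the H-mAP claim concrete, here is a minimal sketch of the metric as this review defines it: the harmonic mean of per-modality mAP values. The function names and the example scores are illustrative assumptions, not figures from the paper, and the arithmetic mean is used only as a simple comparison point; the paper's global mAP may be computed differently.

```python
from typing import Mapping


def harmonic_modality_map(per_modality_map: Mapping[str, float]) -> float:
    """Harmonic mean of per-modality mAP scores (each expected in [0, 1])."""
    values = list(per_modality_map.values())
    if not values:
        raise ValueError("at least one modality is required")
    if any(v < 0.0 for v in values):
        raise ValueError("mAP values must be non-negative")
    if any(v == 0.0 for v in values):
        return 0.0  # a modality that fails completely drives the score to zero
    return len(values) / sum(1.0 / v for v in values)


def arithmetic_modality_map(per_modality_map: Mapping[str, float]) -> float:
    """Plain average of per-modality mAP scores, for comparison only."""
    values = list(per_modality_map.values())
    if not values:
        raise ValueError("at least one modality is required")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up scores: strong on RGB and infrared, weak on SAR.
    scores = {"rgb": 0.6, "sar": 0.3, "infrared": 0.6}
    print(arithmetic_modality_map(scores))
    print(harmonic_modality_map(scores))
```

With these made-up scores the arithmetic mean is 0.5 while the harmonic mean is 0.45, so the weak SAR result pulls H-mAP down more than it pulls down a plain average. When all modalities score equally, the two values coincide.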

Figure 2 : Automatic Mixed Precision fine-tuning stability on SOI-Det dataset. Many existing models experience gradient explosion before completion, whereas BabelRS remains stable throughout fine-tuning.


This review was generated by AI and reviewed by a human editor.