QUICK REVIEW

[Paper Review] OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

Ronghao Fu, Haoran Liu | arXiv (Cornell University) | 2026-03-10

Multimodal Machine Learning Applications | Citations: 0

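One-line summary

OmniEarth introduces a bias-aware benchmark of 28 tasks for remote sensing vision-language models (RSVLMs), spanning perception, reasoning, and robustness in spatiotemporal geospatial settings. It contains 9,275 images and 44,210 instructions, and evaluates 19 models in the zero-shot setting.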

ABSTRACT

Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.

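Motivation and objectives

Evaluate RSVLM capabilities across perception, reasoning, and robustness in Earth observation contexts.
Provide a fine-grained, bias-aware benchmark that covers multi-source geospatial data and temporal dynamics.
Identify the gaps between current RSVLMs and the requirements of geospatial tasks to guide future research.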

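Proposed method

Defines a hierarchical taxonomy of 28 geospatial tasks grouped into perception, reasoning, and robustness.
Uses four output formats to cover the different task types: multiple-choice answers, open-ended text, bounding boxes, and masks.
Collects and curates 9,275 images (including JL-1 imagery) and 44,210 manually verified instructions to reduce bias and keep the scenarios realistic.
Adopts a blind test protocol and a quintuple semantic consistency requirement to separate visual grounding from language priors.
Builds tasks in both a dataset-driven and a task-driven manner, with manual verification by experts.
Evaluates 19 models in the zero-shot setting, covering contrastive learning-based encoders, general-purpose closed-source and open-source VLMs, and RSVLMs.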
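The blind test and the consistency requirement can be illustrated with a small scoring sketch. The review does not spell out how OmniEarth computes either one, so everything here is an assumption made for illustration: the `MCQItem` record, the `predict` callable, scoring the blind test as accuracy with the image withheld, and counting an item as consistent only when every paraphrased variant of the question is answered correctly.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass(frozen=True)
class MCQItem:
    """Hypothetical record: one image, paraphrased prompts, one gold option."""
    image_id: str
    variants: Sequence[str]  # paraphrases of the same question (assumed, e.g. five)
    answer: str

# Hypothetical model interface: (image or None, prompt) -> chosen option.
Predict = Callable[[Optional[str], str], str]

def accuracy(items: Sequence[MCQItem], predict: Predict, blind: bool = False) -> float:
    """Accuracy on the first variant; blind=True withholds the image."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for item in items:
        image = None if blind else item.image_id
        if predict(image, item.variants[0]) == item.answer:
            correct += 1
    return correct / len(items)

def consistent_accuracy(items: Sequence[MCQItem], predict: Predict) -> float:
    """An item counts only if all of its variants are answered correctly."""
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(
        all(predict(item.image_id, v) == item.answer for v in item.variants)
        for item in items
    )
    return correct / len(items)

def blind_gap(items: Sequence[MCQItem], predict: Predict) -> float:
    """Accuracy with the image minus accuracy without it."""
    return accuracy(items, predict) - accuracy(items, predict, blind=True)
```

Under these assumptions, a small `blind_gap` suggests the model answers mostly from language priors, while a large drop from `accuracy` to `consistent_accuracy` suggests its answers are sensitive to how the question is phrased.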

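Experimental results

Research questions

RQ1: Can modern VLMs effectively ground their answers in visual evidence on geospatial tasks, or do they rely on language priors?
RQ2: How do RSVLMs perform on fine-grained perception tasks that require localization and segmentation?
RQ3: What are the limitations of RSVLMs on temporal and domain-specific reasoning tasks?
RQ4: How robust are RSVLMs to degraded imagery and cross-modal inputs (e.g., RGB–SAR)?
RQ5: What gaps must be closed to improve grounding, consistency, and geospatial reasoning?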

Key results

Method	Perception (12 Tasks)	Reasoning (12 Tasks)	Robustness (4 Tasks)	(unlabeled in source)	Notes
Specialized Encoders – SkyCLIP-ViT-B	8.4	24.1	25.0	-	-
Specialized Encoders – RemoteCLIP-ViT-B	49.8	84.8	45.7	-	-
Specialized Encoders – GeoRSCLIP	72.6	81.1	54.3	-	-
General Closed-source – GLM-4.6V	60.1	70.9	67.3	120.4	-
General Closed-source – Claude-sonnet-4	62.7	81.3	82.9	137.3	-
General Closed-source – Gemini-2.0-Flash	71.3	82.8	85.5	150.7	-
General Closed-source – GPT-4o	65.8	89.3	87.1	151.9	-
Open-source General – Qwen2.5-VL-72B	59.8	80.5	75.5	80.0	-

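Existing VLMs handle some image-level perception tasks well but struggle with fine-grained localization and segmentation.
Reasoning ability is limited, especially on temporal and domain-specific tasks.
Model robustness is weak on degraded imagery and on unseen modalities (e.g., RGB–SAR).
In the blind evaluation, many RSVLMs tend to rely on the text rather than on visual evidence, which suggests weak grounding.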
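Localization and segmentation outputs are commonly scored by intersection over union (IoU) between the prediction and the ground truth. The review does not state which metric OmniEarth uses for its bounding box and mask tasks, so the sketch below is only an illustration of that common choice. The function names, the `(x_min, y_min, x_max, y_max)` box layout, and the nested-list 0/1 mask layout are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)

def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes; returns 0.0 when the union is empty."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mask_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """IoU of two binary masks given as equally shaped nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter / union if union > 0 else 0.0
```

A prediction is often counted as correct when its IoU exceeds a fixed threshold such as 0.5; whether OmniEarth applies a threshold, and which one, is not given in this review.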


This review was generated by AI and checked by a human editor.