QUICK REVIEW

[Paper Review] OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

Ronghao Fu, Haoran Liu|arXiv (Cornell University)|Mar 10, 2026

Multimodal Machine Learning Applications | Citations: 0

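One-Sentence Summary

OmniEarth introduces a bias-aware benchmark of 28 tasks that evaluates remote sensing vision-language models (RSVLMs) on perception, reasoning, and robustness in geospatial settings, and uses 9,275 images and 44,210 instructions to evaluate 9 models in a zero-shot setting.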

ABSTRACT

Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.

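Motivation and Objectives

Evaluate the perception, reasoning, and robustness capabilities of RSVLMs in Earth observation contexts.
Provide a fine-grained, bias-aware benchmark that covers multi-source geospatial data and temporal dynamics.
Identify the gaps between current RSVLMs and the requirements of geospatial tasks to guide future research.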

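Proposed Method

Define a hierarchical taxonomy that groups 28 geospatial tasks into perception, reasoning, and robustness.
Support two task formulations, multiple-choice VQA and open-ended VQA; the open-ended outputs are text (captioning), bounding boxes (visual grounding), and masks (segmentation).
Collect and curate 9,275 quality-controlled images (including Jilin-1 (JL-1) satellite imagery) and 44,210 manually verified instructions to reduce bias and keep the scenarios realistic.
Adopt a blind test protocol and a quintuple semantic consistency requirement to separate visual grounding from linguistic priors.
Construct tasks with dataset-driven and task-driven approaches, each followed by manual verification by experts.
Evaluate contrastive learning-based models, general closed-source and open-source VLMs, and RSVLMs in a zero-shot setting (19 models).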
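
The blind test and consistency check listed above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the record layout (`group`, `image`, `question`, `answer`), the `model(image, question)` callable, and the reading of "quintuple semantic consistency" as "every paraphrased variant in a group must be answered correctly" are all assumptions, since this review does not specify them.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# Hypothetical record layout; the real OmniEarth schema may differ.
Item = Dict[str, str]
Model = Callable[[Optional[str], str], str]


def accuracy(items: List[Item], model: Model, blind: bool = False) -> float:
    """Fraction of items answered correctly; blind=True withholds the image."""
    if not items:
        return 0.0
    hits = sum(
        model(None if blind else it["image"], it["question"]) == it["answer"]
        for it in items
    )
    return hits / len(items)


def consistent_accuracy(items: List[Item], model: Model) -> float:
    """A group of paraphrased questions counts only if all variants are correct."""
    groups = defaultdict(list)
    for it in items:
        groups[it["group"]].append(
            model(it["image"], it["question"]) == it["answer"]
        )
    if not groups:
        return 0.0
    return sum(all(flags) for flags in groups.values()) / len(groups)


def text_only_model(image: Optional[str], question: str) -> str:
    """Toy model that ignores the image and always picks option 'A'."""
    return "A"


# Toy data: two groups with two variants each (the paper's groups would have five).
items = [
    {"group": "g1", "image": "img1", "question": "q1a", "answer": "A"},
    {"group": "g1", "image": "img1", "question": "q1b", "answer": "B"},
    {"group": "g2", "image": "img2", "question": "q2a", "answer": "A"},
    {"group": "g2", "image": "img2", "question": "q2b", "answer": "A"},
]

with_image = accuracy(items, text_only_model)
without_image = accuracy(items, text_only_model, blind=True)
strict = consistent_accuracy(items, text_only_model)
```

In this toy setup the accuracy is identical with and without the image, which is the signature of a model leaning on linguistic priors, while the stricter group-level score is lower than plain accuracy because one variant in `g1` is answered incorrectly.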

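Experimental Results

Research Questions

RQ1: Can current VLMs ground their answers in visual evidence on geospatial tasks, or do they rely on linguistic priors?
RQ2: How do RSVLMs perform on fine-grained perception tasks that require localization and segmentation?
RQ3: What are the limitations of RSVLMs on temporal and domain-specific reasoning tasks?
RQ4: How robust are RSVLMs to degraded images and cross-modal inputs (e.g., RGB–SAR)?
RQ5: What gaps must be closed to improve the grounding, consistency, and geospatial reasoning of RSVLMs?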

Key Findings

Method	Perception (12 Tasks)	Reasoning (12 Tasks)	Robustness (4 Tasks)	Notes
Specialized Encoders – SkyCLIP-ViT-B	8.4	24.1	25.0	-	-
Specialized Encoders – RemoteCLIP-ViT-B	49.8	84.8	45.7	-	-
Specialized Encoders – GeoRSCLIP	72.6	81.1	54.3	-	-
General Closed-source – GLM-4.6V	60.1	70.9	67.3	120.4	-
General Closed-source – Claude-sonnet-4	62.7	81.3	82.9	137.3	-
General Closed-source – Gemini-2.0-Flash	71.3	82.8	85.5	150.7	-
General Closed-source – GPT-4o	65.8	89.3	87.1	151.9	-
Open-source General – Qwen2.5-VL-72B	59.8	80.5	75.5	80.0	-

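Existing VLMs perform well on image-level perception but struggle with fine-grained localization and segmentation.
Reasoning ability is limited, especially on temporal and domain-specific tasks.
Model robustness is low on degraded images and unseen modalities (e.g., cross-modal RGB–SAR).
Under blind evaluation, many RSVLMs rely on text rather than visual evidence, which indicates weak visual grounding.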
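
To make the localization and segmentation finding concrete, the sketch below scores bounding-box and mask outputs with intersection over union (IoU). This review does not say which metric OmniEarth uses for grounding or segmentation, so IoU is an assumption here (a common choice for such tasks), and the function names are illustrative.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes; returns 0.0 when the union is empty."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]) -> float:
    """IoU of two equally sized binary masks given as nested 0/1 sequences."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0


# Two 2x2 boxes offset by one unit overlap in a 1x1 square: IoU = 1 / 7.
overlap_score = box_iou((0, 0, 2, 2), (1, 1, 3, 3))

# The prediction covers one of the two ground-truth pixels: IoU = 1 / 2.
mask_score = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

A model can name the right object at image level and still score low here, because IoU penalizes boxes or masks that are only roughly in the right place.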


This review was generated by AI and checked by a human editor.