QUICK REVIEW

[Paper Digest] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Aadi Palnitkar, Mengyun Mao|arXiv (Cornell University)|Jan 29, 2026

Multimodal Machine Learning Applications | Cited by 0

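One-Sentence Summary

MilSCORE introduces a scenario-based, multi-hop geospatial benchmark built on military operation orders (OPORDs) to evaluate the long-context reasoning of vision-language models; baseline results show that existing models struggle with high-level planning tasks.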

ABSTRACT

As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.

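Motivation and Objectives

Evaluate how vision-language models (VLMs) perform on long-context geospatial planning tasks derived from realistic military OPORD scenarios.
Assess how well models integrate multi-source, multimodal information such as maps, GeoJSON overlays, and textual orders.
Characterize the failure modes and practical limits of contemporary VLMs in scenario-level military decision-making.
Provide an evaluation protocol and baseline results to guide future improvements in long-context reasoning and spatial grounding.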

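Proposed Method

Introduce MilSCORE, a multimodal dataset of expert-authored, multi-hop questions over 50 operational maps drawn from a realistic training OPORD scenario.
Organize tasks into three difficulty tiers and seven spatial-analysis categories, inspired by ESRI's spatial analysis taxonomy.
Use a tool-calling chain-of-thought agent that retrieves and inspects sources (maps, OPORD PDFs, spreadsheets) within a fixed budget of 10 reasoning steps.
Evaluate production VLMs (GPT-4o, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 2.5 Flash) in both zero-shot and chain-of-thought modes.
Score the final boxed answer with a normalized substring-containment metric, and count tool errors as incorrect answers.
Provide an evaluation protocol built around a ReAct-style tool-use loop, including image integration via base64 payloads and a cap on the number of pages.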
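The scoring rule in the list above (normalized substring containment on the final boxed answer, with tool errors counted as incorrect) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `\boxed{...}` answer format, the normalization steps (lowercasing, punctuation stripping, whitespace collapsing), the direction of the containment check, and every function name are assumptions.

```python
import re
import string

# Translation table that deletes ASCII punctuation (assumed normalization step).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def extract_boxed(text):
    """Return the content of the last \\boxed{...} span, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def score(model_output, gold, tool_error=False):
    """Return True if the normalized gold answer occurs in the boxed answer."""
    if tool_error:
        # Tool errors are counted as incorrect, as the digest states.
        return False
    answer = extract_boxed(model_output)
    if answer is None:
        # No boxed answer found: assumed to count as incorrect.
        return False
    gold_norm = normalize(gold)
    if not gold_norm:
        # Guard: an empty gold string would otherwise match everything.
        return False
    return gold_norm in normalize(answer)
```

Stripping punctuation also removes decimal points, so a real scorer for numeric answers would likely need a gentler normalization; the sketch keeps the rule simple to show the containment idea.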

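Experimental Results

Research Questions

RQ1: Can contemporary VLMs perform long-context, scenario-level military geospatial reasoning over heterogeneous, multi-source information?
RQ2: How does model capability differ across MilSCORE tasks, from Tier 1 (single-hop) to Tier 3 (cross-source multi-hop)?
RQ3: What are the common failure modes of frontier VLMs on long-context geospatial planning tasks?
RQ4: Do tool use and chain-of-thought prompting improve grounding and accuracy on MilSCORE tasks?
RQ5: Which baseline performance levels can guide the future development of more reliable, spatially aware long-context models?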

Key Findings

Model	Tier 1	Tier 2	Tier 3	Total
GPT-4o	8 (40%)	12 (60%)	15 (75%)	35 (58.3%)
Claude Sonnet 4.5	6 (30%)	7 (35%)	0 (0%)	13 (21.7%)
Claude Haiku 4.5	6 (30%)	4 (20%)	6 (30%)	16 (26.7%)
Gemini 2.5 Flash	8 (40%)	9 (45%)	14 (70%)	31 (51.7%)
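The percentages in the table can be re-derived from the raw counts. The short check below assumes 20 questions per tier, which is inferred from the table itself (8 correct is reported as 40%) and matches the 60-question total; the variable and function names are illustrative.

```python
# Correct-answer counts per tier (Tier 1, Tier 2, Tier 3), copied from the table.
counts = {
    "GPT-4o": (8, 12, 15),
    "Claude Sonnet 4.5": (6, 7, 0),
    "Claude Haiku 4.5": (6, 4, 6),
    "Gemini 2.5 Flash": (8, 9, 14),
}

TIER_SIZE = 20  # assumption inferred from the table: 8 correct == 40%


def summarize(counts, tier_size=TIER_SIZE):
    """Compute per-tier and overall accuracy (in percent) for each model."""
    summary = {}
    for model, per_tier in counts.items():
        total = sum(per_tier)
        n_questions = tier_size * len(per_tier)
        summary[model] = {
            "tier_pct": [round(100 * c / tier_size, 1) for c in per_tier],
            "total": total,
            "total_pct": round(100 * total / n_questions, 1),
        }
    return summary


summary = summarize(counts)
for model, row in summary.items():
    print(model, row["tier_pct"], row["total"], row["total_pct"])
```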

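GPT-4o leads in overall accuracy on MilSCORE's 60 questions (35 correct, 58.3%) and is strongest on Tier 3 (75% correct).
GPT-4o and Gemini 2.5 Flash score higher on Tier 3 than on Tier 1 or Tier 2, whereas Claude Sonnet 4.5 answers no Tier 3 question correctly (0%); performance is constrained by the iteration budget and long reasoning loops.
Claude Haiku 4.5 answers 16 questions correctly overall (26.7%), with uneven performance across tiers (Tier 1 30%, Tier 2 20%, Tier 3 30%).
Across all models, Tier 3 cross-source multi-hop questions produce the widest spread in accuracy (0% to 75%), which reflects how hard it is to integrate maps, GeoJSON, and orders.
The results show that current systems still have substantial headroom on a realistic, scenario-level military planning benchmark.
The evaluation protocol emphasizes practical constraints, such as a limited tool-use budget, and the need for efficient, grounded reasoning over long-context data.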
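To make the tool-use budget concrete, here is a minimal sketch of a ReAct-style loop capped at the 10 steps mentioned in the method description. It is not the paper's harness: the `policy` callable stands in for a VLM call, the `read_page` tool and the demo question are hypothetical, and ending the episode on the first tool error is an assumption about how such errors come to be scored as incorrect.

```python
MAX_STEPS = 10  # step budget stated in the method description


def run_agent(question, policy, tools, max_steps=MAX_STEPS):
    """Run a bounded tool-use loop.

    policy(question, history) must return either ("answer", text) or
    ("tool", tool_name, argument). Returns (answer, status, history).
    """
    history = []
    for _ in range(max_steps):
        action = policy(question, history)
        if action[0] == "answer":
            return action[1], "answered", history
        _, name, arg = action
        try:
            observation = tools[name](arg)
        except Exception as exc:
            # Assumption: a failed tool call ends the episode without an answer.
            history.append((name, arg, f"error: {exc!r}"))
            return None, "tool_error", history
        history.append((name, arg, observation))
    # Budget used up before the policy produced a final answer.
    return None, "budget_exhausted", history


# Hypothetical stand-ins for a model and a document-reading tool.
def demo_policy(question, history):
    if len(history) < 2:
        return ("tool", "read_page", len(history) + 1)
    return ("answer", r"\boxed{Phase Line BLUE}")


demo_tools = {"read_page": lambda page: f"text of page {page}"}

answer, status, trace = run_agent(
    "Which phase line bounds the objective?", demo_policy, demo_tools
)
print(status, answer, len(trace))
```

A policy that never commits to an answer exhausts the budget and returns no answer, which is one way long reasoning loops can translate into zero credit under a fixed step cap.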

This digest was generated by AI and reviewed by a human editor.