[Paper Review] GPSBench: Do Large Language Models Understand GPS Coordinates?
GPSBench presents a large-scale benchmark (57,800 samples across 17 tasks) to evaluate intrinsic geospatial reasoning in 14 LLMs, showing strength in basic GPS computations but weaknesses in fine-grained geographic localization and spherical geometry, with geographic knowledge degrading at finer granularity.
Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, or mapping, making robust geospatial reasoning a critical capability. Despite this, LLMs' ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve performance on downstream geospatial tasks, and that finetuning induces trade-offs between gains in geometric computation and degradation in world knowledge. Our dataset and reproducible code are available at https://github.com/joey234/gpsbench
Motivation & Objective
- Assess intrinsic geospatial reasoning capabilities of LLMs without tool use.
- Evaluate performance on geometric coordinate operations vs. applied geographic reasoning.
- Analyze geographic granularity and robustness to coordinate noise.
- Investigate effects of GPS augmentation and fine-tuning on downstream GPS tasks.
Proposed method
- Introduce GPSBench with 57,800 samples across 17 tasks (Pure GPS and Applied tracks).
- Ground truth computed via geodetic formulae on the WGS84 ellipsoid and GeoNames-derived data.
- Evaluate 14 state-of-the-art LLMs in zero-shot prompting without chain-of-thought or few-shot examples.
- Use accuracy for multiple-choice tasks and 1−MAPE for numerical tasks as unified metrics.
- Analyze regional and granularity-based performance, robustness to coordinate noise, and effects of augmentation/fine-tuning.
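To make the distance, bearing, and 1−MAPE items above concrete, here is a minimal Python sketch. It is not the benchmark's code: the review says ground truth comes from geodetic formulae on the WGS84 ellipsoid, whereas this sketch swaps in the simpler spherical haversine and great-circle bearing formulas, so its values will differ slightly from ellipsoidal results. The function names and the mean Earth radius constant are illustrative assumptions, and the review does not say whether the benchmark clips negative 1−MAPE scores, so no clipping is applied here.

```python
import math

# Assumption: mean Earth radius for a spherical model (not the WGS84 ellipsoid).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


def one_minus_mape(predictions, targets):
    """1 - mean absolute percentage error; targets must be non-zero."""
    errors = [abs(p - t) / abs(t) for p, t in zip(predictions, targets)]
    return 1.0 - sum(errors) / len(errors)


if __name__ == "__main__":
    truth = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)  # Paris to London
    model_answer = 350.0  # hypothetical model output in km
    print(round(truth, 1), round(one_minus_mape([model_answer], [truth]), 3))
```

Under this formulation a numerical answer scores close to 1 when it is near the ground truth, which places it on the same scale as multiple-choice accuracy.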
Experimental results
Research questions
- RQ1: How capable are current LLMs at intrinsic GPS coordinate computations (distance, bearing, transformations) and applied geographic reasoning?
- RQ2: How does geographic granularity (country vs. province/state vs. city) affect performance?
- RQ3: Can augmenting prompts with GPS coordinates improve downstream spatial reasoning benchmarks?
- RQ4: What is the impact of finetuning on GPS reasoning compared to zero-shot performance?
- RQ5: How does model scale influence GPS reasoning capabilities?
Key findings
- Models show higher performance on applied geographic reasoning than pure GPS computations overall.
- GPT-5.1 achieves 84.4% Pure GPS accuracy; Applied accuracy is highest for GPT-5-mini (74.1%) and Gemini-2.5-Pro (73.4%).
- Geographic knowledge degrades hierarchically: country-level accuracy is high, city-level accuracy is often below 25%.
- Performance is robust to coordinate noise, which suggests genuine coordinate understanding rather than memorization: accuracy under noise is ~79–82% at country level, 46–52% at province level, and 6–9% at city level.
- GPS augmentation improves downstream tasks (MapEval +6.1%, Hierarchical Spatial +22.7%), while finetuning improves geometric computation but hurts world-knowledge tasks.
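The coordinate-noise finding can be illustrated with a small Python sketch of how such a robustness check might be set up. The review does not describe the paper's noise model, so the uniform offset in degrees, the helper names `perturb` and `accuracy`, and the example labels are all assumptions made for illustration.

```python
import random


def perturb(lat, lon, max_offset_deg, rng):
    """Shift a coordinate by a uniform random offset (assumed noise model).

    Latitude is clamped to [-90, 90]; longitude is wrapped into [-180, 180).
    """
    new_lat = lat + rng.uniform(-max_offset_deg, max_offset_deg)
    new_lat = max(-90.0, min(90.0, new_lat))
    new_lon = lon + rng.uniform(-max_offset_deg, max_offset_deg)
    new_lon = (new_lon + 180.0) % 360.0 - 180.0
    return new_lat, new_lon


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the perturbation is reproducible
    noisy = perturb(48.8566, 2.3522, 0.01, rng)
    print(noisy)
    # Hypothetical country-level predictions for clean vs. noisy prompts.
    print(accuracy(["FR", "DE"], ["FR", "DE"]), accuracy(["FR", "IT"], ["FR", "DE"]))
```

Comparing accuracy on clean and perturbed coordinates at each granularity level is one way to check whether a model relies on memorized coordinate strings: if it did, small offsets would be expected to cause a sharp drop.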
This review was created by AI and reviewed by human editors.