QUICK REVIEW

[Paper Review] GPSBench: Do Large Language Models Understand GPS Coordinates?

Thinh Hung Truong, Jey Han Lau|arXiv (Cornell University)|2026. 02. 18.

Multimodal Machine Learning Applications · Citations: 0

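One-Line Summary

GPSBench is a large-scale benchmark of 57,800 samples across 17 tasks for evaluating the intrinsic geospatial reasoning of 14 LLMs. The models are strong at basic GPS calculations but weak at fine-grained geographic localization and spherical geometry, and their geographic knowledge degrades at finer levels of granularity.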

ABSTRACT

Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, or mapping, making robust geospatial reasoning a critical capability. Despite that, LLMs' ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve in downstream geospatial tasks, and that finetuning induces trade-offs between gains in geometric computation and degradation in world knowledge. Our dataset and reproducible code are available at https://github.com/joey234/gpsbench

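Research Motivation and Goals

Evaluate the intrinsic geospatial reasoning ability of LLMs without tool use.
Compare performance on geometric coordinate operations against applied geographic reasoning.
Analyze the effect of geographic granularity and robustness to coordinate noise.
Investigate how GPS augmentation and fine-tuning affect downstream GPS tasks.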

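Proposed Method

Introduce GPSBench with 57,800 samples and 17 tasks, split into a Pure GPS track and an Applied track.
Compute ground truth using standard geometric formulas on the WGS84 ellipsoid and GeoNames-derived data.
Evaluate 14 state-of-the-art LLMs with zero-shot prompts, without chain-of-thought or few-shot examples.
Report a single metric: accuracy for multiple-choice questions and 1−MAPE for numeric questions.
Analyze performance by region and granularity, robustness to coordinate noise, and the effects of augmentation and fine-tuning.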
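The geometric ground truth and the numeric metric described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a spherical Earth (haversine formula with a mean radius) rather than the WGS84 ellipsoid named in the review, and the function names `haversine_km`, `initial_bearing_deg`, and `one_minus_mape` are hypothetical. How GPSBench handles scores that fall below zero is not stated here, so the sketch leaves the value unclipped.

```python
import math

# Assumption: mean Earth radius on a sphere; the benchmark itself uses the WGS84 ellipsoid.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def one_minus_mape(predictions, references):
    """1 - mean absolute percentage error; references must be non-zero. Not clipped."""
    errors = [abs(p - r) / abs(r) for p, r in zip(predictions, references)]
    return 1.0 - sum(errors) / len(errors)
```

For example, `one_minus_mape([90, 110], [100, 100])` scores two predictions that are each 10% off the reference, and `haversine_km(0, 0, 0, 1)` gives the length of one degree of longitude along the equator under the spherical assumption.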

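Experimental Results

Research Questions

RQ1: How capable are current LLMs at intrinsic GPS coordinate computation (distance, bearing, conversion) and at applied geographic reasoning?
RQ2: How does geographic granularity (country vs. state/province vs. city) affect performance?
RQ3: Can augmenting prompts with GPS coordinates improve downstream spatial reasoning benchmarks?
RQ4: How does fine-tuning affect GPS reasoning relative to zero-shot performance?
RQ5: How does model scale affect GPS reasoning ability?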

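Key Results

Models generally perform better on applied geographic reasoning than on pure GPS computation.
GPT-5.1 reaches 84.4% accuracy on pure GPS tasks, while applied accuracy is highest for GPT-5-mini (74.1%) and Gemini-2.5-Pro (73.4%).
Geographic knowledge degrades hierarchically: country-level accuracy is high, but city-level accuracy is often below 25%.
Under coordinate noise, accuracy stays at roughly 79–82% at the country level, 46–52% at the state/province level, and 6–9% at the city level, which suggests genuine coordinate understanding rather than memorization.
GPS augmentation improves downstream tasks (MapEval +6.1%, hierarchical spatial +22.7%), while fine-tuning improves geometric computation but harms world knowledge tasks.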
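A coordinate-noise robustness check of the kind summarized above could be set up along the following lines. This is only a sketch under stated assumptions: the review does not describe the paper's noise model, so Gaussian noise in degrees, the `sigma_deg` parameter, and the helper name `perturb_coordinate` are all hypothetical.

```python
import random


def perturb_coordinate(lat, lon, sigma_deg, rng):
    """Add zero-mean Gaussian noise (in degrees) to a coordinate.

    Assumption: the noise model is illustrative, not the one used in GPSBench.
    Latitude is clamped to [-90, 90]; longitude is wrapped into [-180, 180).
    """
    noisy_lat = lat + rng.gauss(0.0, sigma_deg)
    noisy_lon = lon + rng.gauss(0.0, sigma_deg)
    noisy_lat = max(-90.0, min(90.0, noisy_lat))
    noisy_lon = (noisy_lon + 180.0) % 360.0 - 180.0
    return noisy_lat, noisy_lon


# Usage: a seeded generator keeps the perturbed evaluation set reproducible.
rng = random.Random(0)
noisy_lat, noisy_lon = perturb_coordinate(37.5665, 126.9780, 0.01, rng)
```

Comparing a model's accuracy on the original and perturbed coordinates at each granularity level would then show whether its answers depend on exact memorized values.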


This review was generated by AI and checked by a human editor.