QUICK REVIEW

[Paper Review] Towards Governance-Oriented Low-Altitude Intelligence: A Management-Centric Multi-Modal Benchmark With Implicitly Coordinated Vision-Language Reasoning Framework

Hao Chang, Zhihui Wang|arXiv (Cornell University)|Jan 27, 2026

Multimodal Machine Learning Applications | Cited by 0

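One-Sentence Summary

Introduces GovLA-10K, a management-oriented multi-modal benchmark for low-altitude governance, and GovLA-Reasoner, an implicit feature-adapter framework that coordinates visual grounding with a large language model (LLM) to produce governance-oriented captions without fine-tuning the detector or the LLM.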

ABSTRACT

Low-altitude vision systems are becoming a critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines are still difficult to support management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate the fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance while avoiding the need of fine-tuning for any task-specific individual components. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.

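Motivation and Goals

Shift low-altitude perception from exhaustive object recognition to selective understanding of governance-relevant anomalies.
Provide a benchmark (GovLA-10K) focused on functionally salient targets relevant to urban governance.
Develop a unified reasoning framework (GovLA-Reasoner) that tightly couples grounding and language without fine-tuning any component.
Produce actionable management suggestions grounded in visual evidence and governance rules.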

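Proposed Method

GovLA-10K is collected from public drone imagery and in-house flight data; after filtering, 10,572 high-quality images remain.
Nine functionally salient categories (e.g., illegally parked vehicles, construction waste, ground garbage) reflect governance needs.
Two-stage semi-automatic annotation: manual bounding boxes and category labels, followed by detector-assisted verification using MMGroundingDINO (IoU threshold 0.5) and VLM-generated captions.
Structured prompts are used to generate contextual captions that carry governance relevance and management suggestions; experts then review them for governance relevance and accuracy.
GovLA-Reasoner introduces a lightweight feature adapter that compresses and aggregates the grounding features (F_img, F_query, F_decoder) and feeds them to the LLM for end-to-end reasoning.
Adapter training is lightweight and task-specific: only the adapter is trained, while the detector and the LLM remain frozen.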
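
The detector-assisted verification step in the annotation pipeline can be illustrated with a minimal sketch. It assumes boxes in (x1, y1, x2, y2) pixel format and treats a manual box as confirmed when a detector box of the same class overlaps it with IoU of at least 0.5. The function names, the box format, and the same-class requirement are assumptions made for illustration; the review only states that the detector is used for verification at an IoU threshold of 0.5.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed format
Annotation = Tuple[Box, str]             # (box, category label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def flag_for_review(manual: List[Annotation],
                    detected: List[Annotation],
                    threshold: float = 0.5) -> List[int]:
    """Hypothetical helper: indices of manual boxes that no same-class
    detector box confirms at IoU >= threshold."""
    flagged = []
    for i, (box, label) in enumerate(manual):
        confirmed = any(
            d_label == label and iou(box, d_box) >= threshold
            for d_box, d_label in detected
        )
        if not confirmed:
            flagged.append(i)
    return flagged
```

In a workflow like this, flagged annotations would go back to a human annotator rather than being discarded automatically.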

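Experimental Results

Research Questions

RQ1: What is the value of a management-oriented multi-modal benchmark for low-altitude governance tasks?
RQ2: Can a unified vision-language reasoning framework with an implicit feature adapter improve governance-focused captioning without fine-tuning the detector or the LLM?
RQ3: How do functionally salient, governance-driven targets compare with exhaustive object annotation in low-altitude urban scenes?
RQ4: Does implicit coordination between grounding features and language reasoning reduce information loss and error accumulation in VLM-based pipelines?

Key Findings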

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr-D
LLaVA-OneVision-1.5-4B	36.27	21.24	12.56	7.61	19.10	25.36	4.84
LLaVA-OneVision-1.5-8B	30.61	18.28	10.86	6.78	17.25	24.82	2.69
InternVL3-8B	31.72	17.27	9.14	5.14	17.68	22.39	2.72
InternVL3.5-4B	31.01	17.01	9.34	5.28	17.33	22.61	2.71
InternVL3.5-8B	34.56	18.82	10.06	5.64	18.28	22.44	3.01
Qwen2.5-VL-3B	37.17	21.25	12.71	8.04	19.21	25.01	5.26
Qwen2.5-VL-7B	36.15	21.51	13.20	8.65	19.54	25.63	5.07
Qwen3-VL-4B	45.77	27.72	17.41	11.36	23.25	28.92	10.22
Qwen3-VL-8B	40.88	25.64	16.54	10.97	21.73	29.44	10.21
GovLA-Reasoner (ours)	53.32	37.10	26.98	20.31	26.63	37.97	20.31
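
To make the table easier to read, the following sketch computes GovLA-Reasoner's absolute margin over the strongest baseline on each metric. All scores are copied from the table above; only the helper name `margins_over_best_baseline` is introduced here for illustration.

```python
METRICS = ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4", "METEOR", "ROUGE-L", "CIDEr-D"]

# Baseline scores, in the column order of METRICS.
BASELINES = {
    "LLaVA-OneVision-1.5-4B": [36.27, 21.24, 12.56, 7.61, 19.10, 25.36, 4.84],
    "LLaVA-OneVision-1.5-8B": [30.61, 18.28, 10.86, 6.78, 17.25, 24.82, 2.69],
    "InternVL3-8B": [31.72, 17.27, 9.14, 5.14, 17.68, 22.39, 2.72],
    "InternVL3.5-4B": [31.01, 17.01, 9.34, 5.28, 17.33, 22.61, 2.71],
    "InternVL3.5-8B": [34.56, 18.82, 10.06, 5.64, 18.28, 22.44, 3.01],
    "Qwen2.5-VL-3B": [37.17, 21.25, 12.71, 8.04, 19.21, 25.01, 5.26],
    "Qwen2.5-VL-7B": [36.15, 21.51, 13.20, 8.65, 19.54, 25.63, 5.07],
    "Qwen3-VL-4B": [45.77, 27.72, 17.41, 11.36, 23.25, 28.92, 10.22],
    "Qwen3-VL-8B": [40.88, 25.64, 16.54, 10.97, 21.73, 29.44, 10.21],
}
OURS = [53.32, 37.10, 26.98, 20.31, 26.63, 37.97, 20.31]


def margins_over_best_baseline():
    """Map each metric to (strongest baseline, GovLA-Reasoner score minus its score)."""
    result = {}
    for i, metric in enumerate(METRICS):
        best_model, best_scores = max(BASELINES.items(), key=lambda kv: kv[1][i])
        result[metric] = (best_model, round(OURS[i] - best_scores[i], 2))
    return result


if __name__ == "__main__":
    for metric, (model, margin) in margins_over_best_baseline().items():
        print(f"{metric}: +{margin} over {model}")
```

On these numbers the strongest baseline is a Qwen3-VL model for every metric, and the margin is positive in every column.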

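GovLA-10K focuses on 9 governance-relevant target categories and provides aligned scene captions and management suggestions.
Grounding based on MMGrounding-DINO achieves strong detection performance, highlighting the value of text-guided grounding for governance tasks.
GovLA-Reasoner substantially improves captioning metrics over mainstream VLM baselines: with a 4B LLM it reaches 53.32 BLEU-1 and 20.31 CIDEr-D, the best results reported in the comparison.
The feature-adapter approach conditions the LLM on visual features end to end without fine-tuning the detector or the LLM, which yields better efficiency and performance.
Ablation studies show that the adapter is necessary, that using all three input feature groups (F_img, F_query, F_decoder) gives the best results, and that a two-layer Transformer in the adapter offers the best cost-performance trade-off.
GovLA-Reasoner outperforms larger models on several metrics, reflecting parameter efficiency and effective implicit coordination.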
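
The adapter idea summarized in these findings can be sketched as follows. This is not the authors' implementation: the use of PyTorch, the learnable pooling tokens used for compression, every dimension, and the class name `FeatureAdapterSketch` are assumptions chosen for illustration. What is taken from the review is only the overall shape: three detector feature groups (F_img, F_query, F_decoder) go in, a two-layer Transformer aggregates them, and the result is handed to a frozen LLM while only the adapter is trained.

```python
import torch
import torch.nn as nn


class FeatureAdapterSketch(nn.Module):
    """Hypothetical adapter: project three feature groups, compress them into a
    fixed number of tokens, refine with a small Transformer, map to LLM width."""

    def __init__(self, det_dim=256, llm_dim=2560, num_tokens=32,
                 num_layers=2, num_heads=8):
        super().__init__()
        # One projection per feature group (F_img, F_query, F_decoder).
        self.proj = nn.ModuleList([nn.Linear(det_dim, det_dim) for _ in range(3)])
        # Learnable tokens that pool the concatenated features
        # (an assumed compression scheme, not taken from the paper).
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, det_dim) * 0.02)
        self.pool = nn.MultiheadAttention(det_dim, num_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(
            det_dim, num_heads, dim_feedforward=4 * det_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_llm = nn.Linear(det_dim, llm_dim)

    def forward(self, f_img, f_query, f_decoder):
        # Each input: (batch, sequence_length, det_dim); lengths may differ.
        groups = (f_img, f_query, f_decoder)
        feats = torch.cat([p(f) for p, f in zip(self.proj, groups)], dim=1)
        queries = self.tokens.expand(feats.size(0), -1, -1)
        pooled, _ = self.pool(queries, feats, feats)  # compress to num_tokens
        return self.to_llm(self.encoder(pooled))      # (batch, num_tokens, llm_dim)


def freeze(module: nn.Module) -> None:
    """Disable gradients, as would be done for the detector and the LLM."""
    for p in module.parameters():
        p.requires_grad_(False)


if __name__ == "__main__":
    adapter = FeatureAdapterSketch()
    out = adapter(torch.randn(2, 100, 256), torch.randn(2, 20, 256),
                  torch.randn(2, 50, 256))
    print(out.shape)
```

In a training loop along these lines, `freeze` would be applied to the detector and the LLM, and the optimizer would receive only `adapter.parameters()`.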

This review was generated by AI and checked by a human editor.