QUICK REVIEW

[Paper Review] BeamVLM for Low-altitude Economy: Generative Beam Prediction via Vision-language Models

Chenran Kou, Changsheng You|arXiv (Cornell University)|Feb 23, 2026

UAV Applications and Optimization | Cited by 0

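ONE-SENTENCE SUMMARY

BeamVLM treats UAV beam prediction as a generative vision-language task, using a prompted, pre-trained vision-language model to reason jointly over the UAV trajectory and the environment, which improves accuracy and generalization.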

ABSTRACT

For low-altitude economy (LAE), fast and accurate beam prediction between high-mobility unmanned aerial vehicles (UAVs) and ground base stations is of paramount importance, which ensures seamless coverage and reliable communications. However, existing deep learning-based beam prediction methods lack high-level semantic understanding of dynamic environments, resulting in poor generalization. On the other hand, the emerging large language model (LLM) based approaches show promise in enhancing generalization, but they typically lack rich environmental perception, thereby failing to capture fine-grained spatial semantics essential for precise beam alignment. To tackle these limitations, we propose in this correspondence a novel end-to-end generative framework for beam prediction, called BeamVLM, which treats beam prediction as a vision question answering task capitalizing on powerful existing vision-language models (VLMs). By projecting raw visual patches directly into the language domain and judiciously designing an instructional prompt, the proposed BeamVLM enables the VLM to jointly reason over UAV trajectories and environmental context. Last, experimental results on real-world datasets demonstrate that the proposed BeamVLM outperforms state-of-the-art methods in prediction accuracy and also exhibits superior generalization for other scenarios such as vehicle-to-infrastructure (V2I) beam prediction.

RESEARCH MOTIVATION AND GOALS

Motivate fast, accurate beam prediction for high-mobility UAVs in low-altitude economy scenarios.
Address limited generalization of traditional DL-based beam predictors due to weak semantic scene understanding.
Leverage vision-language models to fuse raw visual context with task instructions for robust beam decisions.
Propose an end-to-end generative BeamVLM framework that outputs beam indices as language tokens.
Demonstrate generalization to vehicle-to-infrastructure scenarios beyond UAVs.

PROPOSED METHOD

Formulate beam prediction as a generative vision-language task using BeamVLM built on Qwen2.5-VL.
Project raw visual patches into language space to enable multimodal reasoning about UAV trajectory and environment.
Use an instructional prompt with dataset definition, task constraints, and contextual priors to guide beam generation.
Employ a Vision Transformer visual encoder and LoRA-based fine-tuning for efficient adaptation.
Train with teacher forcing to minimize cross-entropy between generated tokens and ground-truth beam indices.
Decode generated tokens into beam indices via de-tokenization, mapping each one to an entry of the beam codebook.
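The de-tokenization step at the end of this list can be illustrated with a minimal sketch. This review does not describe the exact output format, so the comma-separated text format, the 0-based indexing, the codebook size of 32, and the function name `decode_beam_indices` are all assumptions made for illustration, not the authors' implementation.

```python
import re


def decode_beam_indices(generated_text, codebook_size, horizon):
    """Map generated text to a list of beam indices, or None if it is invalid.

    Assumptions (not from the paper): the model emits indices as decimal
    numbers, indices are 0-based, and exactly `horizon` of them are expected.
    """
    # Pull out every run of digits, ignoring separators and stray words.
    indices = [int(tok) for tok in re.findall(r"\d+", generated_text)]
    # Reject outputs with the wrong number of predictions.
    if len(indices) != horizon:
        return None
    # Reject indices that do not correspond to a codebook entry.
    if any(i >= codebook_size for i in indices):
        return None
    return indices


# Hypothetical model output for the next five time steps (t+1 to t+5).
print(decode_beam_indices("17, 18, 18, 19, 20", codebook_size=32, horizon=5))
```

Returning `None` for malformed output makes it explicit that a generative model, unlike a classification head, can produce text that maps to no valid beam, so the caller needs a fallback.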

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a vision-language model with structured prompts improve beam prediction accuracy over conventional DL methods?
RQ2: Does multimodal reasoning over environmental context enhance generalization to new scenarios like V2I?
RQ3: What is the impact of prompt design on the accuracy of generated beam sequences?
RQ4: Is end-to-end generative BeamVLM scalable with LoRA on large VL models for beam prediction?
RQ5: How does BeamVLM perform under UAV and V2I settings compared to baselines?

KEY FINDINGS

Model	Total parameters	Trainable parameters	Runtime (s)
LSTM	104.4K	104.4K	7.2e-5
BeamLLM	178.3M	53.9M	2.3e-3
BeamVLM (Ours)	3.1B	42.2M	9.5e-2
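
To make the table easier to read, the short sketch below derives two quantities from it: the share of each model's parameters that is trainable, and BeamVLM's runtime relative to each baseline. The numbers are copied from the table; the helper names `trainable_fraction` and `runtime_ratio` are illustrative only.

```python
# Values copied from the table above:
# (total parameters, trainable parameters, runtime in seconds).
models = {
    "LSTM": (104.4e3, 104.4e3, 7.2e-5),
    "BeamLLM": (178.3e6, 53.9e6, 2.3e-3),
    "BeamVLM": (3.1e9, 42.2e6, 9.5e-2),
}


def trainable_fraction(name):
    """Fraction of a model's parameters that are updated during training."""
    total, trainable, _ = models[name]
    return trainable / total


def runtime_ratio(name, baseline):
    """How many times slower `name` is than `baseline`."""
    return models[name][2] / models[baseline][2]


for name in models:
    print(f"{name}: {100 * trainable_fraction(name):.1f}% trainable")
print(f"BeamVLM vs BeamLLM runtime: {runtime_ratio('BeamVLM', 'BeamLLM'):.0f}x")
print(f"BeamVLM vs LSTM runtime: {runtime_ratio('BeamVLM', 'LSTM'):.0f}x")
```

Read this way, only about 1.4% of BeamVLM's parameters are trainable, compared with roughly 30% for BeamLLM, while its runtime is about 41 times that of BeamLLM and roughly 1,300 times that of the LSTM.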

BeamVLM achieves Top-1 accuracy of 83.3% at t+1 and 71.4% at t+5 in UAV scenarios, outperforming LSTM by 10.8%.
BeamVLM maintains high Top-3 accuracy across horizons, outperforming BeamLLM and LSTM at t+5 (91.9% and 88.5%, respectively).
In V2I generalization, BeamVLM reaches 72.1% Top-1 at t+1 and 52.9% at t+5, outperforming baselines by up to 16.1% (Top-1) and 4% (Top-3).
Ablation shows removing prompt guidance degrades Top-1 accuracy by about 3.6–3.8 percentage points, confirming prompts’ importance.
BeamVLM demonstrates robust generalization and end-to-end generative beam prediction without handcrafted output heads.
BeamVLM incurs a higher runtime than both baselines in the table (9.5e-2 s versus 2.3e-3 s for BeamLLM and 7.2e-5 s for LSTM), the cost of running a 3.1B-parameter vision-language model.
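
The Top-1 and Top-3 figures above follow the usual Top-k definition: a prediction counts as correct when the true beam is among the k highest-ranked candidates. The sketch below shows that computation on made-up data. It assumes a ranked candidate list is available for each sample; how BeamVLM produces such a ranking from a generative model is not described in this review.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of samples whose true beam is among the first k candidates.

    ranked_predictions: one list per sample, best candidate first.
    ground_truth: the true beam index for each sample.
    """
    hits = sum(
        1
        for ranked, true_beam in zip(ranked_predictions, ground_truth)
        if true_beam in ranked[:k]
    )
    return hits / len(ground_truth)


# Made-up example: four samples, three ranked candidates each.
ranked = [[5, 4, 6], [9, 8, 10], [2, 3, 1], [7, 6, 8]]
truth = [5, 8, 1, 0]

print(top_k_accuracy(ranked, truth, k=1))  # 0.25
print(top_k_accuracy(ranked, truth, k=3))  # 0.75
```

Top-3 is always at least as high as Top-1 on the same data, which is why the two metrics are reported side by side: the gap shows how often the correct beam is a near miss rather than absent altogether.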

This review was generated by AI and reviewed by a human editor.