QUICK REVIEW

[Paper Review] MapViT: A Two-Stage ViT-Based Framework for Real-Time Radio Quality Map Prediction in Dynamic Environments

Cyril Shih-Huan Hsu, Xi Li|arXiv (Cornell University)|Jan 22, 2026

Wireless Signal Modulation Classification | Cited by 0

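One-Sentence Summary

MapViT introduces a two-stage Vision Transformer framework that predicts environmental changes and the resulting radio quality maps in dynamic environments, achieving real-time inference and data-efficient transfer through a geometry foundation model.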

ABSTRACT

Recent advancements in mobile and wireless networks are unlocking the full potential of robotic autonomy, enabling robots to take advantage of ultra-low latency, high data throughput, and ubiquitous connectivity. However, for robots to navigate and operate seamlessly, efficiently and reliably, they must have an accurate understanding of both their surrounding environment and the quality of radio signals. Achieving this in highly dynamic and ever-changing environments remains a challenging and largely unsolved problem. In this paper, we introduce MapViT, a two-stage Vision Transformer (ViT)-based framework inspired by the success of pre-train and fine-tune paradigm for Large Language Models (LLMs). MapViT is designed to predict both environmental changes and expected radio signal quality. We evaluate the framework using a set of representative Machine Learning (ML) models, analyzing their respective strengths and limitations across different scenarios. Experimental results demonstrate that the proposed two-stage pipeline enables real-time prediction, with the ViT-based implementation achieving a strong balance between accuracy and computational efficiency. This makes MapViT a promising solution for energy- and resource-constrained platforms such as mobile robots. Moreover, the geometry foundation model derived from the self-supervised pre-training stage improves data efficiency and transferability, enabling effective downstream predictions even with limited labeled data. Overall, this work lays the foundation for next-generation digital twin ecosystems, and it paves the way for a new class of ML foundation models driving multi-modal intelligence in future 6G-enabled systems.

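Motivation and Objectives

Address dynamic radio propagation in robotic systems by predicting environmental changes and the associated radio quality maps.
Adopt the two-stage pre-train and fine-tune paradigm used for large language models.
Develop a geometry foundation model from depth-map sequences to improve data efficiency and transferability.
Evaluate the accuracy and runtime of ViT against CNN and MLP on resource-constrained platforms.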

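Proposed Method

Stage 1: self-supervised pre-training on unlabeled depth maps, in which an encoder-decoder with a ViT-based encoder learns environmental dynamics and depth distributions.
Stage 2: supervised fine-tuning, in which the Stage 1 encoder initializes the radio quality map (RQMap) predictor so that it outputs RQMaps from depth-map inputs.
Depth maps capture geometric information by projecting 3D SLAM point clouds onto a 2D plane.
Generative augmentation enriches the training data by running a ray tracer on geometric variants of the environment.
Two-stage training decouples geometry learning from radio propagation modeling to improve data efficiency and generalization.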
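
The depth-map step can be illustrated with a small sketch. The summary only says that 3D SLAM point clouds are projected onto a 2D plane; it does not specify the projection. The code below assumes a top-down orthographic projection that keeps the highest point per grid cell. The function name `project_to_depth_map`, its parameters, and the grid resolution are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres; an assumed convention


def project_to_depth_map(
    points: Sequence[Point],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    resolution: float,
    empty_value: float = 0.0,
) -> List[List[float]]:
    """Assumed top-down orthographic projection of a point cloud.

    Each grid cell stores the largest z value of the points that fall
    inside it; cells that receive no points keep `empty_value`.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    width = int(round((x_max - x_min) / resolution))
    height = int(round((y_max - y_min) / resolution))

    grid = [[empty_value] * width for _ in range(height)]
    filled = [[False] * width for _ in range(height)]

    for x, y, z in points:
        col = int((x - x_min) // resolution)
        row = int((y - y_min) // resolution)
        # Skip points that fall outside the mapped area.
        if not (0 <= col < width and 0 <= row < height):
            continue
        if not filled[row][col] or z > grid[row][col]:
            grid[row][col] = z
            filled[row][col] = True
    return grid


# Hypothetical 2 m x 2 m area sampled at 1 m resolution.
cloud = [(0.2, 0.3, 1.5), (0.8, 0.1, 2.0), (1.5, 0.5, 0.7), (5.0, 5.0, 9.0)]
depth_map = project_to_depth_map(cloud, (0.0, 2.0), (0.0, 2.0), 1.0)
print(depth_map)  # [[2.0, 0.7], [0.0, 0.0]]
```

A perspective or side-view projection would be equally consistent with the summary; the max-height rule is only one plausible way to collapse several points into a single cell.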

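Experimental Results

Research Questions

RQ1: In dynamic scenes, can the two-stage ViT-based framework accurately predict environmental changes and RQMaps?
RQ2: How do ViT, CNN, and MLP differ in prediction accuracy and runtime on CPU and GPU?
RQ3: Does geometric self-supervision (GFM) improve data efficiency and aid transfer to downstream geometry-derived tasks?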

Key Findings

Table I: Stage 1 performance on environment-change prediction (PSNR, dB)
Model	PSNR (dB)
ViT	29.01
CNN	27.53
MLP	24.06

Table II: Stage 2 RQMap prediction PSNR (dB) by area (* = out-of-distribution data)
Model	Area 1	Area 2	Area 3	Area 4	Area 5	Global
ViT	25.12	25.11	22.38	29.48	21.87	21.00
CNN	22.41	23.28	21.29	26.63	21.32	20.36
MLP	22.52	22.12	20.98	25.58	20.39	19.69
ViT*	21.70	19.24	17.19	24.68	15.93	14.50
CNN*	21.70	16.95	16.03	22.88	13.90	14.06

Table III: Geometry-derived maps and downstream transfer
Geometry tasks: Radio | Illumination | Temperature

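MapViT with a ViT backbone achieves a higher PSNR than CNN and MLP on Stage 1 depth-map prediction (ViT 29.01 dB, CNN 27.53 dB, MLP 24.06 dB).
For Stage 2 RQMap prediction, ViT offers the best accuracy-efficiency trade-off on both CPU and GPU, with real-time inference (~1 ms) where a ray tracer takes seconds to minutes.
ViT achieves a higher PSNR than CNN and MLP in all five warehouse areas and globally; on out-of-distribution data, ViT* matches CNN* in Area 1 and exceeds it everywhere else.
Stage 1 pre-training (GFM) improves data efficiency on downstream geometry tasks, reaching a higher PSNR with fewer labeled samples and converging faster.
Two-stage training reduces labeling effort and compute cost while yielding a reusable foundation model suited to downstream multi-modal tasks.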
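
For readers who want to interpret the PSNR figures above, the following sketch uses the standard definition PSNR = 10 * log10(peak^2 / MSE). The summary does not state the peak value or how the maps are normalised, so `peak=1.0` (maps scaled to [0, 1]) is an assumption, and both function names are hypothetical.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], peak: float = 1.0) -> float:
    """Standard PSNR in dB for two flattened maps of equal length.

    `peak` is the assumed maximum possible value of a map entry.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical maps
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_from_db_gap(gap_db: float) -> float:
    """Factor by which MSE shrinks for a PSNR gain of `gap_db` at a fixed peak."""
    return 10.0 ** (gap_db / 10.0)


# Toy two-pixel example with an assumed peak of 1.0.
example_psnr = psnr([0.0, 0.5], [0.1, 0.5])

# Stage 1 gap between ViT (29.01 dB) and CNN (27.53 dB) from the summary.
vit_vs_cnn = mse_ratio_from_db_gap(29.01 - 27.53)
print(round(example_psnr, 2), round(vit_vs_cnn, 2))
```

Under these assumptions, the 1.48 dB Stage 1 gap between ViT and CNN corresponds to roughly a 1.4x lower mean squared error. This holds exactly only if both scores use the same peak value and are computed on a single map rather than averaged over many.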

This review was generated by AI and reviewed by human editors.