QUICK REVIEW

[Paper Review] Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Di Wang, Qiming Zhang|arXiv (Cornell University)|Aug 8, 2022

Remote-Sensing Image Classification|Cited by 38

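One-Sentence Summary

This paper pretrains plain Vision Transformers (ViTs) with about 100M parameters on a large remote sensing dataset using MAE and introduces rotated varied-size window attention (RVSA) to adapt them to remote sensing tasks. The resulting models reach state-of-the-art object detection on DOTA-V1.0, deliver competitive classification and segmentation results, and improve data efficiency.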

ABSTRACT

Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring.

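Research Motivation and Goals

Verify that pretraining a plain ViT (about 100M parameters) on remote sensing data is feasible for remote sensing tasks.
Investigate whether a non-hierarchical plain ViT, given suitable pretraining, can reach competitive performance on remote sensing tasks.
Develop RVSA so that it reduces computation while handling the arbitrary orientations and scales found in remote sensing images.
Evaluate the transferability, efficiency, and robustness of pretrained plain ViTs on remote sensing object detection, classification, and segmentation.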

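Proposed Method

Pretrain plain ViT and ViTAE backbones (about 100M parameters) on MillionAID with MAE, without labels.
During fine-tuning, replace full self-attention with rotated varied-size window attention (RVSA) to handle remote sensing data with arbitrary orientations.
Introduce a rotation angle into the learned window configuration so that attention windows are oriented and vary in size.
Replace MHSA with RVSA (and its variants) in selected layers to form a remote sensing backbone for downstream tasks.
Train and evaluate on remote sensing tasks with standard remote sensing frameworks, covering scene classification (UCM, AID, NWPU), object detection (DOTA-V1.0, DIOR-R), and segmentation.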
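The window geometry behind RVSA can be illustrated with a small sketch. It starts from a default square window and applies a per-axis scale, a rotation, and a center shift to obtain the points where keys and values would be sampled. The function name, the parameter names, and the use of an explicit offset are assumptions made for illustration; in the actual model these values are predicted by the network and the tokens are gathered by interpolation, which is not reproduced here.

```python
import math


def rotated_window_points(center, base_size, scale, offset, angle, n=2):
    """Return n*n sampling points of a window after scaling, rotating, and shifting.

    center:    (cx, cy) of the default fixed window
    base_size: side length of the default square window
    scale:     (sx, sy) multipliers for width and height (hypothetical learned values)
    offset:    (ox, oy) shift of the window center (hypothetical learned values)
    angle:     counter-clockwise rotation in radians (hypothetical learned value)
    n:         points per side, must be at least 2
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    cx, cy = center
    sx, sy = scale
    ox, oy = offset
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    half = base_size / 2.0
    points = []
    for i in range(n):
        for j in range(n):
            # Regular grid in the window's local frame, spanning [-half, half].
            u = -half + base_size * j / (n - 1)
            v = -half + base_size * i / (n - 1)
            # Varied size: stretch each axis independently.
            u, v = u * sx, v * sy
            # Rotation about the window center.
            x = u * cos_a - v * sin_a
            y = u * sin_a + v * cos_a
            # Back to image coordinates, including the center shift.
            points.append((cx + ox + x, cy + oy + y))
    return points
```

With scale (1, 1), offset (0, 0), and angle 0 the function returns the corners of the default window, which corresponds to ordinary fixed-window attention. Changing only the angle turns the same window to follow an oriented object such as a ship or an aircraft.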

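Experimental Results

Research Questions

RQ1: Can a plain ViT backbone pretrained with MAE on RS data achieve competitive results on remote sensing tasks without a hierarchical structure?
RQ2: Compared with fixed-window attention, does RVSA improve the ViT's ability to model objects of arbitrary orientation and scale in remote sensing images?
RQ3: How do the pretraining scale and the mask ratio affect the performance of plain ViTs on downstream remote sensing tasks?
RQ4: How does a plain ViT with RVSA compare with state-of-the-art remote sensing models in accuracy, efficiency, and transferability?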
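To make the mask ratio in RQ3 concrete, here is a minimal sketch of MAE-style random patch masking, in which only the visible patches are passed to the encoder. The function name and the seed parameter are hypothetical, and the 0.75 ratio in the usage note is the default from the original MAE work, not a value confirmed by this summary.

```python
import random


def random_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) lists, MAE style.

    num_patches: total number of image patches
    mask_ratio:  fraction of patches hidden from the encoder, in [0, 1]
    seed:        seed for a reproducible shuffle (illustrative only)
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    # The first num_masked shuffled indices are hidden; the rest stay visible.
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked
```

For a 224x224 input with 16x16 patches there are 196 patches, so `random_mask(196, 0.75)` leaves 49 visible patches for the encoder and 147 for the decoder to reconstruct.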

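Key Findings

Plain ViTs (ViT-B and ViTAE-B) pretrained with MAE on MillionAID achieve competitive performance on remote sensing tasks after fine-tuning.
By giving attention rotated, varied-size windows, RVSA substantially improves remote sensing object detection, reaching 81.24% mAP on DOTA-V1.0.
RVSA-based variants perform strongly on remote sensing scene classification and segmentation and are competitive with existing advanced methods.
The method shows advantages in computational complexity and data efficiency when transferred to remote sensing tasks.
In the ablation study, an RVSA window size of 7 gives the peak mAP on both DOTA-V1.0 and DIOR-R, which shows the importance of a suitable window configuration.
The method uses window-based attention to process large remote sensing images, which lowers FLOPs and memory requirements while keeping rich context modeling.
The work positions the plain ViT as a viable backbone for remote sensing foundation models, combining effective pretraining with an attention mechanism tailored to remote sensing characteristics.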
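The saving from window-based attention can be estimated with a rough count of multiply-accumulate operations. The sketch below counts only the two matrix products that involve the attention map (query-key scores and the weighted sum of values) and ignores projections, the softmax, and any overhead RVSA adds for predicting and sampling its windows. The function names and the 56x56 token grid are illustrative assumptions.

```python
def full_attention_macs(num_tokens, dim):
    """MACs for the score matrix plus the weighted sum of values over num_tokens tokens."""
    return 2 * num_tokens * num_tokens * dim


def window_attention_macs(grid_h, grid_w, window, dim):
    """MACs when attention is restricted to non-overlapping window x window blocks."""
    if grid_h % window or grid_w % window:
        raise ValueError("grid size must be divisible by the window size")
    num_windows = (grid_h // window) * (grid_w // window)
    tokens_per_window = window * window
    # Each window runs full attention over its own tokens only.
    return num_windows * full_attention_macs(tokens_per_window, dim)


if __name__ == "__main__":
    grid, window, dim = 56, 7, 768  # illustrative values, not taken from the paper
    full = full_attention_macs(grid * grid, dim)
    windowed = window_attention_macs(grid, grid, window, dim)
    print(f"full: {full:,} MACs, windowed: {windowed:,} MACs, ratio: {full // windowed}x")
```

Under these assumptions the attention-map cost drops by a factor equal to the number of windows, because full attention grows quadratically with the token count while windowed attention grows linearly for a fixed window size.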

This review was generated by AI and checked by a human editor.