QUICK REVIEW

[Paper Review] A Billion-scale Foundation Model for Remote Sensing Images

Keumgang Cha, Junghoon Seo|arXiv (Cornell University)|Apr 11, 2023

Advanced Neural Network Applications|Cited by 16

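One-sentence summary

This paper examines how increasing the parameter count of vision transformers pretrained with MAE on MillionAID affects downstream remote sensing tasks. It shows that billion-scale models improve rotated object detection and semantic segmentation performance and reach state-of-the-art results on several benchmarks.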

ABSTRACT

As the potential of foundation models in visual tasks has garnered significant attention, pretraining these models before downstream tasks has become a crucial step. The three key factors in pretraining foundation models are the pretraining method, the size of the pretraining dataset, and the number of model parameters. Recently, research in the remote sensing field has focused primarily on the pretraining method and the size of the dataset, with limited emphasis on the number of model parameters. This paper addresses this gap by examining the effect of increasing the number of model parameters on the performance of foundation models in downstream tasks such as rotated object detection and semantic segmentation. We pretrained foundation models with varying numbers of parameters, including 86M, 605.26M, 1.3B, and 2.4B, to determine whether performance in downstream tasks improved with an increase in parameters. To the best of our knowledge, this is the first billion-scale foundation model in the remote sensing field. Furthermore, we propose an effective method for scaling up and fine-tuning a vision transformer in the remote sensing field. To evaluate general performance in downstream tasks, we employed the DOTA v2.0 and DIOR-R benchmark datasets for rotated object detection, and the Potsdam and LoveDA datasets for semantic segmentation. Experimental results demonstrated that, across all benchmark datasets and downstream tasks, the performance of the foundation models and data efficiency improved as the number of parameters increased. Moreover, our models achieve the state-of-the-art performance on several datasets including DIOR-R, Potsdam, and LoveDA.

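Motivation and objectives

Investigate how model scale (parameter count) affects remote sensing foundation models.
Pretrain with MAE on MillionAID and examine the effect of scaling.
Evaluate downstream performance on rotated object detection and semantic segmentation benchmarks.
Demonstrate an effective way to scale up and fine-tune vision transformers for high-resolution RS tasks.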

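Proposed method

Pretrain vision transformer backbones at several parameter scales (86M, 605.26M, 1.3B, 2.4B) with MAE on MillionAID.
To isolate the effect of parameter count, scale up the ViT by adjusting the hidden size, MLP size, number of heads, and degree of parallelism while keeping the depth at 12 layers.
For downstream tasks, adapt the pretrained plain ViT backbone to ViTDet, which introduces local and global attention.
Use scale blocks (transposed convolution, normalization, GELU, pooling) to upsample and downsample features for high-resolution tasks.
Fine-tune on rotated object detection (DOTA v2.0, DIOR-R) and semantic segmentation (Potsdam, LoveDA).
Pretraining and fine-tuning settings include MAE reconstruction of 75% masked patches, 400 epochs of pretraining, AdamW, and fp16 with activation checkpointing.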
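The 75% masking step in the pretraining settings above can be illustrated with a small sketch. This is not the authors' code: the function name `random_mask`, the 224x224 image size, and the 16x16 patch size are assumptions chosen for illustration, and only the 75% ratio comes from the review. In MAE the encoder sees only the visible patches and the decoder reconstructs the masked ones.

```python
import random

def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists, MAE-style.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)                      # random permutation of patch indices
    masked = sorted(order[:num_masked])     # patches hidden from the encoder
    visible = sorted(order[num_masked:])    # patches the encoder actually sees
    return visible, masked

# Assumed sizes: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

Because the encoder processes only about a quarter of the patches, the per-image pretraining cost drops substantially, which is one reason MAE is a plausible fit for scaling to very large backbones.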

Figure 2: A brief introduction of self supervised learning, such as contrastive learning, self-distillation and masked image modeling in computer vision. In contrastive learning, the positive pairs of data are brought closer together while the negative pairs are pushed further apart. Self-distillati

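Experimental results

Research questions

RQ1: Does increasing the number of model parameters improve downstream performance for remote sensing foundation models?
RQ2: Does a billion-parameter Vision Transformer pretrained with MAE on a remote sensing dataset outperform its smaller counterparts on rotated object detection and semantic segmentation?
RQ3: Which architectural adaptations (ViTDet, scale blocks, parallel attention) are effective for remote sensing localization tasks?
RQ4: Is there evidence on standard RS benchmarks that larger parameter counts improve data efficiency?
RQ5: Do billion-scale RS foundation models achieve state-of-the-art (SOTA) results on DIOR-R, Potsdam, and LoveDA?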

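Key findings

Performance improves on every benchmark and downstream task as the parameter count increases.
The billion-scale model (2.4B parameters) achieves state-of-the-art performance on several RS datasets, including DIOR-R, Potsdam, and LoveDA.
Scaling the ViT by adjusting parallelism and the hidden and MLP sizes effectively supports object localization tasks such as rotated object detection and semantic segmentation.
MAE pretraining on MillionAID provides strong in-domain representations for downstream RS tasks and enables data-efficient fine-tuning.
The ViTDet-based downstream adaptation combines local and global attention to balance computation and memory use on high-resolution RS inputs.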
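To make the scaling knobs in these findings concrete, the following rough sketch counts the weights in a stack of transformer blocks as a function of hidden size, MLP size, depth, and parallelism. The function names are hypothetical, and the treatment of parallelism (each parallel branch carries its own full copy of the block weights) is an assumption, not something the review confirms. The only configuration evaluated is the standard ViT-Base one (hidden size 768, MLP size 3072, 12 layers); the review does not give the dimensions of the larger models, so they are not guessed here.

```python
def vit_block_params(hidden, mlp, parallel=1):
    """Approximate weight count of one transformer block (biases included)."""
    attn = 4 * hidden * hidden + 4 * hidden   # QKV projection + output projection
    ffn = 2 * hidden * mlp + mlp + hidden     # two linear layers of the MLP
    norms = 2 * 2 * hidden                    # two LayerNorms (scale and shift)
    # Assumption: every parallel branch duplicates all of the block's weights.
    return parallel * (attn + ffn + norms)

def vit_encoder_params(hidden, mlp, layers=12, parallel=1):
    """Weight count of the block stack only (no patch or position embeddings)."""
    return layers * vit_block_params(hidden, mlp, parallel)

# Standard ViT-Base dimensions: the block stack holds about 85M weights.
print(f"{vit_encoder_params(768, 3072):,}")  # 85,054,464
```

Adding the patch embedding, position embeddings, and final normalization brings the ViT-Base total close to the 86M reported for the smallest model. With depth fixed at 12, the count grows roughly quadratically in the hidden size and linearly in the parallelism factor under the assumption above, while the number of heads does not change it for a fixed hidden size.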

Figure 3: This figure explains how to effectively increase the number of parameters of the vision transformer, and the two models have substantially the same amount of computation and number of parameters. In the field of natural language processing, multi head self attention and feed forward blocks

This review was generated by AI and checked by a human editor.