QUICK REVIEW

[Paper Review] Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yuqi Yang, Yuxiao Guo|arXiv (Cornell University)|Apr 14, 2023

Robotics and Sensor-Based Localization | Cited by: 34

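One-line summary

Swin3D introduces a pretrained 3D transformer backbone for indoor scene understanding, featuring memory-efficient sparse self-attention and contextual relative signal encoding. It is pretrained on the large-scale synthetic Structured3D dataset and fine-tuned on real-world 3D datasets.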

ABSTRACT

The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validate the scalability, generality, and superior performance enabled by our approach. The code and models are available at https://github.com/microsoft/Swin3D.

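Research motivation and objectives

Motivates the need for a scalable pretrained backbone for 3D indoor scene understanding.
Proposes Swin3D, a 3D Swin transformer backbone with linear memory complexity on sparse voxels.
Addresses the memory cost and signal irregularity challenges of 3D self-attention.
Pretrains Swin3D on the large-scale synthetic Structured3D dataset and verifies that it generalizes to downstream tasks.
Demonstrates strong performance on 3D segmentation and detection after fine-tuning.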

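Proposed method

Designs a 3D Swin transformer backbone that operates on sparse voxels using local-window self-attention.
Implements memory-efficient self-attention that reduces the quadratic memory cost by deferring SoftMax normalization.
Generalizes contextual relative positional encoding to multiple signals (position, color, normal) as Contextual Relative Signal Encoding (cRSE).
Encodes multi-scale features with a five-level hierarchical sparse voxel grid.
Pretrains Swin3D-S and Swin3D-L on Structured3D for semantic segmentation, then fine-tunes them on downstream datasets with task-specific decoders.
Evaluates both segmentation and detection on ScanNet and S3DIS and compares against state-of-the-art methods.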
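
To make the first two points above concrete, here is a minimal pure-Python sketch of window-restricted attention over sparse voxels in which SoftMax normalization is applied only once, at the end of the accumulation. It is a generic streaming-softmax illustration and not the paper's implementation. The function names (`group_by_window`, `attend_deferred`, `window_attention`) are hypothetical, and the sketch assumes that voxel features act directly as queries, keys, and values, with no learned projections, no shifted windows, and no cRSE terms.

```python
import math
from collections import defaultdict


def group_by_window(coords, window_size):
    """Group indices of occupied voxels by the cubic window containing them."""
    windows = defaultdict(list)
    for index, (x, y, z) in enumerate(coords):
        key = (x // window_size, y // window_size, z // window_size)
        windows[key].append(index)
    return windows


def attend_deferred(query, keys, values):
    """Attention output for one query, normalized only at the very end.

    Keeps a running maximum, an un-normalized weighted sum, and a running
    denominator, so the row of attention weights is never stored.
    """
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf
    denom = 0.0
    numer = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        logit = scale * sum(q * k for q, k in zip(query, key))
        new_max = max(running_max, logit)
        rescale = math.exp(running_max - new_max)  # 0.0 on the first step
        weight = math.exp(logit - new_max)
        denom = denom * rescale + weight
        numer = [n * rescale + weight * v for n, v in zip(numer, value)]
        running_max = new_max
    return [n / denom for n in numer]


def window_attention(coords, feats, window_size):
    """Self-attention restricted to voxels that share a window.

    Assumption for illustration: features are used as queries, keys, and values.
    """
    out = [None] * len(coords)
    for members in group_by_window(coords, window_size).values():
        keys = [feats[j] for j in members]
        for i in members:
            out[i] = attend_deferred(feats[i], keys, keys)
    return out


# Three occupied voxels; the first two share a window, the third is alone.
coords = [(0, 0, 0), (1, 1, 1), (5, 5, 5)]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
result = window_attention(coords, feats, window_size=4)
```

In this toy input the voxel at (5, 5, 5) is the only occupant of its window, so it attends only to itself and its output equals its input feature. Because only occupied voxels are visited and each query carries a fixed-size accumulator, extra memory in this sketch grows with the number of voxels rather than with the number of query-key pairs.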

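Experimental results

Research questions

RQ1: Can a pretrained 3D backbone trained on synthetic data generalize to real-world 3D indoor scene understanding tasks?
RQ2: Does memory-efficient self-attention make it possible to train larger 3D backbones at scale?
RQ3: How does the generalized contextual relative signal encoding affect performance on irregular point signals?
RQ4: What advantages does a pretrained 3D backbone offer for segmentation and detection compared with training from scratch?
RQ5: How does Swin3D compare with other methods on both segmentation and detection across multiple benchmarks (ScanNet, S3DIS)?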

Key findings

Swin3D pretrained on Structured3D outperforms state-of-the-art methods on downstream tasks.
S3DIS Area 5 segmentation: +2.3 mIoU over the prior state of the art.
S3DIS 6-fold segmentation: +2.2 mIoU.
ScanNet segmentation (val): +1.8 mIoU.
ScanNet detection: +1.9 mAP@0.5.
S3DIS detection: +8.1 mAP@0.5.
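
For readers less familiar with the segmentation metric quoted above, here is a minimal sketch of mean intersection-over-union (mIoU) on per-point labels. The function name `mean_iou` is hypothetical. Skipping classes that appear in neither the prediction nor the ground truth is an assumption of this sketch, and the official benchmark protocols may handle that case differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, computed from per-point integer labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both label sets
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Toy example with three classes and five points.
pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 2]
score = mean_iou(pred, target, num_classes=3)
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so `score` is 13/18, roughly 0.722. A gain such as "+2.3 mIoU" refers to a difference of 2.3 percentage points in this quantity.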


This review was generated by AI and checked by a human editor.