QUICK REVIEW

[Paper Review] Point Transformer

Hengshuang Zhao, Li Jiang|arXiv (Cornell University)|Dec 16, 2020

Sensor Technology and Measurement Systems | Cited by 34

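One-sentence summary

The paper introduces the Point Transformer layer, which applies local vector self-attention to 3D point clouds, uses it to build backbones for classification and dense prediction, and achieves state-of-the-art results on S3DIS, ModelNet40, and ShapeNetPart. It emphasizes a trainable position encoding and vector attention, which together provide scalability and accuracy for large-scale 3D understanding.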

ABSTRACT

Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks. For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of 70.4% on Area 5, outperforming the strongest prior model by 3.3 absolute percentage points and crossing the 70% mIoU threshold for the first time.

Motivation and objectives

Motivate and adapt self-attention for unordered 3D point clouds.
Develop a Point Transformer layer with vector self-attention over local neighborhoods.
Build backbones for classification and dense prediction using only self-attention and pointwise operations.
Examine position encoding, neighborhood size, and the form of attention to optimize performance.
Demonstrate state-of-the-art results on S3DIS, ModelNet40, and ShapeNetPart.

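Proposed method

Define a local vector self-attention operator over the k nearest neighbors of each point.
Incorporate a trainable position encoding delta = theta(p_i - p_j) into both the attention branch and the feature branch.
Use a residual Point Transformer block as the core building unit.
For segmentation, assemble a multi-stage backbone with U-Net-style transition down and transition up modules; for classification, use a global pooling path.
Evaluate on diverse 3D benchmarks, including S3DIS, ModelNet40, and ShapeNetPart, with ablation studies on k, position encoding, and attention type.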
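The layer described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `knn_indices`, `mlp`, `init_params`, and `point_transformer_layer` are hypothetical, the projections are plain bias-free linear maps, neighbors are found by brute force, the relation between query and key is assumed to be subtraction, and normalization layers and the surrounding residual block are omitted.

```python
import numpy as np

def knn_indices(points, k):
    """Brute-force k nearest neighbors; each point counts as its own neighbor."""
    diff = points[:, None, :] - points[None, :, :]   # (N, N, 3)
    dist2 = np.sum(diff ** 2, axis=-1)               # (N, N) squared distances
    return np.argsort(dist2, axis=1)[:, :k]          # (N, k)

def mlp(x, w1, w2):
    """Two linear layers with one ReLU, applied over the last axis."""
    return np.maximum(x @ w1, 0.0) @ w2

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)          # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(c_in, c_out, rng, scale=0.1):
    """Random weights for illustration only (hypothetical layout)."""
    shapes = {
        "w_q": (c_in, c_out), "w_k": (c_in, c_out), "w_v": (c_in, c_out),
        "theta1": (3, c_out), "theta2": (c_out, c_out),      # position encoding MLP
        "gamma1": (c_out, c_out), "gamma2": (c_out, c_out),  # attention MLP
    }
    return {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}

def point_transformer_layer(points, feats, params, k=16):
    """Local vector self-attention over the k nearest neighbors of each point."""
    idx = knn_indices(points, k)                     # (N, k)
    q = feats @ params["w_q"]                        # (N, C) query features
    key = feats @ params["w_k"]                      # (N, C) key features
    v = feats @ params["w_v"]                        # (N, C) value features
    rel_pos = points[:, None, :] - points[idx]       # (N, k, 3), p_i - p_j
    delta = mlp(rel_pos, params["theta1"], params["theta2"])  # (N, k, C)
    # Attention branch: subtraction relation plus position encoding.
    logits = mlp(q[:, None, :] - key[idx] + delta,
                 params["gamma1"], params["gamma2"]) # (N, k, C)
    weights = softmax(logits, axis=1)                # normalize over neighbors, per channel
    # Feature branch: values also receive the position encoding.
    return np.sum(weights * (v[idx] + delta), axis=1)  # (N, C)

rng = np.random.default_rng(0)
points = rng.normal(size=(64, 3))   # toy point cloud
feats = rng.normal(size=(64, 8))    # toy per-point input features
params = init_params(8, 32, rng)
out = point_transformer_layer(points, feats, params, k=16)
print(out.shape)  # (64, 32)
```

Because the attention weights have shape (N, k, C) rather than (N, k), each feature channel is modulated separately. The position encoding `delta` depends only on relative offsets, so the output does not depend on the order in which the points are listed.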

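Experimental results

Research questions

RQ1: Does local vector self-attention over point cloud neighborhoods outperform prior 3D point cloud methods on classification and segmentation tasks?
RQ2: How do neighborhood size, position encoding, and the form of attention affect Point Transformer performance?
RQ3: Can a transformer-based backbone with minimal preprocessing compete with voxel-based and graph-based 3D networks on large-scale scenes?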

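Key findings

Achieves 70.4% mIoU on S3DIS Area 5 and 73.5% mIoU under 6-fold cross-validation, surpassing the prior state of the art.
Achieves 93.7% overall accuracy on ModelNet40 and 86.6% instance mIoU on ShapeNetPart, outperforming several baselines.
Point Transformer has relatively few parameters (4.9M) and is lighter than KPConv (14.9M) and SparseConv (30.1M).
Ablations show that relative position encoding and vector attention markedly improve performance compared with the baselines and with absolute or no position encoding.
Vector attention substantially outperforms scalar attention and the non-attention variants, highlighting the benefit of per-channel modulation.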
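The difference between the two attention forms compared in the ablation can be written out explicitly. The notation below follows the standard formulation used in the paper as best understood here: \varphi, \psi, and \alpha are pointwise feature transformations, \delta is the position encoding, \rho is a normalization such as softmax, \beta is a relation function (for example subtraction), and \gamma is a mapping function (for example an MLP) that produces one attention value per channel.

```latex
\begin{align}
% Scalar (dot-product) attention: one scalar weight per neighbor
y_i &= \sum_{x_j \in \mathcal{X}} \rho\big(\varphi(x_i)^{\top}\psi(x_j) + \delta\big)\,\alpha(x_j) \\
% Vector attention: one weight per neighbor and per feature channel
y_i &= \sum_{x_j \in \mathcal{X}} \rho\big(\gamma(\beta(\varphi(x_i), \psi(x_j)) + \delta)\big) \odot \alpha(x_j)
\end{align}
```

In the scalar form every channel of \alpha(x_j) is scaled by the same weight, whereas in the vector form the elementwise product \odot lets each channel be weighted independently.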

This review was generated by AI and checked by a human editor.