QUICK REVIEW

[Paper Review] ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding

Lunhao Duan, Shanshan Zhao|arXiv (Cornell University)|Dec 18, 2023

3D Surveying and Cultural Heritage|Cited by 9

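One-Sentence Summary

ConDaFormer introduces a disassembled window attention mechanism that splits each 3D window into three orthogonal 2D planes to reduce computation, and adds local structure enhancement based on depth-wise sparse convolution to capture local geometry, improving 3D point cloud segmentation and detection.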

ABSTRACT

Transformers have been recently explored for 3D point cloud understanding with impressive progress achieved. A large number of points, over 0.1 million, make the global self-attention infeasible for point cloud data. Thus, most methods propose to apply the transformer in a local region, e.g., spherical or cubic window. However, it still contains a large number of Query-Key pairs, which requires high computational costs. In addition, previous methods usually learn the query, key, and value using a linear projection without modeling the local 3D geometric structure. In this paper, we attempt to reduce the costs and model the local geometry prior by developing a new transformer block, named ConDaFormer. Technically, ConDaFormer disassembles the cubic window into three orthogonal 2D planes, leading to fewer points when modeling the attention in a similar range. The disassembling operation is beneficial to enlarging the range of attention without increasing the computational complexity, but ignores some contexts. To provide a remedy, we develop a local structure enhancement strategy that introduces a depth-wise convolution before and after the attention. This scheme can also capture the local geometric information. Taking advantage of these designs, ConDaFormer captures both long-range contextual information and local priors. The effectiveness is demonstrated by experimental results on several 3D point cloud understanding benchmarks. Code is available at https://github.com/LHDuan/ConDaFormer .

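Motivation and Objectives

Reduce the computational cost of self-attention on large-scale 3D point clouds without sacrificing contextual information.
Incorporate local geometric priors into the transformer block to improve feature learning on irregular point clouds.
Achieve strong performance on 3D semantic segmentation and object detection benchmarks.
Provide an efficient backbone design for downstream 3D vision tasks.
Provide ablations that verify the contributions of the disassembling operation and of local structure enhancement.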

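Proposed Method

Disassemble the cubic window into three orthogonal 2D planes (XY, XZ, YZ) so that self-attention is computed over fewer points.
Apply context-based adaptive relative position encoding to retain positional information across the planes.
Merge the attention outputs from the three planes and project them through a linear layer (DaFormer).
Introduce a Local Structure Enhancement (LSE) module, consisting of a depth-wise sparse convolution and a 1x1x1 convolution branch, before and after the attention.
Place LSE on both sides of the attention to enrich local context and to propagate long-range context into local neighborhoods.
Build a four-stage backbone with downsampling; use U-Net-style upsampling for semantic segmentation and the FCAF3D/CAGroup3D backbones for detection.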
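
The disassembling step above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes voxelized integer coordinates, assumes each plane window spans `window` voxels along its two in-plane axes and a single voxel along the normal axis, omits the relative position encoding and the final linear projection, and merges the three plane outputs by simple concatenation. All function names are hypothetical.

```python
import numpy as np

# Axis layout per plane: (in-plane axis, in-plane axis, normal axis).
PLANES = {"xy": (0, 1, 2), "xz": (0, 2, 1), "yz": (1, 2, 0)}

def cubic_window_ids(coords, window):
    """Group voxels into non-overlapping cubic windows of side `window`."""
    _, ids = np.unique(coords // window, axis=0, return_inverse=True)
    return ids.reshape(-1)

def plane_window_ids(coords, window, plane):
    """Group voxels into 2D plane windows (assumed one voxel thick)."""
    a, b, n = PLANES[plane]
    key = np.stack(
        [coords[:, a] // window, coords[:, b] // window, coords[:, n]], axis=1
    )
    _, ids = np.unique(key, axis=0, return_inverse=True)
    return ids.reshape(-1)

def count_pairs(ids):
    """Number of query-key pairs when attention is restricted to each group."""
    counts = np.bincount(ids).astype(np.int64)
    return int((counts ** 2).sum())

def windowed_attention(q, k, v, ids):
    """Plain softmax attention computed independently inside each group."""
    out = np.zeros_like(v)
    for g in np.unique(ids):
        m = ids == g
        logits = q[m] @ k[m].T / np.sqrt(q.shape[1])
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)
        out[m] = w @ v[m]
    return out

def disassembled_attention(q, k, v, coords, window):
    """Attention over the XY, XZ and YZ plane windows, merged by concatenation
    (an assumed merge; a linear projection would normally follow)."""
    outs = [
        windowed_attention(q, k, v, plane_window_ids(coords, window, p))
        for p in PLANES
    ]
    return np.concatenate(outs, axis=1)

def dense_grid(side):
    """All integer voxel coordinates of a fully occupied side^3 grid."""
    axes = [np.arange(side)] * 3
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
```

Under these assumptions, a fully occupied window of side S gives each query S^3 keys in the cubic case and 3 x S^2 keys across the three planes. For example, `count_pairs(cubic_window_ids(dense_grid(4), 4))` counts the pairs of one dense 4x4x4 cubic window, and summing `count_pairs(plane_window_ids(dense_grid(4), 4, p))` over the three planes counts the pairs of the disassembled version; the gap widens as the window grows.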

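Experimental Results

Research Questions

RQ1: Can disassembling the 3D window into three orthogonal planes reduce computation while maintaining or improving performance?
RQ2: Does adding local structure enhancement with depth-wise sparse convolution before and after the attention improve modeling of the local geometry of point clouds?
RQ3: How does ConDaFormer perform on 3D semantic segmentation and object detection benchmarks compared with previous state-of-the-art methods?
RQ4: What do the ablations show about sharing position embeddings across planes and about the effect of window size on performance and efficiency?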

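Key Findings

ConDaFormer achieves state-of-the-art mIoU on ScanNet v2 and S3DIS, with competitive results on ScanNet200.
DaFormer (disassembled attention) reduces computation compared with cubic window attention and, depending on the configuration, matches or improves its performance.
Local Structure Enhancement (LSE) before and after the attention yields improvements, with the best results obtained when both branches are used.
On SUN RGB-D detection, ConDaFormer achieves competitive mAP@0.25 and mAP@0.5 with fewer parameters than several baselines.
Ablations show the effect of sharing position embeddings across planes and of the window size on performance and efficiency.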


This review was generated by AI and checked by a human editor.