QUICK REVIEW

[Paper Review] Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?

Runpei Dong, Zekun Qi|arXiv (Cornell University)|Dec 16, 2022

Domain Adaptation and Few-Shot Learning | Citations: 20

One-Line Summary

This paper proposes Autoencoders as Cross-Modal Teachers (ACT), which uses pretrained 2D image or language Transformers as cross-modal teachers to guide self-supervised 3D representation learning through masked point modeling. ACT achieves strong generalization across 3D tasks.

ABSTRACT

The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages. This promotes the potential of utilizing models pretrained with data more than 3D as teachers for cross-modal knowledge transferring. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the target of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN. Codes have been released at https://github.com/RunpeiDong/ACT.

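Motivation and Objectives

Promote cross-modal knowledge transfer to 3D learning, motivated by the scarcity of 3D data.
Leverage pretrained 2D image and language Transformers as teachers for 3D autoencoding.
Develop a two-stage training framework in which 3D representations inherit rich semantic information.
Preserve pretrained knowledge while avoiding additional downstream data annotation.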

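Proposed Method

Frames 3D learning as masked modeling guided by cross-modal teachers.
Stage I: Transfers the pretrained Transformers into 3D autoencoders via prompt tuning.
Stage II: Distills latent features from the 3D autoencoder (teacher) into the 3D Transformer student via masked point modeling with a cosine similarity loss.
Uses a discrete variational autoencoder (dVAE) tokenizer and FoldingNet-based reconstruction within the 3D autoencoder.
Performs two-stage training with prompt embeddings to preserve pretrained knowledge during cross-modal transfer.
Frames masking-based distillation under a unified masked modeling objective (negative cosine similarity).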
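
The Stage II objective can be illustrated with a small sketch. This is not the authors' implementation: the function names (`random_mask`, `masked_distillation_loss`), the 0.5 mask ratio, and the toy 2-dimensional features are all assumptions made for illustration. The sketch only shows the general idea of averaging the negative cosine similarity between student predictions and teacher latent features over the masked token positions.

```python
import math
import random


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def random_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans; True marks a masked token (hypothetical helper)."""
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def masked_distillation_loss(student_preds, teacher_targets, mask):
    """Mean negative cosine similarity, computed on masked positions only."""
    losses = [
        -cosine_similarity(s, t)
        for s, t, m in zip(student_preds, teacher_targets, mask)
        if m
    ]
    if not losses:
        return 0.0
    return sum(losses) / len(losses)


# Toy example: four tokens with 2-dimensional features (assumed values).
rng = random.Random(0)
teacher_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
student_features = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.9], [-0.5, 0.5]]
mask = random_mask(len(teacher_features), mask_ratio=0.5, rng=rng)
loss = masked_distillation_loss(student_features, teacher_features, mask)
print("masked positions:", mask)
print("loss:", loss)
```

The loss lies in the range [-1, 1] and reaches -1 when the student's predictions point in exactly the same direction as the teacher's features at every masked position. In a real training setup the teacher features would come from the frozen Stage I autoencoder and gradients would flow only through the student.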

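Experimental Results

Research Questions

RQ1: Can pretrained 2D image or language Transformers improve self-supervised 3D representation learning without 2D/language downstream data?
RQ2: Does prompt tuning help preserve cross-modal knowledge when adapting Transformers to 3D autoencoding?
RQ3: Is masked point modeling with cross-modal teachers effective for 3D Transformers?
RQ4: How does ACT perform across various 3D downstream tasks compared with 2D/3D SSL methods?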

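Key Findings

ACT achieves strong generalization across 3D downstream tasks, including notable improvements on ScanObjectNN.
On ScanObjectNN, ACT improves accuracy by +11.9% on average in certain settings.
On ModelNet40, ACT reaches 93.7% OA under the Full transfer protocol with 1k points.
On 3D scene segmentation (S3DIS Area 5), ACT improves mAcc by +2.5% and mIoU by +1.2%.
Using a language model (BERT-base) as the cross-modal teacher also achieves competitive accuracy, demonstrating ACT's modality-agnostic capability.
In Stage I, prompt tuning with a frozen pretrained model outperforms full tuning and retains more pretrained knowledge.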
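
For readers unfamiliar with the metrics quoted above, the following sketch shows the commonly used definitions of overall accuracy (OA), mean class accuracy (mAcc), and mean intersection over union (mIoU), computed from a confusion matrix. The review does not include the paper's evaluation code, so the function names and the toy 2-class matrix are assumptions for illustration only.

```python
def overall_accuracy(conf):
    """OA: correctly classified samples divided by all samples.

    conf[i][j] counts samples whose true class is i and predicted class is j.
    """
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total


def mean_class_accuracy(conf):
    """mAcc: per-class recall averaged over classes that have samples."""
    per_class = [
        conf[i][i] / sum(conf[i])
        for i in range(len(conf))
        if sum(conf[i]) > 0
    ]
    return sum(per_class) / len(per_class)


def mean_iou(conf):
    """mIoU: TP / (TP + FP + FN) per class, averaged over classes."""
    n = len(conf)
    ious = []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp
        fp = sum(conf[r][i] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy 2-class confusion matrix (assumed values, not from the paper).
confusion = [[8, 2],
             [1, 9]]
print("OA:", overall_accuracy(confusion))
print("mAcc:", mean_class_accuracy(confusion))
print("mIoU:", mean_iou(confusion))
```

OA weights every sample equally, whereas mAcc and mIoU weight every class equally, which is why segmentation benchmarks with imbalanced classes typically report the latter two.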


This review was generated by AI and checked by a human editor.