QUICK REVIEW

[Paper Review] Meta-Transformer: A Unified Framework for Multimodal Learning

Yiyuan Zhang, Kaixiong Gong|arXiv (Cornell University)|Jul 20, 2023

Multimodal Machine Learning Applications|Cited by 45

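One-Line Summary

Meta-Transformer presents a single frozen-encoder framework that unifies 12 modalities through a unified data tokenizer and task-specific heads, enabling multimodal learning from unpaired data with strong cross-domain results.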

ABSTRACT

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks including fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer

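Motivation and Objectives

Motivates the design of a modality-agnostic framework that unifies multimodal perception under a shared parameter space.
Proposes Meta-Transformer, which combines modality-specific tokenizers, a shared frozen encoder, and task-specific heads.
Demonstrates the framework on 12 modalities using unpaired data and image-based pretraining.
Analyzes how tokenization, encoder sharing, and heads contribute to cross-modal representations and downstream tasks.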

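Proposed Method

Introduces modality-specialized data-to-sequence tokenizers that map diverse data into a shared token space.
Uses a modality-shared Transformer encoder with frozen parameters, pretrained on LAION-2B and paired with the CLIP text tokenizer for text, as a unified backbone.
Attaches task-specific heads that adapt the representations to downstream tasks while the backbone stays frozen.
Prepends a learnable CLS token and adds position embeddings that encode the sequence order.
Trains the task heads (and lightweight tokenizers) while the core encoder stays frozen, enabling unpaired multimodal learning.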
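
The data flow described above (tokenizer, CLS token plus position embeddings, frozen shared encoder, task head) can be illustrated with a toy sketch. This is not the authors' implementation: it uses only the Python standard library, and every name and size in it (`DIM`, `MAX_LEN`, `Tokenizer`, `FrozenEncoder`, `Head`, the chunk sizes, the class counts) is a hypothetical stand-in. The "encoder" is a fixed mixing function rather than a real Transformer, and no training loop is shown; the point is only that two modalities with different raw shapes pass through the same frozen weights.

```python
import random

DIM = 8       # hypothetical shared token width (the real model is far wider)
MAX_LEN = 16  # hypothetical maximum sequence length, CLS included


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as nested lists (stand-in for weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(vec, matrix):
    """Vector-matrix product: len(vec) == len(matrix) -> len(matrix[0])."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


class Tokenizer:
    """Modality-specific: chunk a flat signal and project each chunk to DIM."""

    def __init__(self, chunk, rng):
        self.chunk = chunk
        self.weight = make_matrix(chunk, DIM, rng)  # would be trained

    def __call__(self, signal):
        usable = len(signal) - len(signal) % self.chunk  # drop a ragged tail
        return [project(signal[i:i + self.chunk], self.weight)
                for i in range(0, usable, self.chunk)]


class FrozenEncoder:
    """Toy stand-in for the shared encoder; its weights are never updated."""

    def __init__(self, rng):
        self.weight = make_matrix(DIM, DIM, rng)

    def __call__(self, tokens):
        # Mix in the sequence mean (a crude proxy for attention), then project.
        mean = [sum(col) / len(tokens) for col in zip(*tokens)]
        return [project([t + m for t, m in zip(tok, mean)], self.weight)
                for tok in tokens]


def embed(tokens, cls_token, pos_table):
    """Prepend the CLS token, then add one position embedding per slot."""
    seq = [cls_token] + tokens
    assert len(seq) <= len(pos_table), "sequence longer than position table"
    return [[x + p for x, p in zip(tok, pos)]
            for tok, pos in zip(seq, pos_table)]


class Head:
    """Task-specific head: a linear map applied to the CLS position."""

    def __init__(self, n_classes, rng):
        self.weight = make_matrix(DIM, n_classes, rng)  # would be trained

    def __call__(self, encoded):
        return project(encoded[0], self.weight)


rng = random.Random(0)
encoder = FrozenEncoder(rng)                 # one encoder for every modality
cls_token = [0.0] * DIM
pos_table = make_matrix(MAX_LEN, DIM, rng)

image_tok, image_head = Tokenizer(4, rng), Head(10, rng)  # toy "image" task
audio_tok, audio_head = Tokenizer(3, rng), Head(5, rng)   # toy "audio" task

image_logits = image_head(encoder(embed(image_tok([0.5] * 16), cls_token, pos_table)))
audio_logits = audio_head(encoder(embed(audio_tok([0.1] * 10), cls_token, pos_table)))
print(len(image_logits), len(audio_logits))
```

In this sketch the only per-modality pieces are the tokenizer and the head, which mirrors the split between trainable and frozen components described in the list above.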

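Experimental Results

Research Questions

RQ1: Can a single frozen Transformer body, combined with modality-specific tokenizers, effectively encode 12 diverse modalities?
RQ2: To what extent can unpaired multimodal data support unified perception across text, images, 3D, audio, video, and other modalities?
RQ3: How do embeddings, tokenization strategies, and downstream heads interact to produce competitive downstream performance across modalities?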

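Key Findings

The framework supports 12 modalities with a shared encoder and unpaired data, achieving competitive results across benchmarks.
When Meta-Transformer-B16_F is paired with the CLIP text encoder, zero-shot image classification on ImageNet-1K reaches 69.3%; tuning improves this to 85.4% with B16_T and 88.1% with L14_T.
On GLUE text understanding, the frozen model pretrained on images scores 54.6 (SST-2), 81.1 (MRPC), 66.0 (QQP), 63.4 (MNLI), and 56.3 (QNLI). After tuning, the scores are 81.3, 81.8, 78.0, 70.0, and 60.3, respectively.
For image understanding, Meta-Transformer-L14_T achieves a top-line accuracy of 88.1% on ImageNet-1K, 56.3% AP on object detection, and 55.0% mIoU on semantic segmentation (L14_T).
In the infrared, hyperspectral, and X-ray domains, B16_F achieves an IR Rank-1 of 73.50% and an mAP of 65.19%, plus 94.1% X-ray accuracy, with very few trainable parameters (0.75M).
Point cloud results show competitive performance with a 2D backbone; for example, Meta-Transformer-B16_F reaches 93.6% OA on ModelNet-40 and 83.5% mIoU on S3DIS Area-5, with a compact parameter count.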
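
To make the frozen-versus-tuned gap on GLUE easier to read, the short script below computes the per-task gain from the scores quoted in the findings above. It is simple arithmetic on those reported numbers, not a re-evaluation; the variable names are illustrative.

```python
# GLUE scores as quoted in the findings above (frozen vs. tuned encoder).
tasks = ["SST-2", "MRPC", "QQP", "MNLI", "QNLI"]
frozen = [54.6, 81.1, 66.0, 63.4, 56.3]
tuned = [81.3, 81.8, 78.0, 70.0, 60.3]

# Gain in points per task, rounded to one decimal to hide float noise.
gains = {task: round(after - before, 1)
         for task, before, after in zip(tasks, frozen, tuned)}

for task, gain in gains.items():
    print(f"{task}: +{gain:.1f}")

largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)
print(f"largest gain: {largest}, smallest gain: {smallest}")
```

On these figures, tuning helps SST-2 the most (26.7 points) and MRPC the least (0.7 points), so the frozen image-pretrained encoder is already close to its tuned score on MRPC but far from it on SST-2.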

This review was generated by AI and checked by a human editor.