QUICK REVIEW

[Paper Review] Perceiver: General Perception with Iterative Attention

Andrew Jaegle, Felix Gimeno|arXiv (Cornell University)|Mar 4, 2021

Neural dynamics and brain function | References: 91 | Cited by: 128

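One-sentence summary

The Perceiver is a Transformer-based architecture that scales to very large multimodal inputs by passing them through a cross-attention bottleneck into a small latent array and then applying latent self-attention iteratively. It achieves competitive results on images, audio, video, and point clouds without modality-specific priors.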

ABSTRACT

Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.

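Motivation and objectives

Motivate a general perception architecture that minimizes modality-specific priors.
Introduce the Perceiver, which makes attention scale by projecting high-dimensional inputs onto a small latent bottleneck.
Show competitive performance across diverse modalities without 2D convolutions or domain-specific priors.
Show how iterating cross-attention and latent self-attention yields deep representations from large inputs.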
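
The scaling argument behind the latent bottleneck can be made concrete by counting query-key pairs, which is what dominates the cost of attention. The sketch below is only a back-of-the-envelope count, not the authors' analysis: the function names are hypothetical, and the values N = 512 latents and L = 48 layers are illustrative assumptions rather than figures stated in this review. M = 224 x 224 = 50,176 is the pixel count of an ImageNet-sized image.

```python
def full_self_attention_pairs(num_inputs, num_layers):
    """Query-key pairs when every layer attends over all inputs: L * M^2."""
    return num_layers * num_inputs * num_inputs


def latent_bottleneck_pairs(num_inputs, num_latents, num_layers):
    """Query-key pairs with one cross-attention into the latents (M * N)
    followed by num_layers latent self-attention layers (L * N^2)."""
    cross = num_inputs * num_latents
    latent = num_layers * num_latents * num_latents
    return cross + latent


M = 224 * 224  # 50,176 input pixels
N = 512        # assumed latent array size (illustrative)
L = 48         # assumed number of latent layers (illustrative)

print(full_self_attention_pairs(M, L))   # prints: 120846286848
print(latent_bottleneck_pairs(M, N, L))  # prints: 38273024
```

Under these assumed sizes the bottleneck needs roughly 3,000 times fewer query-key pairs, and adding depth only grows the N^2 term, which does not depend on the input size.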

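Proposed method

Use a cross-attention module to map the high-dimensional input byte array (M elements) onto a fixed-size latent array (N elements, with N << M).
Process the latent array with a deep Transformer in latent space; each latent self-attention layer costs O(N^2).
Alternate cross-attention blocks and latent self-attention blocks iteratively to refine the representation of the input.
Share weights across the repeated cross-attention modules and across the latent Transformer blocks, which improves parameter efficiency and makes deep architectures practical.
Attach position and modality information to every input element, using scalable Fourier features or learned encodings, so that spatial and temporal structure is preserved.
Optionally use multiple cross-attention layers to improve information extraction from the input.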
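
The steps above can be sketched in a few dozen lines of plain Python. This is a deliberately reduced illustration and not the authors' implementation: it uses a single attention head, omits layer normalization and the MLP sublayers, initializes everything randomly, and reuses one set of cross-attention weights and one set of latent weights at every iteration as a simple form of weight sharing. All function names and the tiny sizes (C = 3 input channels, N = 4 latents, D = 8 latent channels) are hypothetical choices made for readability.

```python
import math
import random


def matmul(a, b):
    """Multiply matrices stored as lists of rows: (n, k) @ (k, m) -> (n, m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query_src, kv_src, w):
    """Single-head attention with a residual connection.
    Queries come from query_src; keys and values come from kv_src."""
    q = matmul(query_src, w["q"])
    k = matmul(kv_src, w["k"])
    v = matmul(kv_src, w["v"])
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    update = matmul([softmax(r) for r in scores], v)
    return [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(query_src, update)]


def random_matrix(rng, rows, cols):
    """Gaussian matrix scaled by 1/sqrt(rows) to keep activations moderate."""
    std = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def perceiver_encode(inputs, latents, cross_w, latent_w, iterations, latent_layers):
    """Alternate cross-attention (latents read the inputs) with latent
    self-attention, reusing the same weights at every iteration."""
    for _ in range(iterations):
        latents = attend(latents, inputs, cross_w)        # score matrix: N x M
        for _ in range(latent_layers):
            latents = attend(latents, latents, latent_w)  # score matrix: N x N
    return latents


def run_demo(num_inputs, seed=0):
    """Build random inputs and weights, then encode them into the latent array."""
    rng = random.Random(seed)
    C, N, D = 3, 4, 8  # input channels, latent count, latent channels
    inputs = random_matrix(rng, num_inputs, C)
    latents = random_matrix(rng, N, D)
    cross_w = {"q": random_matrix(rng, D, D),
               "k": random_matrix(rng, C, D),
               "v": random_matrix(rng, C, D)}
    latent_w = {name: random_matrix(rng, D, D) for name in ("q", "k", "v")}
    return perceiver_encode(inputs, latents, cross_w, latent_w,
                            iterations=2, latent_layers=3)


out = run_demo(num_inputs=64)
print(len(out), len(out[0]))  # prints: 4 8
```

Calling `run_demo` with a larger `num_inputs` changes only the width of the cross-attention score matrix; the output stays N x D, which is the sense in which the latent array acts as a bottleneck.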

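Experimental results

Research questions

RQ1: Can a modality-agnostic, Transformer-based architecture achieve competitive perceptual performance on images, audio, video, and point clouds?
RQ2: Can an asymmetric cross-attention bottleneck scale to tens of thousands of inputs while maintaining accuracy?
RQ3: How do Fourier-based position encodings affect performance across modalities and robustness to the ordering of the inputs?
RQ4: What is the trade-off between the number of cross-attention layers and the depth of the latent Transformer, and how does weight sharing affect efficiency and accuracy?
RQ5: How does the Perceiver compare with specialized architectures (e.g., ResNet-50, ViT) on ImageNet, AudioSet, and ModelNet40?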

Key findings

Model	Input	Top-1 accuracy on ImageNet (%)
ResNet-50 (FF)	RGB + Fourier features	73.5
ViT-B-16 (FF)	RGB + Fourier features	76.7
Transformer (64x64, FF)	64x64 downsampled inputs	57.0
Perceiver (FF)	Input pixels with Fourier features	78.0
Perceiver (Learned pos.)	Input pixels with learned pos.	70.9

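Achieves competitive ImageNet top-1 accuracy without 2D convolutions while attending directly to 50,176 input pixels.
Performs strongly on AudioSet when given raw audio, video, or both.
Also shows competitive results on ModelNet40 point-cloud classification.
The latent bottleneck decouples network depth from input size, which permits very deep models: total attention cost is O(MN + LN^2) for M input elements, N latents, and L latent Transformer layers.
Sharing weights across the cross-attention and Transformer blocks cuts the parameter count roughly tenfold and improves generalization.
Fourier feature encodings preserve spatial and temporal structure without hard architectural priors.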
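
As a rough illustration of how Fourier features can carry position without any grid-specific architecture, the sketch below encodes each coordinate, normalized to [-1, 1], as its raw value plus sine and cosine pairs at linearly spaced frequencies. The function names, the band count, and the maximum frequency are hypothetical choices for this example, and the exact parameterization is an assumption about the general technique rather than something stated in this review.

```python
import math


def fourier_features(pos, num_bands, max_freq):
    """Encode a position (one coordinate per dimension, each in [-1, 1]).
    Returns, per coordinate: the raw value, then sin and cos at num_bands
    frequencies linearly spaced from 1 to max_freq / 2."""
    if num_bands == 1:
        freqs = [1.0]
    else:
        step = (max_freq / 2.0 - 1.0) / (num_bands - 1)
        freqs = [1.0 + i * step for i in range(num_bands)]
    feats = []
    for x in pos:
        feats.append(x)
        for f in freqs:
            feats.append(math.sin(math.pi * f * x))
            feats.append(math.cos(math.pi * f * x))
    return feats


def pixel_positions(height, width):
    """Normalized (row, column) coordinates in [-1, 1], in row-major order."""
    def norm(i, n):
        return 2.0 * i / (n - 1) - 1.0 if n > 1 else 0.0
    return [(norm(r, height), norm(c, width))
            for r in range(height) for c in range(width)]


positions = pixel_positions(224, 224)
encoded = fourier_features(positions[0], num_bands=4, max_freq=8.0)
print(len(positions), len(encoded))  # prints: 50176 18
```

Each pixel's channel values would then be concatenated with its feature vector to form one row of the input array, so the position travels with the data and the model itself needs no built-in notion of a 2D grid.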


This review was generated by AI and checked by a human editor.