QUICK REVIEW

[Paper Review] LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Ben Graham, Alaaeldin El-Nouby|arXiv (Cornell University)|Apr 2, 2021

Advanced Neural Network Applications|References: 66|Citations: 91

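One-sentence summary

LeViT proposes a pyramid-structured Vision Transformer whose blocks remain competitive in speed with the narrower DeiT blocks: the cost of wider blocks is offset by design choices such as a smaller MLP expansion factor, which speeds up inference. The supplementary material provides detailed per-block timings and attention bias visualizations.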

ABSTRACT

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. We release the code at https://github.com/facebookresearch/LeViT

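Motivation and objectives

Speed up Vision Transformer inference by rethinking block design and the pyramid structure.
Characterize the runtime of LeViT blocks against DeiT blocks at the same resolution and compute budget.
Investigate how the pyramid structure and block width affect overall efficiency.
Provide ablations and visualizations that explain attention behavior across LeViT blocks.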

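Proposed method

Compare the block designs of DeiT-tiny and LeViT-256 at 14x14 resolution, measuring their runtimes side by side.
Break down the contributions of LayerNorm, Q/K, V, QK^T, AV, the attention projection, and the MLP to total runtime.
Present ablations that remove the pyramid structure and that widen or otherwise adjust the blocks, to understand where the efficiency gains come from.
Visualize attention bias maps to interpret how different heads attend to relative pixel positions.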
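
To make the per-component breakdown above concrete, the following is a minimal sketch that counts approximate multiply-accumulate operations (MACs) for one attention + MLP block at 14x14 resolution. Only the widths (192 and 256) and the MLP expansion factors (4 and 2) are taken from this review; the head counts and per-head key/value dimensions are illustrative assumptions, the helper name block_macs is hypothetical, and LayerNorm is omitted. MAC counts are not runtimes, so this only indicates where arithmetic cost shifts, not the measured timings.

```python
# Rough MAC counts for one attention + MLP block.
# ASSUMPTION: head counts and key/value dims are illustrative, not from the review.
# Only width (192 vs 256) and MLP expansion (4 vs 2) come from the review text.

def block_macs(tokens, width, heads, key_dim, value_dim, mlp_ratio):
    """Return approximate MACs per component for a single block."""
    return {
        "qk_proj": tokens * width * 2 * heads * key_dim,  # Q and K projections
        "v_proj": tokens * width * heads * value_dim,     # V projection
        "qk_t": heads * tokens * tokens * key_dim,        # QK^T
        "av": heads * tokens * tokens * value_dim,        # attention weights times V
        "attn_proj": tokens * heads * value_dim * width,  # attention output projection
        "mlp": 2 * tokens * width * mlp_ratio * width,    # two linear layers
    }

TOKENS = 14 * 14  # 14x14 activation map

# Hypothetical configurations standing in for the two block designs.
deit_like = block_macs(TOKENS, width=192, heads=3, key_dim=64, value_dim=64, mlp_ratio=4)
levit_like = block_macs(TOKENS, width=256, heads=4, key_dim=32, value_dim=64, mlp_ratio=2)

for name in deit_like:
    print(f"{name:10s} {deit_like[name]:>12,d} {levit_like[name]:>12,d}")
```

Under these assumed settings, the wider block does fewer MACs in QK^T and in the MLP but more in AV, which is one plausible arithmetic reading of a timing breakdown of this kind.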

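Experimental results

Research questions

RQ1: Does LeViT, with its pyramid structure and ConvNet-inspired design, achieve inference as fast as or faster than DeiT?
RQ2: How do the pyramid structure and block width affect each runtime component and overall efficiency?
RQ3: How do a smaller MLP expansion factor and the attention computation affect speed?
RQ4: What do the attention bias visualizations reveal about head specialization and information flow across LeViT blocks?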

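Key findings

LeViT-256 has a total runtime close to that of DeiT-tiny: under the same benchmark setting, LeViT-256 takes about 2365 μs in total and DeiT-tiny 2474 μs, a difference of roughly 4%.
LeViT spends less time on QK^T and more on the subsequent AV multiplication, even though its blocks are wider (C=256 versus 192).
The MLP runtime is reduced by halving the expansion factor from 4 to 2, which offsets part of the cost of the wider blocks.
The attention bias visualizations show that some heads focus on neighboring pixels, while others show uniform or directional patterns across stages, indicating diverse attention strategies.
The ablations show how removing the pyramid structure or widening the blocks affects overall performance and FLOP count.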
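
As an illustration of what an attention bias indexed by relative pixel position could look like, here is a minimal pure-Python sketch. It assumes, as a simplification not stated in this review, that each head holds one scalar bias per absolute offset (|dx|, |dy|) and adds it to the attention logits before the softmax. The function names and the toy bias table are hypothetical and are not taken from the released code.

```python
import math

def offset_index(height, width):
    """Map every (query, key) pixel pair to the index of its offset (|dx|, |dy|)."""
    positions = [(x, y) for x in range(height) for y in range(width)]
    offsets = {}  # (|dx|, |dy|) -> index into a per-head bias table
    index = []
    for (x1, y1) in positions:
        row = []
        for (x2, y2) in positions:
            key = (abs(x1 - x2), abs(y1 - y2))
            row.append(offsets.setdefault(key, len(offsets)))
        index.append(row)
    return index, offsets

def biased_attention_row(logits, bias_table, index_row):
    """Softmax over one query's logits after adding the per-offset bias."""
    shifted = [l + bias_table[i] for l, i in zip(logits, index_row)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

index, offsets = offset_index(14, 14)

# Toy bias for one hypothetical head: the bias decays with distance, so the
# head prefers nearby pixels. Dict order matches the assigned indices.
bias_table = [-(dx + dy) for (dx, dy) in offsets]

# With all-zero content logits, the attention pattern is set by the bias alone.
weights = biased_attention_row([0.0] * (14 * 14), bias_table, index[0])
```

A head with a flat bias table would attend uniformly in this toy setting, while one that penalizes only |dx| or only |dy| would produce a directional pattern, which matches the kinds of behavior the visualizations are described as showing.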

This review was generated by AI and checked by a human editor.