QUICK REVIEW

[Paper Review] Face Transformer for Recognition

Yaoyao Zhong, Weihong Deng|arXiv (Cornell University)|Mar 27, 2021

Face recognition and analysis | References: 31 | Citations: 45

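One-line summary

This paper explores applying Transformer models to face recognition, introduces overlapping patch tokens to capture inter-patch information, and shows results competitive with CNNs when the models are trained on a large-scale dataset.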

ABSTRACT

Recently there has been a growing interest in Transformer not only in NLP but also in computer vision. We wonder if transformer can be used in face recognition and whether it is better than CNNs. Therefore, we investigate the performance of Transformer models in face recognition. Considering the original Transformer may neglect the inter-patch information, we modify the patch generation process and make the tokens with sliding patches which overlaps with each others. The models are trained on CASIA-WebFace and MS-Celeb-1M databases, and evaluated on several mainstream benchmarks, including LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AGEDB and IJB-C databases. We demonstrate that Face Transformer models trained on a large-scale database, MS-Celeb-1M, achieve comparable performance as CNN with similar number of parameters and MACs. To facilitate further researches, Face Transformer models and codes are available at https://github.com/zhongyy/Face-Transformer.

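Motivation and objectives

Investigate the feasibility of applying the Transformer architecture to face recognition.
Evaluate whether Transformers match or exceed CNNs at a similar number of parameters and MACs.
Analyze how overlapping patches affect the capture of inter-patch information.
Evaluate Transformer models trained on large-scale face datasets on standard benchmarks.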

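Proposed method

Adapt a ViT-style Transformer so that overlapping image patches are generated as tokens.
Use a trainable linear projection to map each patch to the model dimension D.
Concatenate a class token with the patch tokens and apply a standard Transformer encoder with LayerNorm and residual connections.
Train with the CosFace loss to obtain more discriminative embeddings.
Compare against ResNet-100 and other Vision Transformers such as ViT and T2T-ViT, using different training datasets.
Supervise the output embedding with a cosine-margin softmax loss so that it can be used for verification.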
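
The patch-generation and loss steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`num_patches`, `sliding_patches`, `cosface_logits`) are hypothetical, no padding is applied, and the scale `s=64.0` and margin `m=0.35` are assumed illustrative defaults rather than values taken from this review. The point is only to show that a stride smaller than the patch size makes neighboring tokens share pixels, and that a cosine-margin loss subtracts a margin from the target-class cosine before scaling.

```python
import math


def num_patches(size, patch, stride):
    """Number of sliding-window positions along one axis (no padding)."""
    return (size - patch) // stride + 1


def sliding_patches(image, patch, stride):
    """Flatten every patch x patch window of a 2-D image (list of rows) into a token.

    With stride < patch, neighboring windows overlap and share pixels.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def cosface_logits(embedding, class_weights, label, s=64.0, m=0.35):
    """Cosine-margin logits: s * (cos - m) for the target class, s * cos otherwise.

    s and m are assumed example values, not settings reported in the review.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, unit(w)))
        logits.append(s * (cos - m if j == label else cos))
    return logits


# Toy 6x6 image: 4x4 patches with stride 2 overlap by two pixels per axis.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
tokens = sliding_patches(image, patch=4, stride=2)
print(len(tokens), len(tokens[0]))  # prints "4 16"
```

Under these assumptions, a 112x112 input with 12x12 patches and stride 8 gives `num_patches(112, 12, 8) ** 2 = 169` overlapping tokens, compared with 196 tokens for non-overlapping 8x8 patches. In a real model each flattened token would then pass through the trainable linear projection to dimension D before the class token is concatenated.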

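Experimental results

Research questions

RQ1: When trained on a large-scale dataset, can Transformer models be effective for face recognition compared with CNNs?
RQ2: Does generating overlapping patch tokens improve the capture of inter-patch information and recognition performance?
RQ3: How does a Face Transformer trained on MS-Celeb-1M perform on mainstream benchmarks compared with CNN-based models of similar computational cost?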

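Key findings

Face Transformer models trained on MS-Celeb-1M achieve accuracy competitive with a CNN that has a similar number of parameters and MACs.
Overlapping patch tokens (sliding patches) improve performance over non-overlapping ViT variants.
When trained on a large-scale dataset, Transformer models show strong results on benchmarks such as TALFW.
As occlusion increases, the Face Transformer is no more robust to occlusion than ResNet-100.
Attention analysis shows that the model attends to the face region, which supports the validity of the design.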


This review was generated by AI and checked by a human editor.