QUICK REVIEW

[Paper Review] MetaFormer: A Unified Meta Framework for Fine-Grained Recognition

Qishuai Diao, Yi Jiang | arXiv (Cornell University) | Mar 5, 2022

Domain Adaptation and Few-Shot Learning | Cited by 27

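One-line summary

MetaFormer uses a hybrid ConvNet-Transformer backbone to fuse visual information with diverse meta-information (geography, attributes, text) for fine-grained recognition. It achieves state-of-the-art results on several FGVC benchmarks and serves as a strong baseline with or without meta-information.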

ABSTRACT

Fine-Grained Visual Classification (FGVC) is the task that requires recognizing the objects belonging to multiple subordinate categories of a super-category. Recent state-of-the-art methods usually design sophisticated learning pipelines to tackle this task. However, visual information alone is often not sufficient to accurately differentiate between fine-grained visual categories. Nowadays, the meta-information (e.g., spatio-temporal prior, attribute, and text description) usually appears along with the images. This inspires us to ask the question: Is it possible to use a unified and simple framework to utilize various meta-information to assist in fine-grained identification? To answer this problem, we explore a unified and strong meta-framework (MetaFormer) for fine-grained visual classification. In practice, MetaFormer provides a simple yet effective approach to address the joint learning of vision and various meta-information. Moreover, MetaFormer also provides a strong baseline for FGVC without bells and whistles. Extensive experiments demonstrate that MetaFormer can effectively use various meta-information to improve the performance of fine-grained recognition. In a fair comparison, MetaFormer can outperform the current SotA approaches with only vision information on the iNaturalist2017 and iNaturalist2018 datasets. Adding meta-information, MetaFormer can exceed the current SotA approaches by 5.9% and 5.3%, respectively. Moreover, MetaFormer can achieve 92.3% and 92.7% on CUB-200-2011 and NABirds, which significantly outperforms the SotA approaches. The source code and pre-trained models are released at https://github.com/dqshuai/MetaFormer.

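Motivation and objectives

Motivate the need to use multi-source meta-information, beyond purely visual information, in FGVC tasks.
Propose a unified, simple framework that fuses visual information with various kinds of meta-information without task-specific additions.
Evaluate how large-scale pre-training and meta-information affect FGVC performance across model sizes.
Provide a strong FGVC baseline across multiple datasets, using either vision alone or vision plus meta-information.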

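Proposed method

Introduce a hybrid framework in which convolutions encode visual information and transformer layers fuse visual information with meta-information.
Encode meta-information, including geographic information, attributes, and text, through non-linear embeddings and fuse it via Relative Transformer Layers.
Use an aggregation layer that combines the class tokens from different stages for the final prediction.
Adopt overlapping patch embeddings and a staged network design to manage downsampling and computational cost.
Run experiments with several pre-training regimes (ImageNet-1k, ImageNet-21k, iNaturalist) to examine their effect on FGVC performance.
Visualize how the model uses meta-information and analyze the effect of pre-training on downstream FGVC tasks.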
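
The token-level fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation. The unit-sphere and sin/cos encoding of latitude, longitude, and date is an assumption (the review does not state the encoding), and so are all function names, dimensions, and random weights. Attention here uses identity projections and a single head, omits the relative position bias of the Relative Transformer Layers, and leaves out the aggregation layer over stages.

```python
import math
import random


def encode_geo_date(lat_deg, lon_deg, day_of_year):
    """Map (lat, lon, date) to a continuous 5-d vector (assumed encoding)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    angle = 2 * math.pi * day_of_year / 365.0
    return [
        math.cos(lat) * math.cos(lon),  # x on the unit sphere
        math.cos(lat) * math.sin(lon),  # y on the unit sphere
        math.sin(lat),                  # z on the unit sphere
        math.sin(angle),                # cyclic date feature
        math.cos(angle),
    ]


def init_linear(rng, out_dim, in_dim):
    """Random weights for a linear layer (hypothetical; stands in for learned ones)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def meta_token(meta_vec, params):
    """Non-linear embedding: linear -> ReLU -> linear, giving one meta token."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(meta_vec, w1, b1)]
    return linear(hidden, w2, b2)


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[i] for w, tok in zip(weights, tokens))
                    for i in range(dim)])
    return out


def fuse(class_token, meta_tok, visual_tokens):
    """Attend over [class, meta, visual...] and return the updated class token."""
    sequence = [class_token, meta_tok] + visual_tokens
    attended = self_attention(sequence)
    # Residual connection, then read out the class-token position.
    return [a + b for a, b in zip(sequence[0], attended[0])]


rng = random.Random(0)
dim = 8
params = (init_linear(rng, 16, 5), init_linear(rng, dim, 16))
geo = encode_geo_date(35.68, 139.69, 120)      # example coordinates and day
meta_tok = meta_token(geo, params)
class_token = [0.0] * dim
visual_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
fused_class = fuse(class_token, meta_tok, visual_tokens)
```

Because the meta-information enters as one more token in the sequence, the same attention layers handle vision-only input (drop the meta token) and vision plus meta-information without any architectural change. This is the property the review points to when it calls the framework unified.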

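Experimental results

Research questions

RQ1: Can a unified transformer-based framework effectively fuse visual information and diverse meta-information for FGVC without task-specific prior knowledge?
RQ2: How does meta-information affect FGVC performance on datasets such as iNaturalist, CUB-200-2011, and NABirds under different pre-training regimes?
RQ3: What role does large-scale pre-training play in reaching SotA with MetaFormer?
RQ4: Does MetaFormer provide a strong vision-only baseline for FGVC, and does it remain robust when meta-information is integrated?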

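Key findings

MetaFormer achieves state-of-the-art performance on CUB-200-2011 and NABirds with vision-only input.
Adding meta-information yields further improvements across iNaturalist 2017/2018/2021, and the gains become more pronounced as the model's visual capability increases.
With larger-scale pre-trained models (e.g., ImageNet-21k), MetaFormer reaches 92.3% on CUB-200-2011 and 92.7% on NABirds, surpassing previous SotA methods.
On iNaturalist 2017/2018, MetaFormer-1 achieves 78.2% and 81.9% with ImageNet-1k pre-training, and higher scores with ImageNet-21k pre-training (79.4% and 83.2%).
MetaFormer provides a simple yet strong baseline for FGVC and shows that meta-information can be integrated flexibly through a single transformer-based fusion mechanism.
The study highlights how strongly the choice of pre-training affects FGVC performance, showing that domain-related pre-training (iNaturalist) sometimes outperforms ImageNet-21k.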


This review was generated by AI and checked by a human editor.