QUICK REVIEW

[Paper Review] MetaFormer: A Unified Meta Framework for Fine-Grained Recognition

Qishuai Diao, Yi Jiang|arXiv (Cornell University)|2022. 03. 05.

Domain Adaptation and Few-Shot Learning | Citations: 27

One-Line Summary

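MetaFormer uses a hybrid ConvNet-Transformer backbone to fuse visual information with various meta-information (geography, attributes, text), achieving state-of-the-art results in fine-grained recognition and providing a strong baseline with or without meta-information.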

ABSTRACT

Fine-Grained Visual Classification (FGVC) is the task that requires recognizing the objects belonging to multiple subordinate categories of a super-category. Recent state-of-the-art methods usually design sophisticated learning pipelines to tackle this task. However, visual information alone is often not sufficient to accurately differentiate between fine-grained visual categories. Nowadays, the meta-information (e.g., spatio-temporal prior, attribute, and text description) usually appears along with the images. This inspires us to ask the question: Is it possible to use a unified and simple framework to utilize various meta-information to assist in fine-grained identification? To answer this problem, we explore a unified and strong meta-framework (MetaFormer) for fine-grained visual classification. In practice, MetaFormer provides a simple yet effective approach to address the joint learning of vision and various meta-information. Moreover, MetaFormer also provides a strong baseline for FGVC without bells and whistles. Extensive experiments demonstrate that MetaFormer can effectively use various meta-information to improve the performance of fine-grained recognition. In a fair comparison, MetaFormer can outperform the current SotA approaches with only vision information on the iNaturalist2017 and iNaturalist2018 datasets. Adding meta-information, MetaFormer can exceed the current SotA approaches by 5.9% and 5.3%, respectively. Moreover, MetaFormer can achieve 92.3% and 92.7% on CUB-200-2011 and NABirds, which significantly outperforms the SotA approaches. The source code and pre-trained models are released at https://github.com/dqshuai/MetaFormer.

Research Motivation and Objectives

Motivate the need to leverage multi-source meta-information beyond pure vision for FGVC tasks.
Propose a unified, simple framework that fuses vision and various meta-information without task-specific bells and whistles.
Assess how large-scale pre-training and meta-information affect FGVC performance across model sizes.
Provide strong baselines for FGVC across multiple datasets, both with vision information only and with added meta-information.

Proposed Method

Introduce a hybrid framework in which convolution encodes vision and transformer layers fuse vision with meta-information.
Encode meta-information (geolocation, attributes, and text) via non-linear embeddings and fuse it through Relative Transformer Layers.
Use an aggregate layer to combine class tokens from different stages for the final prediction.
Employ overlapping patch embedding and a staged network design to manage downsampling and computational cost.
Experiment with various pre-training regimes (ImageNet-1k, ImageNet-21k, iNaturalist) to study their effect on FGVC performance.
Visualize how the model uses meta-information and analyze the impact of pre-training on downstream FGVC tasks.
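
To make the fusion idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes geolocation is first mapped to a point on the unit sphere, then passed through a two-layer non-linear embedding, and finally placed in one token sequence with the class token and patch tokens so that self-attention can mix them. All function names, the dimension of 8, the random weights standing in for learned parameters, and the identity query/key/value projections are illustrative assumptions; the relative position bias of the Relative Transformer Layers is omitted.

```python
import math
import random

def geo_to_features(lat_deg, lon_deg):
    """Map latitude/longitude to a point on the unit sphere (assumed preprocessing)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return [math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat)]

def make_linear(n_in, n_out, rng):
    """Random weight matrix and zero bias standing in for learned parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    """Apply y = Wx + b for one input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def embed_meta(features, layer1, layer2):
    """Non-linear embedding: linear -> ReLU -> linear."""
    hidden = [max(0.0, h) for h in linear(features, layer1)]
    return linear(hidden, layer2)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity projections: each token attends to all tokens."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens])
        out.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)])
    return out

def fuse(class_token, meta_token, patch_tokens):
    """Put class, meta, and patch tokens in one sequence; return the updated class token."""
    sequence = [class_token, meta_token] + patch_tokens
    return self_attention(sequence)[0]

rng = random.Random(0)
DIM = 8  # illustrative token width, not taken from the paper
layer1 = make_linear(3, 16, rng)
layer2 = make_linear(16, DIM, rng)
meta_token = embed_meta(geo_to_features(37.57, 126.98), layer1, layer2)
class_token = [0.0] * DIM
patch_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]
fused_class_token = fuse(class_token, meta_token, patch_tokens)
```

Because the class token starts at zero in this sketch, its attention weights are uniform and the fused class token is simply the mean of all tokens. In a trained model, the learned projections would decide how much weight the meta token receives.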

Experimental Results

Research Questions

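RQ1: Can a single unified transformer-based framework effectively fuse vision and various meta-information for FGVC without task-specific prior knowledge?
RQ2: How does meta-information affect FGVC performance on datasets such as iNaturalist, CUB-200-2011, and NABirds under different pre-training regimes?
RQ3: What role does large-scale pre-training play in achieving SotA results with MetaFormer?
RQ4: Does MetaFormer provide a strong vision-only baseline and remain robust when meta-information is integrated?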

Key Findings

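MetaFormer achieves state-of-the-art performance on CUB-200-2011 and NABirds with vision input alone.
Adding meta-information yields further gains on iNaturalist 2017/2018/2021, and the improvement becomes more apparent as the model's visual capacity increases.
With a model pre-trained at larger scale (ImageNet-21k), MetaFormer reaches 92.3% on CUB-200-2011 and 92.7% on NABirds, surpassing previous SotA methods.
On iNaturalist 2017/2018, MetaFormer-1 reaches 78.2% and 81.9% with ImageNet-1k pre-training, rising to 79.4% and 83.2% with ImageNet-21k pre-training.
MetaFormer provides a simple yet strong baseline for FGVC and shows that meta-information can be flexibly integrated through a single transformer-based fusion mechanism.
The study highlights the significant impact of the pre-training choice on FGVC performance; domain-relevant pre-training (iNaturalist) sometimes outperforms ImageNet-21k.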
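
As a quick worked check on the pre-training comparison, the sketch below takes the MetaFormer-1 accuracies quoted in the findings and computes the absolute gain from ImageNet-21k pre-training, plus the relative error reduction. The relative error reduction is a derived quantity that does not appear in the source; the variable and function names are illustrative.

```python
# Accuracy (%) for MetaFormer-1 as quoted in the findings.
accuracy = {
    "iNaturalist 2017": {"ImageNet-1k": 78.2, "ImageNet-21k": 79.4},
    "iNaturalist 2018": {"ImageNet-1k": 81.9, "ImageNet-21k": 83.2},
}

def pretraining_gain(scores):
    """Return (absolute gain in points, relative error reduction in percent)."""
    base, better = scores["ImageNet-1k"], scores["ImageNet-21k"]
    gain = better - base
    # Error rate is 100 - accuracy; express the gain as a share of the base error.
    error_reduction = 100.0 * gain / (100.0 - base)
    return round(gain, 1), round(error_reduction, 1)

gains = {name: pretraining_gain(scores) for name, scores in accuracy.items()}
for name, (gain, reduction) in gains.items():
    print(f"{name}: +{gain} points, {reduction}% fewer errors")
```

The gains of 1.2 and 1.3 points look small in absolute terms, but they correspond to removing roughly 5.5% and 7.2% of the remaining errors.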


This review was generated by AI and reviewed by a human editor.