QUICK REVIEW

[Paper Review] Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

Jun Wang, Xiaohan Yu|arXiv (Cornell University)|2021. 07. 06.

Advanced Neural Network Applications | References: 38 | Citations: 84

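One-Line Summary

FFVT adds a feature fusion mechanism and Mutual Attention Weight Selection (MAWS) to ViT, aggregating local, middle-level, and high-level tokens for FGVC, and reports state-of-the-art performance on four FGVC benchmarks.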

ABSTRACT

The core for tackling the fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via CNN-based approaches. However, these methods enhance the computational complexity and make the model dominated by the regions containing the most of the objects. Recently, vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. The self-attention mechanism aggregates and weights the information from all patches to the classification token, making it perfectly suitable for FGVC. Nonetheless, the classification token in the deep layer pays more attention to the global information, lacking the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework Feature Fusion Vision Transformer (FFVT) where we aggregate the important tokens from each transformer layer to compensate the local, low-level and middle-level information. We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens without introducing extra parameters. We verify the effectiveness of FFVT on three benchmarks where FFVT achieves the state-of-the-art performance.

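Motivation and Objectives

Address the need for discriminative local features in FGVC.
Use a vision transformer to capture both global and local information without CNN-specific biases.
Develop a token selection mechanism that picks informative tokens across layers without adding parameters.
Fuse multi-level tokens so that the final classifier also receives local, middle-level, and high-level information.
Validate on four FGVC benchmarks, including small-scale and ultra-fine-grained datasets.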

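Proposed Method

Processes the image with a pure ViT backbone that splits it into patches and uses a class token for classification.
Introduces a Feature Fusion Module that replaces the input of the last transformer layer, except for the class token, with tokens selected from the preceding layers.
Proposes Mutual Attention Weight Selection (MAWS), which picks discriminative tokens for fusion based on self-attention scores.
Normalizes the attention scores from the class-token side and from the patch-token side, multiplies them into a mutual attention weight, and selects the top-K tokens in each layer.
Aggregates the K local, middle-level, and high-level tokens from each layer into z_local, combines them with the class token at the input of the last layer to form z_ff, and passes the result on to the final classifier.
Keeps no additional learnable parameters in MAWS; token selection relies only on attention-based signals.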
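
The selection and fusion steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `maws_select` and `fuse` are hypothetical, the attention matrix is assumed to be already averaged over heads, and softmax is assumed as the normalization on both the class-token side and the patch-token side.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def maws_select(attn, k):
    """Return indices of the k patch tokens with the highest mutual attention.

    attn is an (N+1) x (N+1) attention-score matrix (assumed head-averaged),
    where index 0 is the class token and 1..N are patch tokens.
    """
    n = len(attn)
    # a: how strongly the class token attends to each patch token (row 0)
    a = softmax([attn[0][j] for j in range(1, n)])
    # b: how strongly each patch token attends to the class token (column 0)
    b = softmax([attn[i][0] for i in range(1, n)])
    # Mutual attention weight: product of the two normalized scores
    mutual = [ai * bi for ai, bi in zip(a, b)]
    ranked = sorted(range(len(mutual)), key=lambda i: mutual[i], reverse=True)
    return [i + 1 for i in ranked[:k]]  # shift back to token indices

def fuse(class_token, layer_tokens, layer_attn, k):
    """Build z_ff: the class token followed by the tokens selected per layer.

    layer_tokens[l] and layer_attn[l] hold the tokens and attention scores of
    layer l (all layers before the last one); no learnable parameters are used.
    """
    z_local = []
    for tokens, attn in zip(layer_tokens, layer_attn):
        z_local.extend(tokens[idx] for idx in maws_select(attn, k))
    return [class_token] + z_local

# Toy example: one class token plus three patch tokens
attn = [
    [0.0, 2.0, 0.5, 1.0],  # class token -> patches
    [1.5, 0.0, 0.0, 0.0],  # patch 1 -> class token
    [0.2, 0.0, 0.0, 0.0],  # patch 2 -> class token
    [1.4, 0.0, 0.0, 0.0],  # patch 3 -> class token
]
print(maws_select(attn, 2))  # [1, 3]
```

Because the ranking reuses scores the backbone already computes, the selection adds no weights to train, which matches the parameter-free claim in the list above. In a real model the fused sequence would then be fed through the last transformer layer rather than handled as a Python list.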

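Experimental Results

Research Questions

RQ1: Can a pure transformer architecture achieve competitive FGVC performance on small-scale and ultra-fine-grained datasets through multi-layer token fusion?
RQ2: Does selectively fusing local and middle-level tokens across layers improve FGVC performance over using only the class token information from the final layer?
RQ3: Is MAWS-based token selection effective and efficient without adding learnable parameters?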

Key Results

Method	Backbone	Dataset	Accuracy (%)
ViT	ViT-B_16	CUB-200-2011	90.8
TransFG	ViT-B_16	CUB-200-2011	91.7
FFVT	ViT-B_16	CUB-200-2011	91.6
ViT	ViT-B_16	Stanford Dogs	90.2
FFVT	ViT-B_16	Stanford Dogs	91.5
TransFG	ViT-B_16	Stanford Dogs	92.3
FFVT	ViT-B_16	CottonCultivar80	57.92
FFVT	ViT-B_16	SoyCultivarLocal	44.17

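FFVT reports state-of-the-art or near state-of-the-art results across four FGVC benchmarks and outperforms many CNN-based approaches.
On CUB-200-2011 with ViT-B_16, FFVT reaches 91.6% accuracy, close to TransFG at 91.7%.
On Stanford Dogs, FFVT reaches 91.5% accuracy, 1.3 percentage points above the ViT baseline (90.2%) but 0.8 percentage points below TransFG (92.3%).
On CottonCultivar80, FFVT reaches 57.92% accuracy, the highest among the reported methods.
On SoyCultivarLocal, FFVT reaches 44.17% accuracy, higher than previous methods.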
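
The gaps between methods can be checked directly from the accuracies in the table. The short sketch below recomputes them; the `results` dictionary and the `gap` helper are illustrative names, and only the two datasets for which the table lists competing methods are included.

```python
# Accuracies (%) copied from the results table above
results = {
    "CUB-200-2011": {"ViT": 90.8, "TransFG": 91.7, "FFVT": 91.6},
    "Stanford Dogs": {"ViT": 90.2, "TransFG": 92.3, "FFVT": 91.5},
}

def gap(dataset, other):
    """FFVT accuracy minus another method's accuracy, in percentage points."""
    scores = results[dataset]
    return round(scores["FFVT"] - scores[other], 1)

for dataset in results:
    print(dataset, "vs ViT:", gap(dataset, "ViT"), "vs TransFG:", gap(dataset, "TransFG"))
```

On both datasets the difference is positive against the ViT baseline and negative against TransFG, so the table supports a gain over plain ViT but not a lead over TransFG.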


This review was generated by AI and checked by a human editor.