QUICK REVIEW

[Paper Review] Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Luke Melas-Kyriazi|arXiv (Cornell University)|2021. 05. 06.

Advanced Neural Network Applications|References: 11|Citations: 80

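One-Line Summary

The paper replaces the attention layers of a Vision Transformer with feed-forward layers applied over the patch dimension and finds that the resulting FF-only model achieves strong top-1 accuracy on ImageNet, suggesting that attention may not be essential for competitive performance.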

ABSTRACT

The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.

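Motivation and Objectives

Determine whether attention is essential to Vision Transformer performance on ImageNet.
Assess how far a feed-forward-only architecture falls from ViT/DeiT models that use attention.
Understand which components contribute most to the strong performance of vision transformers.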

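Proposed Method

Replace the attention layers in ViT with feed-forward layers applied over the patch dimension.
Use the same structure and training procedure as the ViT/DeiT baselines for a fair comparison.
Train on ImageNet at 224px resolution in the ViT/DeiT tiny, base, and large configurations.
Compare FF-only networks against their attention-based counterparts at each model size.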
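The alternating patch/feature feed-forward structure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`init_mlp`, `mlp`, `ff_only_block`), the 4x hidden expansion, the width of 192, the 196 patches (224px input with an assumed 16px patch size), the random initialization, and the omission of a class token and learned normalization parameters are all choices made here for brevity.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) axis; learned scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_mlp(rng, dim, hidden):
    # Hypothetical helper: small random weights for a two-layer MLP.
    return {
        "w1": rng.normal(0.0, 0.02, (dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, dim)), "b2": np.zeros(dim),
    }

def mlp(x, p):
    # Two-layer feed-forward network applied over the last axis of x.
    return gelu(x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def ff_only_block(x, patch_mlp, feature_mlp):
    # x has shape (batch, num_patches, dim).
    # Step 1: feed-forward over the patch dimension (stands in for attention).
    y = np.swapaxes(layer_norm(x), 1, 2)   # (batch, dim, num_patches)
    y = np.swapaxes(mlp(y, patch_mlp), 1, 2)
    x = x + y                              # residual connection
    # Step 2: the usual feed-forward over the feature dimension.
    return x + mlp(layer_norm(x), feature_mlp)

rng = np.random.default_rng(0)
batch, num_patches, dim = 2, 196, 192      # illustrative sizes (assumptions)
x = rng.normal(size=(batch, num_patches, dim))
patch_mlp = init_mlp(rng, num_patches, 4 * num_patches)
feature_mlp = init_mlp(rng, dim, 4 * dim)
out = ff_only_block(x, patch_mlp, feature_mlp)
print(out.shape)  # (2, 196, 192)
```

The transpose is the whole trick: swapping the patch and feature axes lets the same kind of MLP mix information across patches, which is the role attention plays in a standard transformer block. Unlike attention, the mixing weights are fixed after training and tied to a specific number of patches.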

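Experimental Results

Research Questions

RQ1: How does removing the attention mechanism and using feed-forward layers over patches affect ImageNet top-1 accuracy?
RQ2: Which components drive the strong performance (patch embedding, training augmentation, etc.)?
RQ3: Can a feed-forward-only architecture achieve competitive results at standard ViT/DeiT sizes?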

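Key Findings

FF-only models achieve strong accuracy; for example, the base-size FF-only model reaches 74.9% top-1 accuracy on ImageNet.
Models without attention underperform attention-based models but remain surprisingly strong across sizes.
The base-size FF-only model is markedly more accurate than the tiny FF-only model but does not surpass ViT/DeiT with attention.
In the study's setting, the large FF-only model shows degraded performance relative to base/ViT.
A pure attention-only model (at small scale) performs poorly in this setting, highlighting that attention offers limited benefit without the feed-forward component.
The observed performance is attributable not only to the attention mechanism but also to the training procedure and the patch embedding.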
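To make the size of the gap concrete, the short snippet below computes the difference in percentage points between the base-size FF-only model and the attention-based baselines. It uses only the three top-1 figures quoted in the abstract; the dictionary layout and variable names are illustrative.

```python
# Top-1 accuracies (%) for base-size models, as quoted in the abstract.
top1 = {"FF-only": 74.9, "ViT": 77.9, "DeiT": 79.9}

# Gap in percentage points between each attention-based model and FF-only.
gaps = {
    name: round(acc - top1["FF-only"], 1)
    for name, acc in top1.items()
    if name != "FF-only"
}

for name, gap in gaps.items():
    print(f"{name} leads FF-only by {gap} points")
```

At base size, removing attention therefore costs 3.0 points against ViT and 5.0 points against DeiT.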

This review was generated by AI and reviewed by a human editor.