QUICK REVIEW

[Paper Review] Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Youwei Liang, Chongjian Ge|arXiv (Cornell University)|2022. 02. 16.

Advanced Neural Network Applications|Citations: 95

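One-Line Summary

EViT identifies the attentive tokens in a vision transformer and fuses the inattentive ones during training, which speeds up inference without adding parameters and, at the same computational cost, improves accuracy by admitting more input tokens.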

ABSTRACT

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA. Examples include that tokens containing semantically meaningless or distractive image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between MHSA and FFN (i.e., feed-forward network) modules, which is guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. To this end, our method EViT improves ViTs from two perspectives. First, under the same amount of input image tokens, our method reduces MHSA and FFN computation for efficient inference. For instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy is decreased by only 0.3% for ImageNet classification. Second, by maintaining the same computational cost, our method empowers ViTs to take more image tokens as input for recognition accuracy improvement, where the image tokens are from higher resolution images. An example is that we improve the recognition accuracy of DeiT-S by 1% for ImageNet classification at the same computational cost of a vanilla DeiT-S. Meanwhile, our method does not introduce more parameters to ViTs. Experiments on the standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit

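Research Motivation and Goals

Motivate the acceleration of vision transformers (ViTs) by identifying token-level redundancy in MHSA.
Propose a training-time token reorganization that preserves attentive tokens and fuses inattentive ones.
Show that EViT reduces MHSA and FFN computation at inference without adding parameters.
Demonstrate that, under the same computational budget, accepting more input tokens (from higher-resolution images) improves accuracy.
Explore the effect of using an oracle to guide token relevance, and compare against existing acceleration methods.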

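Proposed Method

Average, across the MHSA heads, the attention that the class token pays to each image token.
Keep the top-k attentive tokens and merge the inattentive tokens into a single fused token.
Fuse the inattentive tokens by a weighted average that uses the attention values as weights (x_fused = sum_{i in N} a_i x_i, where N is the set of inattentive tokens and a_i is the class-token attention to token i).
Introduce token reorganization into ViT training at selected layers, and apply a cosine schedule to the keep rate.
Optionally, train with an oracle ViT to identify the important tokens, and initialize EViT with the oracle's weights.
Demonstrate higher-resolution training by taking more input tokens at the same computational cost, and validate with ImageNet experiments.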
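
The first four steps above can be sketched in a few lines. This is a minimal illustration in pure Python, not the authors' implementation: the function names `reorganize_tokens` and `keep_rate_at` are hypothetical, plain lists stand in for tensors, rounding k up with `ceil` is an assumption, and the schedule is assumed to decay from 1.0 to the target keep rate.

```python
import math


def reorganize_tokens(tokens, cls_attn_per_head, keep_rate):
    """Keep the top-k attentive image tokens and fuse the rest into one token.

    tokens: list of N image-token vectors (class token excluded).
    cls_attn_per_head: list of H lists; each holds the class token's
        attention to the N image tokens in one head.
    keep_rate: fraction of image tokens to keep (0 < keep_rate <= 1).
    Returns (new_tokens, kept_indices).
    """
    n = len(tokens)
    num_heads = len(cls_attn_per_head)
    # Average the class-token attention over the heads.
    attn = [sum(head[i] for head in cls_attn_per_head) / num_heads
            for i in range(n)]

    k = math.ceil(keep_rate * n)  # assumption: round k up
    # Token indices sorted by attentiveness, highest first.
    order = sorted(range(n), key=lambda i: attn[i], reverse=True)
    kept_idx, fused_idx = order[:k], order[k:]

    kept = [tokens[i] for i in kept_idx]
    if not fused_idx:
        return kept, kept_idx

    dim = len(tokens[0])
    # x_fused = sum_{i in N} a_i * x_i over the inattentive set N.
    fused = [sum(attn[i] * tokens[i][d] for i in fused_idx)
             for d in range(dim)]
    return kept + [fused], kept_idx


def keep_rate_at(epoch, total_epochs, target_keep_rate):
    """Cosine decay of the keep rate from 1.0 (assumed) to the target."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return target_keep_rate + (1.0 - target_keep_rate) * cosine


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    heads = [[0.75, 0.0, 0.25, 0.0], [0.25, 0.25, 0.25, 0.25]]
    new_tokens, kept_idx = reorganize_tokens(tokens, heads, keep_rate=0.5)
    print(kept_idx, new_tokens)
    print(keep_rate_at(50, 100, 0.7))
```

In the toy call, tokens 0 and 2 receive the highest averaged attention and are kept, while tokens 1 and 3 are folded into the single fused vector appended at the end, so later layers see three image tokens instead of four.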

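Experimental Results

Research Questions

RQ1: Can token reorganization during ViT training reduce inference cost while maintaining accuracy?
RQ2: Does fusing inattentive tokens preserve more information and stabilize training compared with simply removing tokens?
RQ3: How does EViT perform under a fixed computational budget, and how does it perform at higher input resolution?
RQ4: How does guiding token selection with an oracle ViT affect accuracy and efficiency?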

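Key Results

EViT can speed up DeiT-S inference by about 50% at a loss of about 0.3% accuracy on ImageNet.
At the same MACs, EViT achieves higher throughput, and with higher-resolution inputs it can maintain or improve accuracy (e.g., DeiT-S gains 1% top-1 accuracy at the same computational cost).
Fusing inattentive tokens preserves information and improves training stability and accuracy beyond what token pruning alone readily achieves.
Training with an oracle improves accuracy further (e.g., DeiT-S rises from 79.8% to 80.7% in the oracle setting) while computation is maintained or reduced.
Compared with DynamicViT, EViT delivers better accuracy with fewer parameters at the same computational cost, and shows additional gains with longer training.
EViT applies to various ViT variants such as DeiT and LV-ViT, and offers a favorable accuracy-throughput trade-off depending on the configuration.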
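
To see where the savings come from, it can help to count how many tokens each layer processes once reorganization is applied. The sketch below is illustrative only: the helper `tokens_per_layer` is hypothetical, and the configuration used in the example (196 patches, 12 layers, reorganization after layers 4, 7, and 10, keep rate 0.7) is an assumption that this review does not state. The fused token is assumed to be treated as an ordinary image token at later reorganization steps.

```python
import math


def tokens_per_layer(num_patches, num_layers, reorg_layers, keep_rate):
    """Return the token count (class token included) entering each layer."""
    n = num_patches + 1  # image tokens plus the class token
    counts = []
    for layer in range(1, num_layers + 1):
        counts.append(n)
        if layer in reorg_layers:
            image_tokens = n - 1
            kept = math.ceil(keep_rate * image_tokens)
            # class token + kept tokens + one fused token if any were dropped
            n = 1 + kept + (1 if kept < image_tokens else 0)
    return counts


if __name__ == "__main__":
    # Assumed configuration, chosen only for illustration.
    counts = tokens_per_layer(196, 12, {4, 7, 10}, 0.7)
    for layer, count in enumerate(counts, start=1):
        print(f"layer {layer:2d}: {count} tokens")
```

Under these assumptions the token count entering a layer falls from 197 to 140, then 100, then 72, so the last two layers run MHSA and FFN on roughly a third of the original tokens.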

This review was generated by AI and reviewed by a human editor.