QUICK REVIEW

[Paper Review] Container: Context Aggregation Network

Peng Gao, Jiasen Lu|arXiv (Cornell University)|2021. 06. 02.

Data Stream Mining Techniques | References: 69 | Citations: 41

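One-Line Summary

Container unifies the CNN, Transformer, and MLP paradigms through static and dynamic affinity matrices for multi-head context aggregation, achieving strong image classification accuracy and competitive downstream-task performance with efficient training.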

ABSTRACT

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions a la Transformers while still exploiting the inductive bias of the local convolution operation leading to faster convergence speeds, often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named Container-Light, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 pts respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework. Code is released at https://github.com/allenai/container.

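Motivation and Objectives

Present a unified view of CNN, Transformer, and MLP architectures as variants of context aggregation.
Introduce the Container building block, which mixes static and dynamic affinity matrices to provide efficient long-range context.
Demonstrate the performance of Container and Container-Light on ImageNet, object detection, instance segmentation, and self-supervised learning.
Show advantages in convergence speed and data efficiency over pure Transformer backbones.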

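Proposed Method

Define a general context aggregation framework built around an affinity matrix A that captures neighborhood relationships.
Show how Transformers, depthwise convolution, and MLP-Mixer fit into the framework as special cases with different affinity matrices.
Introduce Container as a mixture of dynamic (A(X)) and static (A) affinities weighted by learnable coefficients (alpha, beta).
Provide Container-Light, which turns off the dynamic affinity in the early stages for high-resolution downstream tasks.
Describe a four-stage base architecture with patch embedding and two sub-modules per block (spatial aggregation and channel fusion).
Evaluate on ImageNet, object detection (RetinaNet, Mask R-CNN, DETR), and self-supervised learning (DINO).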
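
The mixture of dynamic and static affinities described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes a single head, a 1-D token sequence, Q = K = V = X with no learned projections, softmax dot-product attention as the dynamic affinity, a banded kernel shared across channels as the static affinity, and fixed rather than learned alpha and beta. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, so each token's affinities sum to 1."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def dynamic_affinity(q, k):
    """Input-dependent affinity A(X): scaled dot-product attention weights."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return softmax_rows([[s * scale for s in row] for row in scores])

def local_static_affinity(n, kernel):
    """Input-independent affinity A: a banded matrix that applies `kernel`
    around each token with zero padding (a 1-D convolution as a matrix)."""
    half = len(kernel) // 2
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for offset, w in enumerate(kernel):
            j = i + offset - half
            if 0 <= j < n:
                a[i][j] = w
    return a

def container_aggregate(v, a_dynamic, a_static, alpha, beta):
    """Single-head mix: Y = (alpha * A(X) + beta * A) V."""
    mixed = [[alpha * d + beta * s for d, s in zip(d_row, s_row)]
             for d_row, s_row in zip(a_dynamic, a_static)]
    return matmul(mixed, v)

# Toy input: 4 tokens with 2 channels each (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
a_dyn = dynamic_affinity(x, x)
a_sta = local_static_affinity(len(x), [0.25, 0.5, 0.25])

y_conv = container_aggregate(x, a_dyn, a_sta, alpha=0.0, beta=1.0)  # static only
y_attn = container_aggregate(x, a_dyn, a_sta, alpha=1.0, beta=0.0)  # dynamic only
y_mix = container_aggregate(x, a_dyn, a_sta, alpha=0.5, beta=0.5)   # mixture
```

Setting alpha to 0 leaves only the static, convolution-like path, which mirrors how Container-Light turns off the dynamic affinity in early stages. Setting beta to 0 leaves only the attention-like path.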

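Experimental Results

Research Questions

RQ1: Can a unified affinity-based context aggregation block match or surpass CNN, Transformer, and MLP backbones across vision tasks?
RQ2: Does combining static and dynamic affinity matrices yield better performance and convergence than using either one alone?
RQ3: How do Container and Container-Light perform on classification and high-resolution downstream tasks compared with state-of-the-art backbones?
RQ4: What data-efficiency and convergence advantages arise from the proposed framework?
RQ5: What qualitative patterns emerge in the static affinities learned across layers?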

Key Results

Family	Network	Top-1 Acc. (%)	Params	FLOPs	Throughput	Input Size	NAS
Container	Container	82.7	22.1 M	8.1 G	347.8	224^2	✗
Container-Light	Container-Light	82.0	20.0 M	3.2 G	1156.9	224^2	✗

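Container achieves 82.7% Top-1 accuracy on ImageNet with 22M parameters, 2.8 points ahead of DeiT-S.
Container converges to 79.9% Top-1 accuracy in 200 epochs, faster than DeiT-S, which trains for 300 epochs.
Container-Light enables strong downstream performance at ResNet-50-level compute, e.g., 43.8 mAP with RetinaNet, and 45.1 box mAP and 41.3 mask mAP with Mask-RCNN.
Container-Light improves over the ResNet-50 baseline in DETR and SMCA-DETR variants (e.g., 38.9 mAP for DETR-Container-Light).
In self-supervised learning (DINO), Container-Light outperforms DeiT in kNN accuracy throughout training (e.g., 71.5 vs. 69.6 at 100 epochs).
The static affinity extension (Container-Pam) provides small but consistent gains, and locality emerges in the early layers, where the learned static affinities resemble convolutions.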
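
The trade-off between the two models can be made concrete with a little arithmetic on the figures quoted in the table and the abstract. The sketch below only recombines those numbers. The ResNet-50 baselines are inferred by subtracting the reported gains from the reported scores and are not stated directly in this review. The variable names are illustrative.

```python
# Figures copied from the results table above.
container = {"top1": 82.7, "params_m": 22.1, "flops_g": 8.1, "throughput": 347.8}
container_light = {"top1": 82.0, "params_m": 20.0, "flops_g": 3.2, "throughput": 1156.9}

# Relative cost and speed of Container-Light versus Container.
flops_ratio = container_light["flops_g"] / container["flops_g"]
speedup = container_light["throughput"] / container["throughput"]
top1_gap = round(container["top1"] - container_light["top1"], 1)

# (score, gain over ResNet-50) pairs as reported in the abstract.
reported = {
    "DETR box mAP": (38.9, 6.6),
    "RetinaNet box mAP": (43.8, 7.3),
    "Mask-RCNN box mAP": (45.1, 6.9),
    "Mask-RCNN mask mAP": (41.3, 6.6),
}

# Inferred ResNet-50 baselines (score minus gain); an assumption by subtraction.
implied_baseline = {name: round(score - gain, 1) for name, (score, gain) in reported.items()}

print(f"FLOPs ratio: {flops_ratio:.2f}, throughput ratio: {speedup:.2f}, Top-1 gap: {top1_gap}")
for name, value in implied_baseline.items():
    print(f"{name}: implied ResNet-50 baseline {value}")
```

On these numbers, Container-Light uses roughly 40% of Container's FLOPs and reaches about 3.3 times its listed throughput, at a cost of 0.7 points of Top-1 accuracy.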


This review was generated by AI and reviewed by a human editor.