QUICK REVIEW

[Paper Review] Dual-stream Network for Visual Recognition

Mingyuan Mao, Renrui Zhang | arXiv (Cornell University) | May 31, 2021

Advanced Neural Network Applications | References: 55 | Citations: 36

One-line summary

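DS-Net introduces dual-stream blocks that process high-resolution local details and low-resolution global patterns separately, fuses the two through inter-scale alignment, and achieves strong performance on ImageNet and MSCOCO.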

ABSTRACT

Transformers with remarkable global representation capacities achieve competitive results for visual tasks, but fail to consider high-level local pattern information in input images. In this paper, we present a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. Our DS-Net can simultaneously calculate fine-grained and integrated features and efficiently fuse them. Specifically, we propose an Intra-scale Propagation module to process two different resolutions in each block and an Inter-Scale Alignment module to perform information interaction across features at dual scales. Besides, we also design a Dual-stream FPN (DS-FPN) to further enhance contextual information for downstream dense predictions. Without bells and whistles, the proposed DS-Net outperforms DeiT-Small by 2.4% in terms of top-1 accuracy on ImageNet-1k and achieves state-of-the-art performance over other Vision Transformers and ResNets. For object detection and instance segmentation, DS-Net-Small respectively outperforms ResNet-50 by 6.4% and 5.5% in terms of mAP on MSCOCO 2017, and surpasses the previous state-of-the-art scheme, which significantly demonstrates its potential to be a general backbone in vision tasks. The code will be released soon.

Motivation and goals

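Motivates the need to model local and global patterns jointly in vision models.
Proposes the Dual-stream Network (DS-Net), which preserves dual-scale representations for improved recognition.
Designs Intra-scale Propagation to process features at two resolutions and fuses them through Inter-scale Alignment.
Extends the dual-stream design to Feature Pyramid Networks as DS-FPN for dense prediction tasks.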

Proposed method

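Introduces Dual-stream Blocks (DS-Blocks) that split features into a local (high-resolution) path and a global (low-resolution) path.
Local features are processed with 3×3 depth-wise convolutions to capture fine-grained details.
Global features use self-attention over a downsampled token sequence to capture object-level relations.
Inter-scale Alignment, based on co-attention, fuses the local and global representations bidirectionally.
DS-Blocks are attached to the FPN to form DS-FPN, which enhances multi-scale context.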
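
The data flow through a single DS-Block can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. The equal channel split, the average-pooling downsampler, the single-head attention without learned query/key/value projections, the residual additions, and every function name (ds_block, depthwise_conv3x3, avg_pool, attention) are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv3x3(x, w):
    """Depth-wise 3x3 convolution with zero padding and stride 1.
    x: (C, H, W) feature map, w: (C, 3, 3), one kernel per channel."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

def avg_pool(x, k):
    """Non-overlapping k x k average pooling; H and W must be divisible by k."""
    C, H, W = x.shape
    return x.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ds_block(x, w_dw, pool=4):
    """Hypothetical DS-Block: returns (local tokens, global tokens)."""
    C, _, _ = x.shape
    c_local = C // 2  # assumed 1:1 split between local and global channels
    x_local, x_global = x[:c_local], x[c_local:]

    # Intra-scale propagation: each stream is processed at its own resolution.
    f_local = depthwise_conv3x3(x_local, w_dw)           # (C_l, H, W)
    g = avg_pool(x_global, pool)                          # (C_g, H/pool, W/pool)
    tokens_g = g.reshape(g.shape[0], -1).T                # (N_g, C_g)
    f_global = attention(tokens_g, tokens_g, tokens_g)    # self-attention

    # Inter-scale alignment: co-attention in both directions.
    tokens_l = f_local.reshape(c_local, -1).T             # (N_l, C_l)
    local_out = tokens_l + attention(tokens_l, f_global, f_global)
    global_out = f_global + attention(f_global, tokens_l, tokens_l)
    return local_out, global_out

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 8))   # 8 channels on an 8x8 grid
kernels = rng.standard_normal((4, 3, 3))    # one 3x3 kernel per local channel
local_tokens, global_tokens = ds_block(features, kernels, pool=4)
print(local_tokens.shape, global_tokens.shape)
```

With pool=4 the global stream attends over 4 tokens while the local stream keeps all 64 positions, which illustrates why running self-attention only on the downsampled stream keeps its cost low. The co-attention here works without projections only because both streams have the same channel width in this toy setup.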

Experimental results

Research questions

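RQ1: Can the dual-stream architecture effectively decouple local and global visual information in classification and dense prediction tasks?
RQ2: Do Intra-scale Propagation and Inter-scale Alignment improve cross-scale feature fusion over naive fusion methods?
RQ3: Does DS-FPN provide a measurable benefit over the standard FPN in object detection and instance segmentation?
RQ4: How does the ratio of local to global features affect performance?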

Key results

Method	Params (M)	FLOPs (G)	Throughput (images/s)	Top-1 (%)	Top-5 (%)
DS-Net-T (ours)	9.1	1.6	1199	78.1	-
DS-Net-T* (ours)	10.5	1.8	1034	79.0 (+6.8)	-
DS-Net-S (ours)	19.7	3.0	582	81.9	-
DS-Net-S* (ours)	23.0	3.5	510	82.3 (+2.4)	-
DS-Net-B (ours)	48.8	7.6	387	82.8	-
DS-Net-B* (ours)	49.3	8.4	335	83.1 (+1.3)	-

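DS-Net outperforms DeiT-Small by 2.4 percentage points in top-1 accuracy on ImageNet-1k.
DS-Net variants achieve competitive or state-of-the-art ImageNet classification results among Vision Transformers and CNNs.
On MSCOCO 2017, DS-Net-S* improves on ResNet-50 by 6.4 points APbbox with RetinaNet and by 6.1 points APbbox with Mask R-CNN.
DS-Net-S* reaches 40.2% AP in instance segmentation, exceeding ResNet-50 and Swin-T by 5.5 and 0.4 points, respectively.
DS-Net-T* and DS-Net-S* show additional gains over their non-aligned counterparts, demonstrating the benefit of Inter-scale Alignment.
DS-FPN, built with DS-Blocks, improves mAP over the standard FPN in both object detection and instance segmentation.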
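
The accuracy-versus-cost trade-off of the starred variants can be read off the ImageNet-1k table with a little arithmetic. The sketch below copies the table rows and assumes, based on the result above about Inter-scale Alignment, that a starred model is the aligned version of its unstarred counterpart. The helper name alignment_tradeoff is hypothetical.

```python
# Rows copied from the ImageNet-1k table:
# name -> (params in M, FLOPs in G, throughput in images/s, top-1 in %)
RESULTS = {
    "DS-Net-T": (9.1, 1.6, 1199, 78.1),
    "DS-Net-T*": (10.5, 1.8, 1034, 79.0),
    "DS-Net-S": (19.7, 3.0, 582, 81.9),
    "DS-Net-S*": (23.0, 3.5, 510, 82.3),
    "DS-Net-B": (48.8, 7.6, 387, 82.8),
    "DS-Net-B*": (49.3, 8.4, 335, 83.1),
}

def alignment_tradeoff(base, results=RESULTS):
    """Compare a base model with its starred variant (assumed: with alignment)."""
    p0, f0, t0, a0 = results[base]
    p1, f1, t1, a1 = results[base + "*"]
    return {
        "top1_gain": round(a1 - a0, 1),                       # percentage points
        "extra_params_m": round(p1 - p0, 1),                  # millions
        "extra_flops_g": round(f1 - f0, 1),                   # GFLOPs
        "throughput_drop_pct": round(100 * (1 - t1 / t0), 1),
    }

for name in ("DS-Net-T", "DS-Net-S", "DS-Net-B"):
    print(name, alignment_tradeoff(name))
```

On these numbers the top-1 gain shrinks as the model grows (0.9, 0.4, and 0.3 points), while the throughput cost stays between roughly 12% and 14%.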

This review was generated by AI and reviewed by a human editor.