QUICK REVIEW

[Paper Review] Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

Micah Goldblum, Hossein Souri | arXiv (Cornell University) | October 30, 2023

Advanced Neural Network Applications | Citations: 26

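One-Line Summary

This paper benchmarks a diverse set of pretrained backbones across classification, detection/segmentation, OOD generalization, and retrieval to guide backbone selection.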

ABSTRACT

Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weakness of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones

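Motivation and Objectives

Evaluate how a wide range of publicly available backbones perform across multiple CV tasks and settings.
Identify which backbones generalize best to both in-domain and out-of-domain data.
Provide practical guidance on backbone selection and suggestions for research directions.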

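Proposed Method

Assemble a diverse suite of pretrained backbones spanning supervised, self-supervised, vision-language, and generative paradigms.
Evaluate the backbones on classification, detection/segmentation, OOD generalization, and retrieval under multiple protocols (fine-tuning, linear probing, end-to-end training, frozen features).
Run apples-to-apples comparisons on publicly available checkpoints with reasonable hyperparameter sweeps.
Analyze performance correlations across tasks and settings to identify general-purpose backbones and task-specific strengths.
Report latency and memory usage so that efficiency is considered alongside accuracy.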
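
To make the linear-probing protocol concrete, the following is a minimal sketch in pure Python, not the paper's implementation. The `frozen_backbone` function is a hypothetical stand-in for a pretrained feature extractor, the data is synthetic, and all hyperparameters are assumptions chosen for illustration. The point it shows is that only the linear head is trained while the feature extractor is never updated.

```python
import math
import random

def frozen_backbone(x):
    """Hypothetical stand-in for a pretrained backbone: a fixed, non-trainable feature map."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on precomputed features with batch gradient descent."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, f):
    """Classify one feature vector with the trained linear head."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Synthetic two-class data (illustrative only, not from the paper).
rng = random.Random(0)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1 if p[0] + p[1] > 0 else 0 for p in points]

# Features are extracted once; the backbone itself receives no gradient updates.
features = [frozen_backbone(p) for p in points]
train_f, train_y = features[:150], labels[:150]
test_f, test_y = features[150:], labels[150:]

w, b = train_linear_probe(train_f, train_y)
test_accuracy = sum(predict(w, b, f) == y for f, y in zip(test_f, test_y)) / len(test_y)
print(f"held-out linear-probe accuracy: {test_accuracy:.2f}")
```

End-to-end fine-tuning would differ in that the parameters inside the backbone would also receive gradient updates, which this sketch deliberately omits.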

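Experimental Results

Research Questions

RQ1: Which pretrained backbones perform best overall across a broad range of CV tasks?
RQ2: When architecture and data scale are controlled, how do supervised, self-supervised, vision-language, and generative backbones differ?
RQ3: Is performance correlated across different downstream tasks, and does it transfer between them?
RQ4: What are the practical recommendations for backbone selection under constraints such as small models, limited budgets, or specific tasks?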
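
Answering a question like RQ1 requires combining scores that live on different scales (for example accuracy and mAP). One common approach is to standardize each task's scores into z-scores and average them per backbone. The sketch below assumes that approach for illustration and may not match the paper's exact procedure. The backbone names and all numbers are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical scores (higher is better); values are illustrative, not from the paper.
scores = {
    "backbone_a": {"classification": 80.0, "detection": 45.0, "retrieval": 60.0},
    "backbone_b": {"classification": 78.0, "detection": 48.0, "retrieval": 55.0},
    "backbone_c": {"classification": 70.0, "detection": 40.0, "retrieval": 50.0},
}

def average_z_scores(scores):
    """Standardize scores within each task, then average the z-scores per backbone."""
    tasks = sorted({task for per_task in scores.values() for task in per_task})
    z_by_backbone = {name: [] for name in scores}
    for task in tasks:
        values = [scores[name][task] for name in scores]
        mu, sigma = mean(values), pstdev(values)
        for name in scores:
            # A task on which every backbone ties carries no ranking information.
            z = 0.0 if sigma == 0 else (scores[name][task] - mu) / sigma
            z_by_backbone[name].append(z)
    return {name: mean(zs) for name, zs in z_by_backbone.items()}

avg_z = average_z_scores(scores)
ranking = sorted(avg_z, key=avg_z.get, reverse=True)
for name in ranking:
    print(f"{name}: {avg_z[name]:+.3f}")
```

Standardizing per task keeps a task with a wide numeric range from dominating the average, at the cost of making the result depend on which backbones are included in the comparison.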

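Key Findings

Supervised ConvNeXt-Base and SwinV2-Base, along with CLIP ViT-Base, frequently rank at the top across tasks and settings.
SSL backbones are highly competitive in apples-to-apples comparisons with comparable pretraining data, but supervised backbones trained on larger datasets still lead on many tasks.
ViTs benefit more than CNNs from end-to-end fine-tuning on dense prediction tasks, whereas CNNs do better under linear probing.
Performance is strongly correlated across tasks, suggesting that a general-purpose backbone can generalize well across domains; retrieval, however, correlates less strongly with classification performance.
The generative backbones, MAE and Stable Diffusion, underperform supervised/SSL backbones on most evaluated tasks (with caveats regarding Stable Diffusion and scale).
Small, efficient backbones (EfficientNet-B0, RegNetX-400MF, ResNet-18) exhibit a trade-off between efficiency and task performance, and older architectures are sometimes preferred for detection/segmentation.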
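
The cross-task correlation finding can be checked with a rank correlation over per-backbone scores. The sketch below implements Spearman's coefficient in pure Python as one reasonable choice; whether the paper uses this exact statistic is an assumption. The score lists are hypothetical and were constructed only to show a strongly correlated pair and a weakly correlated pair.

```python
import math
from statistics import mean

def ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical per-backbone scores, one entry per backbone (not from the paper).
classification = [80.0, 78.0, 70.0, 65.0, 60.0]
detection = [45.0, 48.0, 40.0, 38.0, 35.0]
retrieval = [50.0, 62.0, 55.0, 48.0, 58.0]

rho_cls_det = spearman(classification, detection)
rho_cls_ret = spearman(classification, retrieval)
print(f"classification vs detection: {rho_cls_det:.2f}")
print(f"classification vs retrieval: {rho_cls_ret:.2f}")
```

A rank-based statistic is used here because it compares orderings of backbones rather than raw values, so tasks measured with different metrics can be compared directly.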


This review was generated by AI and reviewed by a human editor.