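[Paper Summary] Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
The paper benchmarks a wide range of pretrained backbones (supervised, SSL, vision-language, and generative) on classification, detection/segmentation, OOD generalization, and retrieval tasks to help practitioners choose a backbone.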
Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones
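Motivation and Goals
- Evaluate a broad set of publicly available backbones across many CV tasks and settings.
- Determine which backbones generalize better on in-domain and cross-domain data.
- Give practitioners practical guidance on backbone selection and point researchers toward future directions.
Proposed Method
- Assemble a diverse collection of pretrained backbones spanning the supervised, self-supervised, vision-language, and generative paradigms.
- Evaluate the backbones on classification, detection/segmentation, OOD generalization, and retrieval under several protocols (fine-tuning, linear probing, end-to-end training, frozen features).
- Compare publicly available checkpoints under matched conditions, with a moderate hyperparameter search.
- Analyze how performance correlates across tasks and settings to identify general-purpose backbones and task-specific strengths.
- Report latency and memory usage so that efficiency is weighed alongside accuracy.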
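The cross-task correlation analysis listed above can be illustrated with a rank correlation between per-backbone scores on two tasks. The following is a minimal sketch using only the Python standard library; it assumes Spearman correlation as the measure, and the task names and all scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, made 1-based
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores, one entry per backbone (NOT results from the paper).
scores = {
    "classification": [81.0, 79.5, 83.2, 76.4],
    "detection": [44.1, 42.0, 46.3, 40.2],
    "retrieval": [60.2, 63.5, 61.0, 58.8],
}

rho_cls_det = spearman(scores["classification"], scores["detection"])
rho_cls_ret = spearman(scores["classification"], scores["retrieval"])
print(f"classification vs detection: {rho_cls_det:.2f}")
print(f"classification vs retrieval: {rho_cls_ret:.2f}")
```

With these placeholder numbers, the backbones are ordered identically on classification and detection, while the retrieval ordering differs, so the second correlation comes out lower. Rank correlation is used here because it depends only on the ordering of backbones, not on the scale of each task's metric.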
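Experimental Results
Research Questions
- RQ1: Which pretrained backbones perform best overall across a broad suite of CV tasks?
- RQ2: After controlling for architecture and data scale, how do supervised, self-supervised, vision-language, and generative backbones differ in performance?
- RQ3: Are performance and transferability correlated across different downstream tasks?
- RQ4: What are the practical recommendations for backbone selection under different constraints (for example small models, limited budgets, or specific tasks)?
Key Findings
- Supervised ConvNeXt-Base and SwinV2-Base, along with CLIP ViT-Base, generally rank near the top across tasks and settings.
- SSL backbones are highly competitive in apples-to-apples comparisons with comparable pretraining data, but supervised backbones trained on larger datasets still lead on many tasks.
- ViTs benefit more than CNNs from end-to-end fine-tuning on dense prediction tasks, whereas CNNs perform better under linear probing.
- Performance is highly correlated across tasks, suggesting that a general-purpose backbone can generalize well across domains; retrieval performance, however, correlates less strongly with classification performance.
- Generative backbones such as MAE and Stable Diffusion trail supervised and SSL backbones on most of the evaluated tasks (with the caveat that conclusions about Stable Diffusion and about scale should be extrapolated cautiously).
- Small, efficient backbones (EfficientNet-B0, RegNetX-400MF, ResNet-18) show that efficiency often comes at the cost of task performance, and for detection/segmentation some tasks favor older architectures.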
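To make the efficiency trade-off concrete, the sketch below picks a backbone under a parameter budget by averaging per-task z-scores. The aggregation rule, the budget criterion, the backbone names, and every number are assumptions chosen for illustration; none of them are taken from the paper's results.

```python
from statistics import mean, pstdev

# Hypothetical candidates: parameter count (millions) and per-task scores.
# All names and numbers are placeholders, NOT results from the paper.
candidates = {
    "backbone_a": {"params_m": 88.0, "scores": {"cls": 84.0, "det": 46.0}},
    "backbone_b": {"params_m": 25.0, "scores": {"cls": 80.0, "det": 42.0}},
    "backbone_c": {"params_m": 5.0, "scores": {"cls": 76.0, "det": 38.0}},
}


def mean_z_scores(candidates):
    """Average each backbone's z-score over all tasks."""
    tasks = sorted(next(iter(candidates.values()))["scores"])
    totals = {name: 0.0 for name in candidates}
    for task in tasks:
        values = [c["scores"][task] for c in candidates.values()]
        mu, sigma = mean(values), pstdev(values)
        for name, c in candidates.items():
            # A task on which every backbone ties contributes zero.
            z = (c["scores"][task] - mu) / sigma if sigma > 0 else 0.0
            totals[name] += z
    return {name: total / len(tasks) for name, total in totals.items()}


def best_under_budget(candidates, max_params_m):
    """Return the highest-scoring backbone within the budget, or None."""
    z = mean_z_scores(candidates)
    eligible = [n for n, c in candidates.items() if c["params_m"] <= max_params_m]
    return max(eligible, key=lambda n: z[n]) if eligible else None


print(best_under_budget(candidates, max_params_m=30.0))
```

Z-scores put tasks with different metrics on a common scale before averaging, so no single task dominates the aggregate. In practice the budget could just as well be latency or memory instead of parameter count.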
This summary was generated by AI and reviewed by a human editor.