QUICK REVIEW

[Paper Review] Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train

Valeriu Codreanu, Damian Podareanu|arXiv (Cornell University)|Nov 12, 2017

Advanced Neural Network Applications | References: 23 | Cited by: 36

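One-Sentence Summary

The paper presents a scalable, efficient framework for large-minibatch SGD training of ResNet-50 on ImageNet-1K using up to 104,000 x86 cores, reporting a 28-minute training time at above 90% scaling efficiency. It also introduces the Collapsed Ensemble technique, which reaches 77.5% top-1 accuracy without modifying the model architecture.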

ABSTRACT

For the past 5 years, the ILSVRC competition and the ImageNet dataset have attracted a lot of interest from the Computer Vision community, allowing for state-of-the-art accuracy to grow tremendously. This should be credited to the use of deep artificial neural network designs. As these became more complex, the storage, bandwidth, and compute requirements increased. This means that with a non-distributed approach, even when using the most high-density server available, the training process may take weeks, making it prohibitive. Furthermore, as datasets grow, the representation learning potential of deep networks grows as well by using more complex models. This synchronicity triggers a sharp increase in the computational requirements and motivates us to explore the scaling behaviour on petaflop scale supercomputers. In this paper we will describe the challenges and novel solutions needed in order to train ResNet-50 in this large scale environment. We demonstrate above 90% scaling efficiency and a training time of 28 minutes using up to 104K x86 cores. This is supported by software tools from Intel's ecosystem. Moreover, we show that with regular 90-120 epoch train runs we can achieve a top-1 accuracy as high as 77% for the unmodified ResNet-50 topology. We also introduce the novel Collapsed Ensemble (CE) technique that allows us to obtain a 77.5% top-1 accuracy, similar to that of a ResNet-152, while training an unmodified ResNet-50 topology for the same fixed training budget. All ResNet-50 models as well as the scripts needed to replicate them will be posted shortly.

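Research Motivation and Objectives

Reduce the time needed to train deep residual networks on large datasets such as ImageNet-1K without sacrificing accuracy.
Address the generalization gap and convergence problems that typically affect large-minibatch SGD training.
Achieve high-performance, scalable training on petaflop-scale HPC systems using the Intel software stack and the x86 architecture.
Develop techniques that raise model accuracy within a fixed training budget, particularly for large-batch training.
Show that optimized training strategies alone can reach state-of-the-art accuracy with an unmodified ResNet-50 architecture.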

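Proposed Method

Use data parallelism across thousands of CPU cores with very large global batch sizes (up to 65,536).
Apply a modified batch normalization that accommodates large local and global batch sizes, which stabilizes training.
Use an aggressive, linearly scaled learning rate schedule combined with gradual warmup to maintain convergence.
Introduce the Collapsed Ensemble (CE) technique, which forms an ensemble by reusing snapshots from a single training run to improve generalization.
Adopt weight decay and learning rate policies inspired by cyclical learning rates and SGDR schedules to improve optimization stability.
Rely on Intel's distribution of Caffe and an HPC-optimized software stack to scale efficiently on Intel Knights Landing and Skylake systems.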
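The learning rate points above can be illustrated with a minimal sketch. It is not the authors' schedule: the function name `learning_rate`, the reference values `base_lr=0.1` per 256 images, the 5-epoch warmup, and the single cosine decay (used here as a stand-in for an SGDR-inspired policy) are all assumptions made for illustration.

```python
import math


def learning_rate(step, steps_per_epoch, total_epochs, global_batch,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Hypothetical schedule: linear scaling, gradual warmup, cosine decay.

    All default values are illustrative assumptions, not settings
    taken from the paper.
    """
    # Linear scaling rule: the peak rate grows in proportion to the batch size.
    peak_lr = base_lr * global_batch / base_batch
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch

    if step < warmup_steps:
        # Gradual warmup: ramp linearly from base_lr up to peak_lr.
        return base_lr + (peak_lr - base_lr) * step / warmup_steps

    # Cosine decay from peak_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative setting: global batch 8192, about 156 steps per epoch.
    for epoch in (0, 1, 5, 45, 90):
        lr = learning_rate(epoch * 156, 156, 90, 8192)
        print(f"epoch {epoch:3d}: lr = {lr:.4f}")
```

The warmup starts from the small-batch rate so that the first updates are not taken with the full scaled rate, which is the usual motivation for combining the two ideas.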

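Experimental Results

Research Questions

RQ1: Can large-minibatch SGD training reach high accuracy on ImageNet-1K without the typical generalization gap?
RQ2: How can ResNet-50 training time be cut to under 30 minutes while maintaining or improving top-1 accuracy?
RQ3: Which training techniques enable scaling efficiency above 90% on more than 100,000 x86 cores?
RQ4: Can 77.5% single-model accuracy be reached using only the ResNet-50 architecture and a fixed training budget?
RQ5: By how much does the Collapsed Ensemble technique outperform standard ensembles and snapshot methods on ImageNet-1K?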
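Scaling efficiency, as used in RQ3, is commonly computed as measured throughput divided by the throughput that perfect linear scaling would give. The sketch below assumes that throughput-based definition; this review does not state the exact formula the authors use, and the function name and sample numbers are hypothetical.

```python
def scaling_efficiency(ref_workers, ref_throughput, workers, throughput):
    """Fraction of ideal linear speedup retained when scaling out.

    Throughput can be in any consistent unit, e.g. images per second.
    """
    if ref_workers <= 0 or workers <= 0 or ref_throughput <= 0:
        raise ValueError("worker counts and reference throughput must be positive")
    # Ideal throughput if every added worker contributed as much as a reference one.
    ideal = ref_throughput * workers / ref_workers
    return throughput / ideal


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not measurements from the paper.
    eff = scaling_efficiency(ref_workers=1, ref_throughput=100.0,
                             workers=512, throughput=47000.0)
    print(f"scaling efficiency: {eff:.1%}")
```

The choice of reference configuration matters: efficiency relative to a single node is a stricter measure than efficiency relative to an already distributed baseline.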

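Key Findings

With the Collapsed Ensemble technique, a single ResNet-50 model reaches 77.5% top-1 accuracy, comparable to ResNet-152.
Training time drops to 28 minutes on up to 104,000 x86 cores, with scaling efficiency above 90%.
The proposed learning rate schedule and training techniques let the model converge to 76.5% top-1 accuracy after 75 epochs.
With a 5-model ensemble on ImageNet-1K, the Collapsed Ensemble approach outperforms the snapshot ensemble method of Huang et al.
The framework scales efficiently on both Intel Knights Landing and Skylake architectures, and training to 76.5% accuracy on the MareNostrum 4 system is projected to take under 50 minutes.
The models and training scripts are reported as publicly released through the IntelCaffe GitHub repository to support reproducibility.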
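The ensemble findings above rest on combining predictions from several snapshots of one training run. The sketch below shows generic snapshot-style prediction averaging only; it is not the Collapsed Ensemble procedure, whose details this review does not give. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(snapshot_logits):
    """Average class probabilities across snapshots.

    snapshot_logits holds one list of class scores per snapshot,
    all for the same input image. Returns (predicted_class, averaged_probs).
    """
    probs = [softmax(logits) for logits in snapshot_logits]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg


if __name__ == "__main__":
    # Made-up scores from three snapshots for a 3-class toy problem.
    logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0], [0.5, 2.5, 0.0]]
    label, avg = ensemble_predict(logits)
    print(label, [round(p, 3) for p in avg])
```

Averaging probabilities rather than raw scores keeps each snapshot's vote on the same scale, so one overconfident snapshot cannot dominate purely through larger logits.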


This review was generated by AI and reviewed by human editors.