QUICK REVIEW

[Paper Review] Once-for-All: Train One Network and Specialize it for Efficient Deployment

Han Cai, Chuang Gan | arXiv (Cornell University) | Aug 26, 2019
Advanced Neural Network Applications | References: 38 | Cited by: 679
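One-Sentence Summary

OFA trains a single flexible network that can be specialized into many sub-networks for different hardware without retraining, enabling efficient deployment across devices while preserving accuracy.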

ABSTRACT

We address the challenging problem of efficient inference across many devices and resource constraints, especially on edge devices. Conventional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally prohibitive (causing $CO_2$ emission as much as 5 cars' lifetime) thus unscalable. In this work, we propose to train a once-for-all (OFA) network that supports diverse architectural settings by decoupling training and search, to reduce the cost. We can quickly get a specialized sub-network by selecting from the OFA network without additional training. To efficiently train OFA networks, we also propose a novel progressive shrinking algorithm, a generalized pruning method that reduces the model size across many more dimensions than pruning (depth, width, kernel size, and resolution). It can obtain a surprisingly large number of sub-networks ($> 10^{19}$) that can fit different hardware platforms and latency constraints while maintaining the same level of accuracy as training independently. On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top1 accuracy improvement over MobileNetV3, or same accuracy but 1.5x faster than MobileNetV3, 2.6x faster than EfficientNet w.r.t measured latency) while reducing many orders of magnitude GPU hours and $CO_2$ emission. In particular, OFA achieves a new SOTA 80.0% ImageNet top-1 accuracy under the mobile setting ($<$600M MACs). OFA is the winning solution for the 3rd Low Power Computer Vision Challenge (LPCVC), DSP classification track and the 4th LPCVC, both classification track and detection track. Code and 50 pre-trained models (for many devices & many latency constraints) are released at https://github.com/mit-han-lab/once-for-all.

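Research Motivation and Goals

  • Motivated by the need for efficient deployment across diverse hardware while minimizing retraining and its cost.
  • Introduce a single Once-for-All network that supports many architectural configurations (depth, width, kernel size, resolution).
  • Propose a training scheme that yields accurate sub-networks without retraining for each deployment scenario.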

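Proposed Method

  • Define an elastic architecture space over depth, width, kernel size, and resolution; each combination of choices maps to a sub-network.
  • Train the large OFA network with progressive shrinking, so that it gradually supports smaller sub-networks while sharing weights.
  • Use knowledge distillation to stabilize training across the nested sub-networks.
  • In the specialization stage, build neural-network twins (an accuracy predictor and a latency lookup table) to guide an evolutionary search for the best sub-network under each hardware constraint.
  • Decouple training from search, reducing the cost of supporting N deployment scenarios from O(N) to O(1).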
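
The specialization stage described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The choice lists (depth 2/3/4, kernel size 3/5/7, width expansion ratio 3/4/6, five units) are assumed from the search space reported in the OFA paper, which this review does not spell out. `toy_accuracy` and `toy_latency` are hypothetical stand-ins for the trained accuracy predictor and the measured latency lookup table. The search uses mutation only, with no crossover, to keep the sketch short.

```python
import random

# Elastic choices (assumed from the OFA paper's mobile search space).
DEPTHS = [2, 3, 4]      # layers per unit
KERNELS = [3, 5, 7]     # kernel size per layer
WIDTHS = [3, 4, 6]      # width expansion ratio per layer
NUM_UNITS = 5


def random_layer(rng):
    """Pick a (kernel size, width ratio) pair for one layer."""
    return (rng.choice(KERNELS), rng.choice(WIDTHS))


def sample_config(rng):
    """Sample one sub-network: a depth per unit, then settings per layer."""
    return [[random_layer(rng) for _ in range(rng.choice(DEPTHS))]
            for _ in range(NUM_UNITS)]


def toy_latency(config):
    """Hypothetical latency table: a fixed cost per (kernel, width) layer."""
    return sum(0.1 * k * w for unit in config for (k, w) in unit)


def toy_accuracy(config):
    """Hypothetical accuracy predictor: bigger layers help, with diminishing returns."""
    return sum((k * w) ** 0.5 for unit in config for (k, w) in unit)


def mutate(config, rng, prob=0.2):
    """Randomly resample some layer settings and some unit depths."""
    child = []
    for unit in config:
        layers = [random_layer(rng) if rng.random() < prob else layer
                  for layer in unit]
        if rng.random() < prob:
            depth = rng.choice(DEPTHS)
            while len(layers) < depth:      # grow the unit if needed
                layers.append(random_layer(rng))
            layers = layers[:depth]         # or truncate it
        child.append(layers)
    return child


def evolutionary_search(latency_limit, population=50, generations=20, seed=0):
    """Return the highest-scoring config whose toy latency fits the limit."""
    rng = random.Random(seed)
    pop = []
    while len(pop) < population:            # rejection-sample a feasible start
        cand = sample_config(rng)
        if toy_latency(cand) <= latency_limit:
            pop.append(cand)
    for _ in range(generations):
        pop.sort(key=toy_accuracy, reverse=True)
        parents = pop[: population // 4]    # keep the top quarter unchanged
        children = []
        while len(children) < population - len(parents):
            child = mutate(rng.choice(parents), rng)
            if toy_latency(child) <= latency_limit:
                children.append(child)
        pop = parents + children
    return max(pop, key=toy_accuracy)


best = evolutionary_search(latency_limit=30.0)
print(len(best), round(toy_latency(best), 2), round(toy_accuracy(best), 2))
```

Because both scoring functions are cheap lookups rather than training runs, running the search again for a new latency limit or a new device only requires swapping the latency table. That is the sense in which the per-scenario cost stays constant once the OFA network is trained.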

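Experimental Results

Research Questions

  • RQ1: Can a single OFA network support a very large number of sub-networks (>10^19) while maintaining accuracy comparable to independently trained networks?
  • RQ2: Does progressive shrinking effectively mitigate interference between sub-networks during joint training?
  • RQ3: Can predictor-guided search (neural-network twins) efficiently identify the best sub-network for diverse hardware at negligible cost?
  • RQ4: How does OFA compare with state-of-the-art hardware-aware NAS methods in accuracy, latency, and energy on cloud and edge devices?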

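Key Findings

  • OFA achieves better accuracy-latency trade-offs than SOTA hardware-aware NAS methods across multiple hardware platforms.
  • In the ImageNet mobile setting (<600M MACs), OFA reaches 80.0% top-1 accuracy with 595M MACs, a new mobile SOTA.
  • Compared with NAS methods, OFA reduces training and design cost by orders of magnitude when supporting many deployment scenarios, and lowers CO2 emissions.
  • Progressive shrinking makes it possible to train a very large sub-network space (>10^19 architectures) efficiently while keeping accuracy comparable to independently trained sub-networks.
  • Specialized OFA sub-networks on different devices (CPU, GPU, FPGA, mobile) consistently outperform MobileNetV2, MnasNet, and other methods at similar latency, while requiring almost no additional training for new hardware.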
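
The ">10^19" figure can be checked with a short calculation. The sketch below assumes the search space reported in the OFA paper (five units, each with 2, 3, or 4 layers, and each layer choosing one of three kernel sizes and one of three width expansion ratios); these specifics are not stated in this review. Input resolution is left out, so the count covers architectures only.

```python
# Assumed elastic choices (from the OFA paper, not from this review).
DEPTHS = [2, 3, 4]
KERNELS = [3, 5, 7]
WIDTHS = [3, 4, 6]
NUM_UNITS = 5

# Each layer independently picks a kernel size and a width ratio.
per_layer = len(KERNELS) * len(WIDTHS)             # 3 * 3 = 9

# A unit with d layers has per_layer ** d settings; sum over allowed depths.
per_unit = sum(per_layer ** d for d in DEPTHS)     # 81 + 729 + 6561 = 7371

# Units are chosen independently of each other.
total = per_unit ** NUM_UNITS

print(f"sub-networks: {total:.2e}")
```

Under these assumptions the total comes out at roughly 2 × 10^19, consistent with the bound quoted in the abstract.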


This review was generated by AI and reviewed by human editors.