QUICK REVIEW

[Paper Review] Once-for-All: Train One Network and Specialize it for Efficient Deployment on Diverse Hardware Platforms

Han Cai, Chuang Gan|arXiv (Cornell University)|Aug 26, 2019

Advanced Neural Network Applications|Cited by 6

ONE-SENTENCE SUMMARY

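This paper proposes Once-for-All (OFA), a single network that can be specialized for many hardware platforms without additional training. By decoupling training from search and training with a progressive shrinking algorithm, OFA supports more than 10^19 sub-networks that maintain the same level of accuracy as independently trained models. It reaches a SOTA 80.0% ImageNet top-1 accuracy under the mobile setting (<600M MACs) while reducing GPU hours and CO2 emissions by orders of magnitude.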

ABSTRACT

We address the challenging problem of efficient inference across many devices and resource constraints, especially on edge devices. Conventional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally prohibitive (causing $CO_2$ emission as much as 5 cars' lifetime) thus unscalable. In this work, we propose to train a once-for-all (OFA) network that supports diverse architectural settings by decoupling training and search, to reduce the cost. We can quickly get a specialized sub-network by selecting from the OFA network without additional training. To efficiently train OFA networks, we also propose a novel progressive shrinking algorithm, a generalized pruning method that reduces the model size across many more dimensions than pruning (depth, width, kernel size, and resolution). It can obtain a surprisingly large number of sub-networks ($> 10^{19}$) that can fit different hardware platforms and latency constraints while maintaining the same level of accuracy as training independently. On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top1 accuracy improvement over MobileNetV3, or same accuracy but 1.5x faster than MobileNetV3, 2.6x faster than EfficientNet w.r.t measured latency) while reducing many orders of magnitude GPU hours and $CO_2$ emission. In particular, OFA achieves a new SOTA 80.0% ImageNet top-1 accuracy under the mobile setting ($<$600M MACs). OFA is the winning solution for the 3rd Low Power Computer Vision Challenge (LPCVC), DSP classification track and the 4th LPCVC, both classification track and detection track. Code and 50 pre-trained models (for many devices & many latency constraints) are released at this https URL.

MOTIVATION AND OBJECTIVES

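Address the high computational cost and environmental burden of training a separate neural network for every edge device and latency constraint.
Overcome the scalability limits of existing neural architecture search (NAS) and manual architecture design, so that models can be deployed efficiently across diverse hardware platforms.
Develop a method that maintains high accuracy while requiring far less training time and a far smaller carbon footprint than conventional approaches.
Specialize a single pre-trained network quickly for each hardware configuration, without retraining for each one.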

PROPOSED METHOD

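Propose a once-for-all (OFA) network that, through a single training process, supports many architectural settings: depth, width, kernel size, and input resolution.
Introduce the progressive shrinking algorithm, a generalized pruning method that reduces model size across many more dimensions than conventional pruning and makes training the OFA network efficient.
Decouple the training stage from the architecture search stage, so that a sub-network can be selected directly from the trained OFA network without additional training.
Train the OFA network with a progressive shrinking schedule that gradually extends training from the full network to smaller sub-networks in depth, width, kernel size, and input resolution.
Aim for every sub-network derived from the OFA network to maintain, without fine-tuning, the same level of accuracy as a model trained independently.
Use one large training run to learn a very large number of sub-networks implicitly, enabling fast deployment under varied hardware constraints.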
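The two ideas above, progressive shrinking and search decoupled from training, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The choice values, the number of units, the phase order (kernel size, then depth, then width, with resolution sampled throughout), and the `toy_cost` proxy are all assumptions made for illustration; no weights are trained here, only configurations are sampled.

```python
import random

# Assumed search space (illustrative values, not stated in this review).
FULL = {"kernel": 7, "depth": 4, "width": 6}
CHOICES = {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4, 3]}
RESOLUTIONS = list(range(128, 225, 4))  # assumed input sizes 128..224

# Assumed phase order: each phase makes one more dimension elastic,
# so training would start from the full network and add smaller sub-networks.
PHASES = [[], ["kernel"], ["kernel", "depth"], ["kernel", "depth", "width"]]


def pick(dim, elastic_dims, rng):
    """Sample a value if the dimension is elastic, else keep the full value."""
    return rng.choice(CHOICES[dim]) if dim in elastic_dims else FULL[dim]


def sample_subnet(elastic_dims, rng, num_units=5):
    """Sample one sub-network configuration for the given phase."""
    units = []
    for _ in range(num_units):
        depth = pick("depth", elastic_dims, rng)
        layers = [
            {"kernel": pick("kernel", elastic_dims, rng),
             "width": pick("width", elastic_dims, rng)}
            for _ in range(depth)
        ]
        units.append(layers)
    return {"resolution": rng.choice(RESOLUTIONS), "units": units}


def toy_cost(cfg):
    """Stand-in cost proxy (NOT real MACs or measured latency)."""
    per_layer = sum(l["kernel"] ** 2 * l["width"]
                    for unit in cfg["units"] for l in unit)
    return per_layer * (cfg["resolution"] / 224) ** 2


def pick_under_budget(budget, rng, trials=200):
    """Decoupled selection: sample configs and keep the largest one
    that fits the budget. No training happens in this step."""
    best = None
    for _ in range(trials):
        cfg = sample_subnet(PHASES[-1], rng)
        cost = toy_cost(cfg)
        if cost <= budget and (best is None or cost > best[0]):
            best = (cost, cfg)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    for phase, dims in enumerate(PHASES):
        cfg = sample_subnet(dims, rng)
        print(phase, dims, [len(u) for u in cfg["units"]], cfg["resolution"])
    cost, cfg = pick_under_budget(3000, rng)
    print("selected cost:", round(cost, 1))
```

In this sketch, "largest config within budget" stands in for "most accurate config within budget"; a real deployment would replace `toy_cost` with measured latency or MACs and use an accuracy estimate to rank candidates.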

EXPERIMENTAL RESULTS

Research Questions

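RQ1: Can a neural network be trained once and then specialized efficiently for many hardware platforms without fine-tuning?
RQ2: Can a generalized pruning method such as progressive shrinking sharply reduce training cost while keeping accuracy high across a very large number of sub-network configurations?
RQ3: Can OFA reach SOTA accuracy under a strict mobile inference budget (<600M MACs) while substantially reducing CO2 emissions?
RQ4: On edge devices, how does OFA compare with existing NAS methods and manually designed models in accuracy, latency, and efficiency?
RQ5: How far can the OFA framework scale to cover diverse hardware and latency constraints without hurting model accuracy?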

Main Findings

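Under the mobile setting (<600M MACs), OFA achieves 80.0% ImageNet top-1 accuracy, a new SOTA.
On ImageNet, OFA improves top-1 accuracy by up to 4.0% over MobileNetV3.
At the same accuracy, OFA is 1.5x faster than MobileNetV3 and 2.6x faster than EfficientNet in measured latency.
The OFA network contains more than 10^19 sub-networks that maintain the same level of accuracy as independently trained models, which lets it fit a wide range of hardware platforms and latency constraints.
Compared with conventional NAS or manual design, which train a specialized network from scratch for each case, OFA reduces GPU hours and CO2 emissions by many orders of magnitude.
OFA is the winning solution for the 3rd Low Power Computer Vision Challenge (LPCVC), DSP classification track, and for the 4th LPCVC, both the classification track and the detection track.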
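The "more than 10^19 sub-networks" figure can be checked with a short count. The sketch below assumes a search space of five units, each with a depth of 2, 3, or 4 layers, where every layer independently picks one of three kernel sizes and one of three width settings. These specifics are assumptions for illustration and are not stated in this review; input resolution is ignored in the count.

```python
def count_subnets(num_units=5, depths=(2, 3, 4),
                  kernels=(3, 5, 7), widths=(3, 4, 6)):
    """Count architectures in the assumed search space."""
    per_layer = len(kernels) * len(widths)          # choices per layer
    per_unit = sum(per_layer ** d for d in depths)  # sum over allowed depths
    return per_unit ** num_units                    # units chosen independently


if __name__ == "__main__":
    total = count_subnets()
    print(f"{total:.2e}", total > 10 ** 19)
```

Under these assumptions each unit has 9^2 + 9^3 + 9^4 = 7371 variants, so five units give 7371^5 combinations, which is above 10^19.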

This review was generated by AI and checked by a human editor.