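[Paper Summary] On Neural Architecture Search for Resource-Constrained Hardware Platforms
This paper proposes a hardware-software co-design NAS framework that jointly searches the neural architecture, the quantization scheme, and the FPGA hardware mapping so that designs meet strict resource constraints. Under those hardware limits it reaches higher accuracy than searching each dimension separately.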
In the recent past, the success of Neural Architecture Search (NAS) has enabled researchers to broadly explore the design space using learning-based methods. Apart from finding better neural network architectures, the idea of automation has also inspired efforts to improve their implementations on hardware. While some practices of hardware machine-learning automation have achieved remarkable performance, the traditional design concept is still followed: a network architecture is first structured with excellent test accuracy, and then compressed and optimized to fit into a target platform. Such a design flow can easily lead to inferior, locally optimal solutions. To address this problem, we propose a new framework to jointly explore the space of neural architecture, hardware implementation, and quantization. Our objective is to find a quantized architecture with the highest accuracy that is implementable on given hardware specifications. We employ FPGAs to implement and test our designs with limited look-up tables (LUTs) and required throughput. Compared to the separate design/searching methods, our framework has demonstrated much better performance under strict specifications and generated designs of higher accuracy by 18% to 68% in the task of classifying CIFAR10 images. With 30,000 LUTs, a light-weight design is found to achieve 82.98% accuracy and 1293 images/second throughput, whereas under the same constraints the traditional method fails to find any valid solution.
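Motivation and Goals
- Motivate the need for NAS that jointly optimizes architecture and hardware under resource constraints.
- Propose a framework that jointly searches the neural architecture, the quantization scheme, and the FPGA hardware mapping.
- Show that joint search reaches higher accuracy under hardware limits than the traditional separate approach.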
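Proposed Method
- Use a reinforcement-learning controller to explore the architecture and quantization spaces.
- Introduce a hardware-space search with dynamic-programming frontier pruning to satisfy the LUT and throughput constraints.
- Model quantization as unsigned fixed-point for activations and signed fixed-point for weights, with bit widths that are themselves search variables.
- Adopt a two-stage evaluation: a fast hardware feasibility check first, followed by training and validation only if the design is feasible.
- Implement an end-to-end CNN accelerator design on an FPGA (Altera Cyclone IV) clocked at 100 MHz.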
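The quantization model and the two-stage evaluation above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `quantize_fixed_point`, `HardwareEstimate`, `evaluate_candidate`, and `INFEASIBLE_REWARD` are hypothetical, the bit-width convention (integer and fractional bits, with the sign bit counted separately) is an assumption, and the penalty value is a placeholder because the summary does not state one. Only the 100 MHz clock and the LUT and throughput constraints come from the text.

```python
from dataclasses import dataclass
from typing import Callable

CLOCK_HZ = 100_000_000     # 100 MHz clock, as stated in the summary
INFEASIBLE_REWARD = -1.0   # hypothetical penalty; the summary gives no value


def quantize_fixed_point(x: float, int_bits: int, frac_bits: int, signed: bool) -> float:
    """Round x to the nearest fixed-point value and saturate at the range limits.

    Assumed convention: int_bits + frac_bits magnitude bits, plus one extra
    sign bit when signed=True (weights); signed=False models activations.
    """
    scale = 1 << frac_bits
    magnitude_bits = int_bits + frac_bits
    max_code = (1 << magnitude_bits) - 1
    min_code = -(1 << magnitude_bits) if signed else 0
    code = round(x * scale)                      # nearest integer code
    code = max(min_code, min(max_code, code))    # saturate instead of wrapping
    return code / scale


@dataclass
class HardwareEstimate:
    luts: int              # estimated look-up tables used by the design
    cycles_per_image: int  # estimated clock cycles needed per image

    @property
    def throughput(self) -> float:
        """Images per second at the assumed clock frequency."""
        return CLOCK_HZ / self.cycles_per_image


def evaluate_candidate(
    estimate: HardwareEstimate,
    max_luts: int,
    min_throughput: float,
    train_and_validate: Callable[[], float],
) -> float:
    """Stage 1: cheap feasibility check. Stage 2: train only if feasible."""
    feasible = estimate.luts <= max_luts and estimate.throughput >= min_throughput
    if not feasible:
        return INFEASIBLE_REWARD      # skip the expensive training step
    return train_and_validate()       # validation accuracy serves as the reward
```

The point of the ordering is cost: the feasibility check is a handful of comparisons, while training a candidate network is expensive, so infeasible candidates are rejected before any training is spent on them.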
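Experimental Results
Research Questions
- RQ1: Can jointly exploring architecture, quantization, and hardware mapping produce feasible designs under fixed hardware constraints, and do those designs outperform separate NAS and quantization searches?
- RQ2: How do quantization and hardware constraints interact to affect the achievable accuracy on CIFAR-10?
- RQ3: What framework-level benefits (such as a Pareto front of accuracy versus hardware metrics) does co-design NAS provide?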
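Key Findings
- In the CIFAR-10 experiments, joint architecture-quantization-hardware search reaches higher accuracy under resource constraints than separate search methods.
- Under the LUT and throughput constraints, a 30,000-LUT design achieves 82.98% accuracy at 1293 images/second.
- Several designs below 100,000 LUTs reach close to 90% accuracy (for example, 89.71% without quantization and, in some cases, 90.30% with quantization).
- Quantization-only search loses substantial accuracy when the throughput requirement is strict, whereas joint search recovers robust performance.
- The framework prunes the hardware-space exploration with a dynamic-programming frontier method, which makes the search scale across layers.
- The best joint designs show competitive accuracy compared with certain baseline architectures while requiring substantially fewer hardware resources.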
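The frontier-pruning idea mentioned in the findings can be illustrated with a small sketch. The cost model here is an assumption made for illustration, not taken from the paper: each layer offers a few hypothetical hardware options given as (LUTs, cycles per image), total LUTs add up across layers, and the slowest layer sets the throughput, as it would in a pipelined design. The function names are likewise hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (total LUTs, bottleneck cycles per image)


def prune(points: List[Point]) -> List[Point]:
    """Keep only non-dominated points (no other point is at least as good on both axes)."""
    frontier: List[Point] = []
    best_cycles = None
    # Sorted by LUTs ascending: a point survives only if it is strictly faster
    # than every cheaper-or-equal point already kept.
    for luts, cycles in sorted(set(points)):
        if best_cycles is None or cycles < best_cycles:
            frontier.append((luts, cycles))
            best_cycles = cycles
    return frontier


def hardware_frontier(layer_options: List[List[Point]]) -> List[Point]:
    """Dynamic programming over layers, pruning dominated partial designs."""
    frontier: List[Point] = [(0, 0)]  # empty design: no LUTs, no cycles
    for options in layer_options:
        combined = [
            (luts_so_far + luts, max(cycles_so_far, cycles))
            for luts_so_far, cycles_so_far in frontier
            for luts, cycles in options
        ]
        frontier = prune(combined)
    return frontier


def is_feasible(frontier: List[Point], max_luts: int, max_cycles: int) -> bool:
    """A design exists if any frontier point fits both budgets."""
    return any(l <= max_luts and c <= max_cycles for l, c in frontier)
```

Pruning is safe under this cost model because both the sum and the maximum are monotone: a partial design that is dominated after some layer stays dominated however the remaining layers are configured. The frontier therefore stays small instead of growing with the product of the per-layer option counts.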
This summary was generated by AI and reviewed by a human editor.