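[Paper Summary] Latency-Aware Differentiable Neural Architecture Search
This paper proposes a latency-aware differentiable neural architecture search method (LA-DARTS) that integrates a learnable latency prediction module (LPM) into the DARTS framework to jointly optimize accuracy and inference latency. Trained as a multi-layer regressor on 100K sampled architectures, the LPM predicts latency with a relative error below 10%. With it, the search reduces latency by about 20% while preserving accuracy on CIFAR-10 and ImageNet, and the approach applies to both GPU and CPU platforms.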
Differentiable neural architecture search methods have become popular in recent years, mainly due to their low search costs and flexibility in designing the search space. However, these methods have difficulty optimizing the network, so the searched network is often unfriendly to hardware. This paper addresses the problem by adding a differentiable latency loss term to the optimization, so that the search process can trade off accuracy against latency through a balancing coefficient. The core of latency prediction is to encode each network architecture and feed it into a multi-layer regressor; the training data can be collected easily by randomly sampling a number of architectures and evaluating them on the hardware. We evaluate our approach on NVIDIA Tesla-P100 GPUs. With 100K sampled architectures (requiring a few hours), the latency prediction module reaches a relative error of lower than 10%. Equipped with this module, the search method can reduce latency by 20% while preserving accuracy. Our approach can also be transplanted to a wide range of hardware platforms with very little effort, or used to optimize other non-differentiable factors such as power consumption.
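Research Motivation and Objectives
- Address the tendency of differentiable NAS methods to produce hardware-inefficient models that run slowly at inference time.
- Enable end-to-end differentiable joint optimization of accuracy and latency in a complex search space such as DARTS.
- Develop a hardware-adaptive latency prediction module (LPM) that can be moved to a different device with very little additional effort.
- Show that the method achieves a substantial latency reduction on standard benchmarks without sacrificing accuracy.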
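Proposed Method
- Train a differentiable latency prediction module (LPM), implemented as a multi-layer neural network, to predict the inference latency of a given architecture.
- Train the LPM on a dataset of 100K architectures randomly sampled from the DARTS search space, with ground-truth latency measured on the target hardware (e.g., NVIDIA Tesla-P100).
- Encode each architecture as a fixed-length vector of architecture parameters, which serves as the LPM input.
- Integrate the LPM into the DARTS loss function through a balancing coefficient λ, so that accuracy and latency are optimized jointly.
- Run the search in a gradient-based differentiable architecture search framework whose loss contains both an accuracy term and a predicted-latency term.
- Port the method to CPU by retraining the LPM on CPU latency data, which enables device-specific architecture search.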
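The joint objective described in the list above can be read as: total loss = accuracy loss + λ × predicted latency. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The encoding length, hidden width, random (untrained) weights, and the function names `lpm_forward`, `lpm_grad`, `total_loss`, and `total_grad` are all assumptions made for illustration; in a real search the LPM would first be fitted on measured latencies and the accuracy gradient would come from the DARTS supernet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the summary does not give the encoding length or layer widths.
ENC_DIM, HIDDEN = 16, 32

# Toy LPM: a two-layer regressor with random, untrained weights (illustration only).
W1 = rng.normal(scale=0.1, size=(ENC_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=HIDDEN)
b2 = 0.0


def lpm_forward(alpha):
    """Predict a scalar latency from a fixed-length architecture encoding."""
    hidden = np.maximum(alpha @ W1 + b1, 0.0)  # ReLU hidden layer
    return float(hidden @ W2 + b2)


def lpm_grad(alpha):
    """Gradient of the predicted latency w.r.t. the encoding (manual backprop)."""
    pre_activation = alpha @ W1 + b1
    relu_mask = (pre_activation > 0).astype(alpha.dtype)
    return W1 @ (W2 * relu_mask)


def total_loss(accuracy_loss, alpha, lam):
    """Accuracy loss plus lam times the predicted latency."""
    return accuracy_loss + lam * lpm_forward(alpha)


def total_grad(accuracy_grad, alpha, lam):
    """Gradient of the joint objective w.r.t. the architecture encoding."""
    return accuracy_grad + lam * lpm_grad(alpha)


# Usage: one gradient step on the encoding with a placeholder accuracy gradient.
alpha = rng.normal(size=ENC_DIM)
placeholder_acc_grad = np.zeros(ENC_DIM)  # stands in for the supernet's gradient
step = 0.1 * total_grad(placeholder_acc_grad, alpha, lam=0.5)
alpha_next = alpha - step
print(lpm_forward(alpha), lpm_forward(alpha_next))
```

Because the predictor is itself a small differentiable network, the latency term contributes a gradient on the architecture parameters in the same way the accuracy term does, which is what lets λ steer the trade-off during search.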
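Experimental Results
Research Questions
- RQ1: Can a differentiable latency prediction module be trained effectively to predict inference latency in a complex, non-chain-structured search space?
- RQ2: Can jointly optimizing accuracy and latency through a differentiable loss improve hardware efficiency without lowering accuracy?
- RQ3: How well does the LPM transfer across hardware platforms such as GPU and CPU?
- RQ4: How much can the method reduce latency while keeping competitive accuracy on standard benchmarks?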
Key Findings
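- The LPM predicts latency with low relative error on both GPU and CPU: under 10% on GPU according to the abstract, and 5.32% on CPU, where the absolute error is 8.27 ms.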
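- On CIFAR-10, LA-DARTS reduces latency by 19% relative to the original DARTS while keeping comparable accuracy (2.57% test error).
- On ImageNet, LA-DARTS reaches a 25.1% top-1 error rate, with CPU latency about 30% lower than the baseline (114.1 ms vs. 164.1 ms).
- Architectures found for GPU perform poorly on CPU: the rank agreement between GPU and CPU latency is only 69%, which highlights the need for hardware-specific search.
- LA-PC-DARTS-B reduces CPU latency on ImageNet by 30% with no drop in accuracy, demonstrating strong hardware-aware optimization.
- The Kendall-τ correlation between predicted and measured latency is 0.83 on GPU and 0.75 on CPU, indicating that the predictions are reliable enough to guide architecture search.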
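The figures in the list above rest on three simple quantities: relative latency reduction, mean relative prediction error, and Kendall-τ rank correlation. The sketch below shows one way to compute each in plain Python. The function names are hypothetical, the τ shown is the tau-a variant (the summary does not say which variant the paper uses), and the short lists in the usage lines are toy values, not data from the paper.

```python
from itertools import combinations


def latency_reduction(baseline_ms, new_ms):
    """Fraction by which latency drops relative to the baseline."""
    return (baseline_ms - new_ms) / baseline_ms


def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over all samples."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Usage: the ImageNet CPU latencies quoted above, then toy values for the other two.
print(latency_reduction(164.1, 114.1))                      # roughly 0.30
print(mean_relative_error([110.0, 90.0], [100.0, 100.0]))   # toy predictions
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))              # toy rankings
```

For architecture search, the rank correlation matters as much as the absolute error: a predictor that orders candidates correctly can still guide the search even if its millisecond estimates are somewhat off.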
This summary was generated by AI and reviewed by a human editor.