QUICK REVIEW

[Paper Review] Rocket Launching: A Universal and Efficient Framework for Training Well-performing Light Net

Guorui Zhou, Ying Fan|arXiv (Cornell University)|Aug 14, 2017

Machine Learning and Data Classification | Cited by 40

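One-Sentence Summary

This paper proposes Rocket Launching, a general training framework in which a complex booster network continuously guides a light net through a hint loss during training, so that the light net reaches accuracy previously achievable only with more complex models while keeping inference latency low. The method improves generalization and inference efficiency, and outperforms existing distillation and compression techniques on both benchmark and industrial datasets.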

ABSTRACT

Models applied on real time response task, like click-through rate (CTR) prediction model, require high accuracy and rigorous response time. Therefore, top-performing deep models of high depth and complexity are not well suited for these applications with the limitations on the inference time. In order to further improve the neural networks' performance given the time and computational limitations, we propose an approach that exploits a cumbersome net to help train the lightweight net for prediction. We dub the whole process rocket launching, where the cumbersome booster net is used to guide the learning of the target light net throughout the whole training process. We analyze different loss functions aiming at pushing the light net to behave similarly to the booster net, and adopt the loss with best performance in our experiments. We use one technique called gradient block to improve the performance of the light net and booster net further. Experiments on benchmark datasets and real-life industrial advertisement data present that our light model can get performance only previously achievable with more complex models.

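Motivation and Goals

Address the challenge of deploying high-accuracy deep neural networks in real-time industrial applications under strict latency constraints.
Overcome the limitations of existing knowledge distillation and model compression methods by drawing a supervision signal from a complex booster network throughout training.
Develop a general, architecture-agnostic framework that improves light net performance without increasing inference time.
Improve the generalization and robustness of small networks by exploiting the hierarchical feature representations of a deeper, more complex booster network.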

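Proposed Method

Jointly train the light net and a deeper, more complex booster net on the same task, sharing lower-layer weights so that low-level features transfer between them.
Introduce a hint loss that pushes the light net's activations to match those of the booster net, transferring knowledge during training itself.
Apply gradient block so that the hint loss does not back-propagate into the booster net, preserving the booster's ability to optimize against the ground-truth labels.
Use a shared embedding or feature-extraction backbone between the light net and the booster net to keep low-level representation learning consistent.
Optimize the whole system with a standard deep learning optimizer (such as Adam), together with learning-rate scheduling and regularization (such as Dropout) to prevent overfitting.
At inference time, deploy only the trained light net, keeping latency low while approaching the performance of the full booster net.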
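The joint objective and the gradient block described above can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar weights `w_s`, `w_l`, `w_b`, the function `rocket_grads`, and the weight `lam` are hypothetical names, and the hint loss is assumed to be a squared error between the two nets' pre-softmax outputs. The gradient block is modeled by treating the booster output as a constant inside the hint term.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bce(target, prob):
    # Binary cross-entropy for a single example.
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def rocket_grads(x, y, w_s, w_l, w_b, lam=0.5, gradient_block=True):
    """Loss and hand-derived gradients for a toy scalar model (assumed setup).

    w_s: shared lower layer, w_l: light head, w_b: booster head.
    """
    h = w_s * x  # shared lower-layer feature
    l = w_l * h  # light net logit
    z = w_b * h  # booster net logit
    p, q = sigmoid(l), sigmoid(z)

    # Two supervised losses plus the hint loss tying the logits together.
    loss = bce(y, p) + bce(y, q) + lam * (l - z) ** 2

    hint = 2.0 * lam * (l - z)  # d(hint loss)/dl; d(hint loss)/dz is -hint
    d_l = (p - y) + hint
    # With gradient block, the booster logit receives no hint gradient.
    d_z = (q - y) - (0.0 if gradient_block else hint)

    grads = {
        "w_l": d_l * h,
        "w_b": d_z * h,
        "w_s": (d_l * w_l + d_z * w_b) * x,
    }
    return loss, grads


loss, grads = rocket_grads(x=1.0, y=1.0, w_s=1.0, w_l=0.5, w_b=2.0)
print(loss, grads)
```

With `gradient_block=True`, the booster head is updated only by its own cross-entropy term, while the light head still receives both its supervised gradient and the hint gradient. Setting `gradient_block=False` shows how the hint term would otherwise pull the booster toward the light net.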

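Experimental Results

Research Questions

RQ1: Can a light neural network trained under continuous supervision from a booster network match the performance of deeper, more complex models?
RQ2: How does the choice of hint loss function affect knowledge-transfer efficiency and final model accuracy?
RQ3: To what extent can gradient block improve the booster network's performance without harming knowledge transfer?
RQ4: Does the Rocket Launching framework apply broadly across network architectures and datasets, including industrial-scale advertising data?
RQ5: Does combining Rocket Launching with other compression or distillation techniques bring further gains?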

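Key Findings

On SVHN, Rocket Launching is reported to reduce error by 1.29% relative to the baseline, with test error falling from 3.58% to 2.20% (these two endpoints differ by 1.38 percentage points, so the stated figures do not fully agree).
On CIFAR-100, the method lowers test error from the baseline's 43.7% to 33.0%, reported as a 10.4% improvement (the endpoints differ by 10.7 percentage points), and outperforms other distillation methods.
On an industrial-scale advertising prediction task, the light net improves GAUC by 0.003, i.e. 0.3 percentage points (from 0.632 to 0.635), at the same inference latency as the baseline.
The booster net alone achieves the best offline metric (GAUC 0.637), but at 23.2 ms per inference it cannot meet online serving requirements.
Combining Rocket Launching with knowledge distillation (rocket+KD) improves performance further, indicating that it is compatible with existing distillation techniques.
Gradient block prevents the booster net's performance from degrading, so it stays strong while still guiding the light net effectively.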
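The error figures quoted in the findings can be cross-checked with a few lines of arithmetic. The sketch below only uses the before/after test errors stated above; the helper name `reductions` is hypothetical. It separates the absolute drop (in percentage points) from the relative drop (as a share of the baseline error), since the two are easy to conflate.

```python
def reductions(before, after):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


# Test errors (in percent) as quoted in the findings above.
reported = [("SVHN", 3.58, 2.20), ("CIFAR-100", 43.7, 33.0)]

for name, before, after in reported:
    absolute, relative = reductions(before, after)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1f}% relative")
```

Neither derived value matches the 1.29% and 10.4% figures quoted alongside these endpoints, so the numbers are worth verifying against the paper's tables before citing them.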

This review was generated by AI and checked by a human editor.