QUICK REVIEW

[Paper Review] Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

Xiaodong Liu, Pengcheng He | arXiv (Cornell University) | Apr 20, 2019

Topic: Topic Modeling | References: 20 | Cited by: 161

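One-Sentence Summary

The paper applies knowledge distillation to the Multi-Task Deep Neural Network (MT-DNN), transferring the knowledge of an ensemble into a single model and reaching a state-of-the-art single-model score on GLUE.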

ABSTRACT

This paper explores the use of knowledge distillation to improve a Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019) for learning text representations across multiple natural language understanding tasks. Although ensemble learning can improve model performance, serving an ensemble of large DNNs such as MT-DNN can be prohibitively expensive. Here we apply the knowledge distillation method (Hinton et al., 2015) in the multi-task learning setting. For each task, we train an ensemble of different MT-DNNs (teacher) that outperforms any single model, and then train a single MT-DNN (student) via multi-task learning to distill knowledge from these ensemble teachers. We show that the distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9 GLUE tasks, pushing the GLUE benchmark (single model) to 83.7% (1.5% absolute improvement, based on the GLUE leaderboard at https://gluebenchmark.com/leaderboard as of April 1, 2019). The code and pre-trained models will be made publicly available at https://github.com/namisan/mt-dnn.

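Motivation and Objectives

Reduce the deployment cost of an MT-DNN ensemble while keeping NLU performance high.
Investigate whether knowledge distillation can transfer an ensemble's generalization ability to a single MT-DNN in the multi-task setting.
Demonstrate improvements on GLUE by distilling multiple task-specific teachers into one student.
Show that the distilled model is robust across tasks, including tasks that have no teacher.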

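Proposed Method

Train an ensemble of MT-DNNs (the teacher) for each selected task to generate soft targets.
Compute the soft targets by averaging the ensemble's predictions for each training sample.
Train a single MT-DNN (the student) via multi-task learning, using both the hard targets and the teachers' soft targets.
For tasks that have a teacher, combine the hard-target and soft-target losses using a weighting option.
After distillation, fine-tune the distilled MT-DNN on each GLUE task.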
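
The soft-target and loss steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `soft_weight` parameter, its default of 0.5, and the use of a plain softmax with no temperature are all assumptions made for illustration. It shows teacher predictions being averaged into soft targets, and a student loss that mixes soft and hard cross-entropy when a teacher exists and falls back to the hard loss otherwise.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ensemble_soft_targets(teacher_logits):
    """Average the class probabilities predicted by each teacher.

    teacher_logits has shape (n_teachers, batch, n_classes).
    Returns soft targets of shape (batch, n_classes).
    """
    return softmax(teacher_logits).mean(axis=0)


def cross_entropy(target_probs, student_logits):
    """Mean cross-entropy between target distributions and student predictions."""
    log_probs = np.log(softmax(student_logits) + 1e-12)
    return float(-(target_probs * log_probs).sum(axis=-1).mean())


def distillation_loss(student_logits, hard_labels, soft_targets=None, soft_weight=0.5):
    """Hypothetical combined loss; soft_weight is an illustrative assumption.

    Tasks without a teacher (soft_targets is None) use only the hard-label loss.
    """
    n_classes = student_logits.shape[-1]
    one_hot = np.eye(n_classes)[hard_labels]
    hard_loss = cross_entropy(one_hot, student_logits)
    if soft_targets is None:
        return hard_loss
    soft_loss = cross_entropy(soft_targets, student_logits)
    return soft_weight * soft_loss + (1.0 - soft_weight) * hard_loss
```

In a multi-task training loop, each batch would call `distillation_loss` with `soft_targets` set only for tasks that have a trained teacher ensemble; how the weighting is chosen per task is left open here.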

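Experimental Results

Research Questions

RQ1: In the multi-task setting, can knowledge distillation from task-specific MT-DNN ensembles improve a single MT-DNN?
RQ2: Does the distilled MT-DNN retain the gains of the ensemble teachers, and does it also benefit tasks that have no teacher?
RQ3: How does distillation affect GLUE performance compared with the BERT-based baseline and the vanilla MT-DNN baseline?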

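Key Findings

MT-DNN KD outperforms vanilla MT-DNN on 7 of the 9 GLUE tasks.
MT-DNN KD reaches a GLUE score of 83.7% (single model), a 1.5% absolute improvement over the previous state of the art and a 3.2% absolute improvement over BERT on the GLUE benchmark (as of April 1, 2019).
MT-DNN KD improves markedly over MT-DNN on the CoLA and RTE tasks.
Distillation transfers the generalization ability of the ensemble teachers to the student, which retains most of the ensemble's improvement.
Even on tasks without a teacher, MT-DNN KD shows a clear gain over MT-DNN and approaches ensemble performance on some tasks.
Ablation studies show that MT-DNN KD benefits both the tasks that have teachers and the tasks that do not, indicating effective knowledge transfer.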

This review was generated by AI and reviewed by human editors.