QUICK REVIEW

[Paper Digest] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Jesse Dodge, Gabriel Ilharco|arXiv (Cornell University)|Feb 15, 2020

Topic Modeling | References: 29 | Cited by: 216

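One-Sentence Summary

The paper shows that fine-tuning BERT exhibits substantial variance driven by the random seeds for weight initialization and data order, quantifies the gains from running multiple trials, proposes an early-stopping method, and releases 2,100 fine-tuning runs on GLUE tasks.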

ABSTRACT

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.

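Motivation and Objectives

Understand how random seeds affect the fine-tuning performance of pretrained language models.
Quantify the contributions of weight initialization and data order to performance variance.
Assess whether running multiple fine-tuning trials yields significant gains over a single trial.
Propose an early-stopping strategy that reduces compute cost while preserving performance.
Publicly release the fine-tuning data to encourage analysis of training dynamics.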

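Proposed Method

Fine-tune BERT-large on four GLUE tasks, varying only the random seeds that control the weight initialization (WI) of the final layer and the training data order (DO).
Use standard hyperparameters, train for three epochs on each task, and report validation performance for all seed combinations.
Analyze variance by using a grid of seeds to decouple WI from DO, and by computing the expected best performance as a function of the number of trials.
Use ANOVA to test whether the best and worst WI/DO seeds differ in mean performance.
Propose and evaluate a simple early-stopping algorithm that halts less promising trials partway through training to save compute.
Publicly release the full dataset of 2,100 fine-tuning runs, including training loss and validation performance.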
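
The expected best performance as a function of the number of trials can be estimated directly from a pool of observed validation scores. The following is a minimal sketch of one common estimator, assuming that k trials are drawn with replacement from the empirical distribution of scores. The function name `expected_max_performance` and the toy scores are illustrative and not taken from the paper, whose exact computation may differ.

```python
def expected_max_performance(scores, k):
    """Expected best score among k trials drawn with replacement from `scores`.

    Assumption: the observed scores stand in for the true score distribution
    (empirical CDF), so the i-th smallest score is the maximum of k draws with
    probability (i/n)^k - ((i-1)/n)^k.
    """
    if not scores or k < 1:
        raise ValueError("need at least one score and k >= 1")
    ordered = sorted(scores)
    n = len(ordered)
    total = 0.0
    for i, score in enumerate(ordered, start=1):
        weight = (i / n) ** k - ((i - 1) / n) ** k
        total += score * weight
    return total


# Toy validation accuracies (made up for illustration, not results from the paper).
toy_scores = [0.5, 0.7, 0.9]
curve = [expected_max_performance(toy_scores, k) for k in (1, 2, 5, 20)]
print(curve)
```

With k = 1 the estimate equals the mean of the pool, and as k grows it approaches the best observed score, which is why a reported "best" number is only meaningful alongside the number of trials behind it.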

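Experimental Results

Research Questions

RQ1: How much of the variance in fine-tuning performance is attributable to the random seeds that control weight initialization and data order?
RQ2: Do some WI and DO seeds consistently outperform others across tasks, and are there seeds that generalize across datasets?
RQ3: In terms of best achievable validation performance, what is the benefit of running multiple fine-tuning trials on GLUE tasks?
RQ4: Can an early-stopping strategy reduce compute with only a limited loss in final performance?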
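
RQ1 and RQ2 both come down to grouping runs by one seed while averaging over the other. The sketch below assumes a grid of runs stored as a dictionary keyed by `(wi_seed, do_seed)`, and computes a one-way ANOVA F statistic by hand across all seed groups. The data layout, the helper names, and the toy numbers are hypothetical, and the paper's own test compares the best and worst seeds rather than all groups. SciPy's `scipy.stats.f_oneway` computes the same statistic together with a p-value.

```python
from statistics import mean


def group_by_seed(grid, axis):
    """Group scores by WI seed (axis=0) or DO seed (axis=1).

    `grid` maps (wi_seed, do_seed) -> validation score (hypothetical layout).
    """
    groups = {}
    for seeds, score in grid.items():
        groups.setdefault(seeds[axis], []).append(score)
    return groups


def one_way_f_statistic(groups):
    """F = between-group mean square / within-group mean square.

    Assumes at least two groups and non-zero within-group variance.
    """
    samples = list(groups.values())
    all_scores = [s for g in samples for s in g]
    grand_mean = mean(all_scores)
    k, n = len(samples), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in samples)
    ss_within = sum((s - mean(g)) ** 2 for g in samples for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy 2x2 grid (made up): WI seed shifts the score a lot, DO seed only a little.
toy_grid = {(0, 0): 0.80, (0, 1): 0.82, (1, 0): 0.90, (1, 1): 0.92}
f_wi = one_way_f_statistic(group_by_seed(toy_grid, axis=0))
f_do = one_way_f_statistic(group_by_seed(toy_grid, axis=1))
print(f_wi, f_do)
```

In this toy grid the WI grouping yields a much larger F statistic than the DO grouping because the scores were constructed that way; the paper reports that on real runs the two factors contribute comparably.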

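Key Findings

On four GLUE tasks, running multiple fine-tuning trials with different seeds yields substantial gains over a single run.
Weight initialization and data order contribute comparably to performance variance, and some seeds perform consistently better across tasks.
Some weight-initialization seeds perform well across multiple tasks, suggesting the existence of globally favorable WI seeds.
Early stopping achieves similar or improved expected performance across different budgets while reducing compute cost.
With the same model and setup, the best performance reached over multiple trials substantially exceeds previously published results on several tasks.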
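
The early-stopping finding can be illustrated with a small sketch: start many trials, evaluate each partway through training, and finish only the most promising ones. This is a simplified reading of the idea, not the paper's published algorithm. The callables `train_partial` and `train_rest`, the parameter `keep`, and the toy scores are all hypothetical stand-ins for real fine-tuning code.

```python
def start_many_continue_some(seeds, train_partial, train_rest, keep):
    """Train every seed part of the way, then finish only the `keep` best.

    train_partial(seed) -> validation score after a fraction of training
    train_rest(seed)    -> validation score after training to completion
    Both callables are hypothetical placeholders for real training code.
    """
    partial = {seed: train_partial(seed) for seed in seeds}
    # Rank trials by their intermediate validation score, best first.
    survivors = sorted(partial, key=partial.get, reverse=True)[:keep]
    final = {seed: train_rest(seed) for seed in survivors}
    best_seed = max(final, key=final.get)
    return best_seed, final[best_seed]


# Toy lookup tables standing in for real runs (made-up numbers).
partial_scores = {0: 0.60, 1: 0.80, 2: 0.70, 3: 0.50}
final_scores = {0: 0.65, 1: 0.85, 2: 0.90, 3: 0.55}
finished = []


def toy_train_rest(seed):
    finished.append(seed)  # record which trials were run to completion
    return final_scores[seed]


best_seed, best_score = start_many_continue_some(
    seeds=[0, 1, 2, 3],
    train_partial=partial_scores.get,
    train_rest=toy_train_rest,
    keep=2,
)
print(best_seed, best_score, finished)
```

Only two of the four toy trials are trained to completion, so the compute saved depends on how early the intermediate evaluation happens and how many trials are kept.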


This digest was generated by AI and reviewed by a human editor.