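[Paper Walkthrough] On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
The paper shows that the instability of fine-tuning BERT-based models stems mainly from optimization difficulties (vanishing gradients) and variance in generalization, not from catastrophic forgetting or small datasets, and it proposes a simple but strong baseline that substantially improves stability.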
Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches. Code to reproduce our results is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.
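Research Motivation and Objectives
- Investigate why fine-tuning BERT-based models is unstable across random seeds.
- Assess whether the commonly cited hypotheses (catastrophic forgetting, small datasets) actually cause the instability.
- Decompose the instability into an optimization component and a generalization component.
- Propose a simple, robust fine-tuning baseline that improves both stability and performance.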
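Proposed Approach
- Analyze the fine-tuning stability of BERT, RoBERTa, and ALBERT on GLUE tasks.
- Inspect gradients to identify the optimization problems behind failed runs.
- Evaluate the effect of bias correction in Adam and of learning-rate warmup.
- Evaluate how increasing the number of training iterations (longer training) affects stability.
- Propose and validate a baseline fine-tuning setup that combines bias correction with extended training.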
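The link between bias correction and warmup can be made concrete with a small worked example. The sketch below is not the authors' code: it runs the standard Adam update on a single scalar with a constant gradient, and the function name `adam_first_steps` and the values of `beta1`, `beta2`, and `eps` are illustrative assumptions (common defaults), not settings reported in this walkthrough.

```python
import math


def adam_first_steps(grad, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-6,
                     steps=5, bias_correction=True):
    """Return the size of the first `steps` Adam updates for a constant gradient.

    Hypothetical helper for illustration; hyperparameter defaults are assumptions.
    """
    m, v = 0.0, 0.0  # first- and second-moment estimates start at zero
    updates = []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        if bias_correction:
            # Rescale the zero-initialized moments so they are unbiased early on.
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            m_hat, v_hat = m, v
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


if __name__ == "__main__":
    corrected = adam_first_steps(1.0, bias_correction=True)
    uncorrected = adam_first_steps(1.0, bias_correction=False)
    for t, (c, u) in enumerate(zip(corrected, uncorrected), start=1):
        print(f"step {t}: with correction {c:.3e}, without {u:.3e}, ratio {u / c:.2f}")
```

With these assumed defaults, the uncorrected update at the first step is roughly three times the nominal learning rate, because the second-moment estimate starts much closer to zero than the first-moment estimate. Bias correction removes that inflation, which is why it behaves like an implicit warmup during the first iterations.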
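Experimental Results
Research Questions
- RQ1: What causes the instability observed when fine-tuning BERT-based models?
- RQ2: Are catastrophic forgetting and small dataset size the main culprits behind the instability?
- RQ3: How do optimization dynamics (such as vanishing gradients) and generalization contribute to the instability?
- RQ4: Can a simple baseline improve fine-tuning stability across architectures and datasets?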
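Key Findings
- The instability is better explained by optimization difficulties (vanishing gradients) and by generalization variance later in training than by catastrophic forgetting or small datasets as such.
- Failed runs show vanishing gradients in the lower layers, whereas successful runs keep stronger gradients throughout training.
- Adam's bias correction, with its warmup-like effect, markedly improves stability, especially for BERT and ALBERT; RoBERTa also benefits, but to a lesser degree.
- Training for more iterations, until the training loss approaches zero, yields more consistent development-set performance.
- A simple baseline using AdamW with bias correction, a learning rate of 2e-5, and 20 epochs substantially reduces variability across seeds while achieving competitive mean and maximum scores on RTE, MRPC, and CoLA.
- These quantitative results hold for RoBERTa and ALBERT as well as for BERT.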
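To see what the baseline implies in terms of training length, the following sketch computes the total number of update steps for 20 epochs and a learning-rate schedule that peaks at 2e-5. Only the learning rate and the epoch count come from the findings above. The linear warmup followed by linear decay, the 10% warmup fraction, the dataset size, and the batch size are assumptions for illustration, and both function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs=20):
    """Number of optimizer steps when training for `epochs` full passes."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


def linear_schedule_lr(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Linear warmup to `peak_lr`, then linear decay to zero.

    The schedule shape and `warmup_fraction` are assumptions, not values
    stated in the summary above.
    """
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


if __name__ == "__main__":
    # Hypothetical small dataset: 2,500 examples with batch size 16.
    total = total_training_steps(num_examples=2500, batch_size=16)
    for step in (0, total // 20, total // 10, total // 2, total):
        print(f"step {step:5d}: lr = {linear_schedule_lr(step, total):.2e}")
```

On a small dataset, 20 epochs still amounts to only a few thousand updates, so the longer schedule is cheap relative to the stability it buys according to the findings above.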
This walkthrough was generated by AI and reviewed by a human editor.