QUICK REVIEW

[Paper Review] Revisiting Few-sample BERT Fine-tuning

Tianyi Zhang, Felix Wu | arXiv (Cornell University) | Jun 10, 2020

Advanced Neural Network Applications | References: 57 | Cited by: 55

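One-sentence summary

This paper analyzes the instability of few-sample BERT fine-tuning, traces it to biased gradient estimation in the BERT variant of Adam, top pre-trained layers that provide a harmful initialization, and a fixed, small number of training iterations, and then proposes remedies such as debiased Adam and layer re-initialization while re-evaluating previously proposed stabilization methods.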

ABSTRACT

This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.

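Motivation and objectives

Understand what causes the instability of few-sample BERT fine-tuning on small datasets.
Assess how the choice of optimizer, the initialization, and the number of training iterations affect stability and performance.
Propose practical remedies that reduce degenerate runs and improve convergence.
Re-evaluate existing stabilization methods under the corrected optimization setup.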

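Proposed methods

Compare standard BERT fine-tuning, which uses the biased BERT Adam (Adam without bias correction), against fine-tuning with debiased Adam.
Study re-initializing the top pre-trained layers (the pooler and the top Transformer blocks) when fine-tuning.
Evaluate how training beyond the commonly used three epochs affects stability and performance.
Evaluate how debiasing and re-initialization interact with existing stabilization methods (e.g., Mixout, weight decay, intermediate-task transfer).
Analyze the layer-wise parameter change during fine-tuning, measured as the L2 distance from initialization.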
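The difference between biased and debiased Adam can be illustrated with a minimal scalar sketch. This is not the authors' code: the function name `adam_step` and the default values for `lr`, `beta1`, `beta2`, and `eps` are assumptions chosen for illustration. The only difference between the two variants is whether the moment estimates are divided by the bias-correction terms.

```python
import math


def adam_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
              eps=1e-6, bias_correction=True):
    """One Adam update on a scalar parameter (hypothetical helper).

    t is the 1-based step count. With bias_correction=False the raw
    moment estimates are used, which mimics the biased variant.
    """
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    if bias_correction:
        # Standard Adam: rescale the zero-initialized moment estimates.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    else:
        # Biased variant: skip the correction entirely.
        m_hat, v_hat = m, v
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Compare the size of the very first update for the same gradient.
p_debiased, _, _ = adam_step(0.0, 1.0, 0.0, 0.0, t=1, bias_correction=True)
p_biased, _, _ = adam_step(0.0, 1.0, 0.0, 0.0, t=1, bias_correction=False)
ratio = p_biased / p_debiased
print(f"first-step size ratio (biased / debiased): {ratio:.2f}")
```

With these assumed defaults the first biased update is about 3.16 times larger than the debiased one, because the uncorrected ratio of moments is (1 - beta1) / sqrt(1 - beta2). The gap shrinks as the step count grows, so the effect is concentrated in the early iterations, which matter most when only a few hundred updates are taken.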

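Experimental results

Research questions

RQ1: What causes the instability of few-sample BERT fine-tuning on small datasets?
RQ2: Can debiasing Adam reduce degenerate runs and the variance across seeds?
RQ3: Can re-initializing the top BERT layers improve convergence and performance?
RQ4: How does training longer than three epochs affect stability and results?
RQ5: Are existing stabilization methods still useful under unbiased optimization?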
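Questions about variance across seeds and degenerate runs come down to summarizing one score per random seed. The following is a minimal sketch of such a summary, not the paper's evaluation code. The helper name `summarize_runs`, the threshold, and the scores are all hypothetical placeholders (they are not results from the paper), and a degenerate run is assumed here to mean one that scores at or below a chosen baseline.

```python
import statistics


def summarize_runs(scores, degenerate_threshold):
    """Summarize per-seed validation scores (hypothetical helper).

    A run counts as degenerate when its score is at or below
    degenerate_threshold, e.g. a majority-class baseline (assumption).
    Requires at least two scores for the sample standard deviation.
    """
    degenerate = sum(1 for s in scores if s <= degenerate_threshold)
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "degenerate_fraction": degenerate / len(scores),
    }


# Placeholder scores for four seeds; purely illustrative values.
example_scores = [0.70, 0.52, 0.71, 0.69]
summary = summarize_runs(example_scores, degenerate_threshold=0.55)
print(summary)
```

Reporting the degenerate fraction alongside the mean and standard deviation keeps a single collapsed run from being hidden inside an average.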

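Key findings

Debiased Adam markedly reduces variance and the number of degenerate runs on small datasets.
Re-initializing the top pre-trained layers speeds up convergence and often improves mean performance.
Training longer than three epochs improves stability and performance on several datasets.
Over-specialization of the top layers to pre-training can hinder fine-tuning; re-initialization mitigates this by providing a better starting point.
With debiased optimization, the relative gains of several previously proposed stabilization methods diminish.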
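Top-layer re-initialization and the layer-wise L2 distance from initialization can be sketched on a toy model. This is an illustration under stated assumptions rather than the authors' implementation: the model is a plain list of weight lists ordered bottom to top, the helper names `reinit_top_layers` and `l2_distance` are hypothetical, and the normal distribution with standard deviation 0.02 follows BERT's default initializer range.

```python
import math
import random


def reinit_top_layers(layers, num_reinit, std=0.02, seed=0):
    """Return a copy of layers with the top num_reinit layers re-sampled.

    layers is a list of weight lists ordered from bottom to top
    (a toy stand-in for Transformer blocks; hypothetical structure).
    """
    rng = random.Random(seed)
    keep = len(layers) - num_reinit
    result = []
    for i, weights in enumerate(layers):
        if i < keep:
            # Lower layers keep their pre-trained weights.
            result.append(list(weights))
        else:
            # Top layers are drawn fresh from N(0, std^2).
            result.append([rng.gauss(0.0, std) for _ in weights])
    return result


def l2_distance(a, b):
    """Euclidean distance between two equally sized weight lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy "pre-trained" model: 12 layers with 4 weights each (placeholder values).
pretrained = [[1.0] * 4 for _ in range(12)]
reinitialized = reinit_top_layers(pretrained, num_reinit=3)

# Layer-wise distance between the pre-trained and re-initialized weights.
distances = [l2_distance(p, r) for p, r in zip(pretrained, reinitialized)]
print(distances)
```

In this toy setup only the top three layers move away from the pre-trained weights. The same `l2_distance` helper could be applied between a layer's weights before and after fine-tuning to track how far each layer travels.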

This review was generated by AI and checked by a human editor.