QUICK REVIEW

[Paper Review] Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Cheolhyoung Lee, Kyunghyun Cho | arXiv (Cornell University) | Sep 25, 2019

Topic Modeling | 28 references | 100 citations

TL;DR

Mixout is a dropout-inspired regularizer that acts as an adaptive L2 penalty toward the pretrained parameters, improving stability and average dev scores when finetuning large pretrained language models on small datasets.

ABSTRACT

In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.

Motivation & Objective

Motivate the need for stabilizing finetuning of large pretrained language models on small downstream datasets.
Introduce Mixout as an adaptive regularizer that biases learning toward a pretrained parameter vector.
Provide theoretical justification showing Mixout acts as an L2 regularizer toward the pretrained model.
Empirically evaluate Mixout on MNIST-like settings and on BERT-LARGE finetuning for GLUE tasks to assess stability and performance.
Compare Mixout with dropout and other regularizers across various ablations to understand its benefits.

Proposed method

Define mixout as a random mixture of the current parameters with a pretrained target via a Bernoulli mask.
Show that Mixout corresponds to an adaptive L2 penalty toward the pretrained parameters, with strength controlled by the mask probability p.
Provide theoretical results (Theorem 1 and Corollary 1.1) bounding the expected loss and linking Mixout to an L2 regularization term.
Apply Mixout to pretrained models by replacing dropout with mixout on pretrained layers while keeping the final output layer unregularized.
Conduct empirical validations in a small-scale image classification setting (EMNIST/MNIST) and on real-world NLP finetuning (BERT-LARGE on GLUE) to demonstrate improved stability and dev scores.
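
The mixing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `mixout`, its arguments, and the toy vectors are hypothetical, and the shift-and-rescale form is an assumption chosen so that the perturbed parameters equal the current parameters in expectation and so that the operation reduces to ordinary inverted dropout when the target is the zero vector.

```python
import numpy as np


def mixout(w, u, p, rng):
    """Return a mixout-perturbed copy of w (hypothetical helper).

    w   : current parameters (array)
    u   : target parameters, e.g. the pretrained weights (same shape as w)
    p   : mix probability, the chance that a coordinate is swapped for u
    rng : a numpy.random.Generator
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    # replace[i] is True with probability p: that coordinate takes u[i].
    replace = rng.random(w.shape) < p
    mixed = np.where(replace, u, w)
    # Shift and rescale so that the expectation over masks equals w.
    return (mixed - p * u) / (1.0 - p)


rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])   # toy "finetuned" parameters
u = np.array([0.8, -1.5, 0.0])   # toy "pretrained" parameters
p = 0.7

samples = np.stack([mixout(w, u, p, rng) for _ in range(20000)])

# The empirical mean should be close to w.
mean = samples.mean(axis=0)

# The mean squared deviation from w should be close to
# p / (1 - p) * ||w - u||^2, which grows with p and with the
# distance from the target.
msd = ((samples - w) ** 2).sum(axis=1).mean()
expected_msd = p / (1.0 - p) * ((w - u) ** 2).sum()

# With a zero target, the same function behaves like inverted dropout:
# every entry is either 0 or w / (1 - p).
dropped = mixout(w, np.zeros_like(w), p, rng)
```

In a finetuning loop, a perturbation of this kind would be applied to the pretrained layers at each training step in place of dropout and switched off at evaluation time; the sketch omits that wiring.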

Experimental results

Research questions

RQ1: Does Mixout provide a theoretically justified adaptive regularization toward a pretrained parameter vector during finetuning?
RQ2: How does Mixout compare to standard dropout and weight decay in terms of finetuning stability and average dev performance on downstream tasks?
RQ3: Can Mixout reduce degenerate finetuning outcomes and improve robustness across random restarts when finetuning large pretrained models on small datasets?
RQ4: What is the impact of Mixout on both pretrained layers and non-pretrained output layers during finetuning?
RQ5: Is Mixout effective across different task types and data regimes (small-scale EMNIST/MNIST image classification vs. GLUE tasks)?

Key findings

Mixout acts as an adaptive L2 regularizer toward the pretrained parameters, with strength increasing with the mix probability p.
In MNIST-like experiments, Mixout keeps finetuned weights closer to the pretrained weights than dropout, validating the theoretical claim.
Finetuning BERT-LARGE with Mixout on small GLUE task subsets reduces degenerate, chance-level results and increases average dev scores across tasks.
Across ablations, Mixout improves stability and robustness to hyperparameters (p) compared to dropout, particularly for low-data regimes.
Combining Mixout with weight decay toward the pretrained weights yields further gains in average and best dev scores on several tasks.
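
The first finding can be made concrete with a short derivation. This is a sketch under stated assumptions rather than a restatement of the paper's theorem: w denotes the current parameters, u the pretrained parameters, M a diagonal mask whose entries are independently 1 with probability 1 - p, and the loss L is assumed differentiable and m-strongly convex. The symbols are chosen here for illustration.

```latex
\begin{align*}
\hat{w} &= \frac{1}{1-p}\Big((I - M)\,u + M\,w - p\,u\Big)
  \quad\Longrightarrow\quad
  \hat{w} - w = \frac{1}{1-p}\big(M - (1-p)\,I\big)(w - u), \\[4pt]
\mathbb{E}\big[\hat{w}\big] &= w, \qquad
\mathbb{E}\,\lVert \hat{w} - w \rVert^2
  = \frac{\operatorname{Var}(M_i)}{(1-p)^2}\,\lVert w - u \rVert^2
  = \frac{p}{1-p}\,\lVert w - u \rVert^2, \\[4pt]
\mathbb{E}\,L(\hat{w})
  &\ge L(w) + \nabla L(w)^{\top}\,\mathbb{E}\big[\hat{w} - w\big]
     + \frac{m}{2}\,\mathbb{E}\,\lVert \hat{w} - w \rVert^2
  = L(w) + \frac{m\,p}{2\,(1-p)}\,\lVert w - u \rVert^2 .
\end{align*}
```

Under these assumptions the penalty coefficient m p / (2(1 - p)) increases monotonically in p, and the penalty vanishes only when w equals u, which is consistent with reading Mixout as an L2 regularizer centered on the pretrained parameters.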

This review was created by AI and reviewed by human editors.