QUICK REVIEW

[Paper Review] What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning

Jaejun Lee, Raphael Tang|arXiv (Cornell University)|Nov 8, 2019

Topic Modeling | References: 25 | Citations: 34

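One-Sentence Summary

This paper investigates how many of the final layers of pretrained transformer models (BERT and RoBERTa) must be fine-tuned to reach nearly full performance, and shows that fine-tuning roughly the last quarter of the layers achieves 90% of the original quality on many tasks, with some exceptions.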

ABSTRACT

Pretrained transformer-based language models have achieved state of the art across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We examine two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability. We vary the number of final layers that are fine-tuned, then study the resulting change in task-specific effectiveness. We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality. Surprisingly, we also find that fine-tuning all layers does not always help.

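Motivation and Objectives

Determine how many of the final layers of BERT and RoBERTa must be fine-tuned to maintain high performance on standard NLP tasks.
Quantify the relationship between the number of fine-tuned layers and task performance across multiple datasets.
Identify the tasks for which the best results come from fine-tuning fewer layers, and those for which they come from fine-tuning all layers.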

Proposed Method

Fine-tune BERT and RoBERTa variants (BASE and LARGE) while keeping the embeddings frozen and progressively increasing the number of frozen early layers.
Evaluate on GLUE tasks: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE.
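Use the Adam optimizer with a batch size of 16; run a task-specific hyperparameter search over the learning rate in the range [1e-5, 5e-5].
Compare performance when none, some, or all of the non-output layers are frozen, and report the gain relative to full-model fine-tuning.
Analyze each layer's contribution by observing performance as more layers are unfrozen, identifying diminishing returns and possible overparameterization on SST-2.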
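
The freezing scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes parameter names follow the `encoder.layer.<i>.` and `embeddings` naming used by Hugging Face BERT/RoBERTa checkpoints, and it treats every parameter outside the embeddings and encoder layers as part of the always-trained output side, which is also an assumption. The helper names `is_trainable` and `freeze_early_layers` are hypothetical.

```python
import re

# Assumed naming convention: encoder layer parameters contain
# "encoder.layer.<i>." and embedding parameters contain "embeddings".
_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, num_layers, num_finetuned):
    """Return True if the parameter is updated when only the last
    `num_finetuned` of `num_layers` transformer layers are fine-tuned."""
    if not 0 <= num_finetuned <= num_layers:
        raise ValueError("num_finetuned must be between 0 and num_layers")
    match = _LAYER_RE.search(param_name)
    if match is not None:
        # Layers are indexed from 0; only the final `num_finetuned` are trained.
        return int(match.group(1)) >= num_layers - num_finetuned
    if "embeddings" in param_name:
        return False  # embeddings stay frozen in every configuration
    # Assumption: anything else (e.g. the classifier) is always trained.
    return True


def freeze_early_layers(named_parameters, num_layers, num_finetuned):
    """Set `requires_grad` on each (name, parameter) pair.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = is_trainable(name, num_layers, num_finetuned)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a PyTorch model, a call such as `freeze_early_layers(model.named_parameters(), num_layers=12, num_finetuned=3)` would leave only the last three encoder layers and the output head trainable, which corresponds to the "last quarter" setting for a 12-layer BASE model.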

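Experimental Results

Research Questions

RQ1: How many of the final transformer layers must be fine-tuned to reach a target fraction (e.g., 90%) of full-model performance?
RQ2: For the BASE and LARGE model sizes, does freezing the earlier layers yield stable performance across tasks?
RQ3: Are there tasks where performance improves or degrades when not all layers are fine-tuned?
RQ4: What shape does the performance gain take as a function of the number of unfrozen layers?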

Key Findings

Only about a fourth of the final layers need to be fine-tuned to achieve around 90% of full-model quality on most tasks.
On SST-2, not fine-tuning all layers can improve quality compared to full fine-tuning.
Fine-tuning all layers does not always help and can yield worse performance on some tasks.
Diminishing returns are observed as more layers are unfrozen; half the network often suffices to approach full performance, with larger models showing similar trends.
For the LARGE variants, freezing 12–16 layers can yield consistent gains on certain tasks, suggesting overparameterization in some cases.
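
The "90% of full-model quality" criterion and the diminishing-returns pattern can be made concrete with a short calculation. The sketch below assumes relative quality is simply the score with k fine-tuned layers divided by the score with all layers fine-tuned. The function names are hypothetical, and the scores are placeholders invented for illustration; they are not results from the paper.

```python
def relative_quality(scores_by_layers, full_layers):
    """Map each number of fine-tuned layers to score / full fine-tuning score."""
    full_score = scores_by_layers[full_layers]
    return {k: score / full_score for k, score in scores_by_layers.items()}


def min_layers_for_target(scores_by_layers, full_layers, target=0.9):
    """Smallest number of fine-tuned layers whose relative quality reaches
    `target`, or None if no configuration reaches it."""
    ratios = relative_quality(scores_by_layers, full_layers)
    for k in sorted(ratios):
        if ratios[k] >= target:
            return k
    return None


def marginal_gains(scores_by_layers):
    """Score gained per additional fine-tuned layer between measured points."""
    ks = sorted(scores_by_layers)
    return {
        (lo, hi): (scores_by_layers[hi] - scores_by_layers[lo]) / (hi - lo)
        for lo, hi in zip(ks, ks[1:])
    }


# Placeholder scores for a 12-layer model, keyed by the number of fine-tuned
# final layers. Invented for illustration only; NOT taken from the paper.
example_scores = {0: 60.0, 3: 82.0, 6: 87.0, 9: 88.5, 12: 88.0}

print(min_layers_for_target(example_scores, full_layers=12))  # 3
print(max(example_scores, key=example_scores.get))            # 9
print(marginal_gains(example_scores))
```

In this made-up example, three fine-tuned layers already reach the 90% threshold, the per-layer gain shrinks with each additional block of layers, and the best score occurs before all layers are unfrozen, which mirrors the qualitative pattern the findings describe.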


This review was generated by AI and checked by a human editor.