QUICK REVIEW

[Paper Review] Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning

Beliz Gunel, Jingfei Du|arXiv (Cornell University)|Nov 3, 2020

Topic Modeling | References: 60 | Cited by: 60

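One-Sentence Summary

This paper adds a supervised contrastive learning term to the standard fine-tuning objective for pre-trained language models. The combined objective improves few-shot GLUE performance, robustness to noisy training data, and generalization to related tasks, without requiring additional data or architectural changes.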

ABSTRACT

State-of-the-art natural language understanding classification models follow two-stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, our proposed SCL loss obtains significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in few-shot learning settings, without requiring specialized architecture, data augmentations, memory banks, or additional unsupervised data. Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data, and can generalize better to related tasks with limited labeled data.

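Motivation and Objectives

Improve the generalization and stability of fine-tuning beyond what cross-entropy loss alone provides.
Exploit the similarity between examples of the same class during fine-tuning, and contrast them with examples of other classes.
Develop a combined loss function that joins a supervised contrastive loss with the cross-entropy loss used for classification.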

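Proposed Method

Propose a joint loss L = (1 - λ) L_CE + λ L_SCL for multi-class classification.
L_CE is the standard cross-entropy loss on the model outputs.
L_SCL pulls examples of the same class together and pushes examples of different classes apart in the encoder space, using a temperature τ and L2-normalized representations.
The encoder Φ(x) outputs an L2-normalized representation from the final hidden layer (the CLS token for BERT-style models).
λ and τ are tuned per task; empirically, τ = 0.3 and λ = 0.9 are preferred across a range of settings.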
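
The joint objective above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cross_entropy`, `supervised_contrastive`, `joint_loss`) are hypothetical, embeddings are taken as ready-made lists of floats rather than encoder outputs, and averaging the contrastive term over the batch is a choice made here so that both terms sit on a comparable scale. The summary does not specify the reduction.

```python
import math


def l2_normalize(vec):
    # Scale a (non-zero) vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cross_entropy(logits, labels):
    # Mean cross-entropy over the batch, computed from raw logits.
    return sum(log_sum_exp(z) - z[y] for z, y in zip(logits, labels)) / len(labels)


def supervised_contrastive(embeddings, labels, tau=0.3):
    # For each anchor i, average the negative log-probability of each
    # same-class example j among all other examples in the batch.
    feats = [l2_normalize(e) for e in embeddings]
    n = len(feats)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        sims = [sum(a * b for a, b in zip(feats[i], feats[k])) / tau for k in range(n)]
        log_denom = log_sum_exp([sims[k] for k in range(n) if k != i])
        total -= sum(sims[j] - log_denom for j in positives) / len(positives)
    return total / n  # assumption: mean over the batch


def joint_loss(logits, embeddings, labels, lam=0.9, tau=0.3):
    # L = (1 - lambda) * L_CE + lambda * L_SCL
    ce = cross_entropy(logits, labels)
    scl = supervised_contrastive(embeddings, labels, tau)
    return (1.0 - lam) * ce + lam * scl


if __name__ == "__main__":
    # Toy batch: two classes, two examples each (values are illustrative only).
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    logits = [[2.0, 0.0], [1.5, 0.2], [0.1, 1.8], [0.0, 2.2]]
    labels = [0, 0, 1, 1]
    print(joint_loss(logits, embeddings, labels))
```

In a real fine-tuning loop the embeddings would come from the encoder's final-layer CLS representation and the loss would be computed with an autodiff framework. The pure-Python version is only meant to make the roles of λ and τ concrete: a smaller τ sharpens the contrast between same-class and different-class similarities, and λ sets how much weight the contrastive term receives relative to cross-entropy.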

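Experimental Results

Research Questions

RQ1: Does a supervised contrastive learning term improve fine-tuning of pre-trained language models when labeled data is scarce?
RQ2: Does combining L_SCL with cross-entropy make fine-tuning more robust to noisy training data?
RQ3: Does the proposed objective benefit both single-sentence and sentence-pair NLP tasks in GLUE?
RQ4: Does the method improve generalization to related tasks when labeled data is limited?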

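Key Findings

In few-shot settings, CE+SCL improves over CE on SST-2, QNLI, and MNLI, with gains of up to 10.7 points on QNLI at N=20.
With 20, 100, and 1,000 labeled examples, CE+SCL improves consistently over CE, for example by 3.4 points on MNLI and 2.2 points on SST-2 at 20 examples; the gains shrink as the amount of data grows.
CE+SCL is more robust to noise in the fine-tuning data: at higher noise levels it gains up to 7 points on MNLI (T=0.7) and 4.2 points on QNLI (T=0.9).
CE+SCL generalizes better to related tasks with limited labeled data, for example improving Amazon-2 by 2.9 points over CE alone, with lower variance in few-shot transfer.
On the full GLUE datasets, CE+SCL improves MRPC by 3.1 points and QNLI by 3.5 points, with an average gain of 1.2 points across six tasks; larger batch sizes amplify these gains.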

This review was generated by AI and checked by a human editor.