QUICK REVIEW

[Paper Review] When BERT Plays the Lottery, All Tickets Are Winning

Sai Prasanna, Anna Rogers, Anna Rumshisky | arXiv (Cornell University) | May 1, 2020

Topic Modeling | References: 56 | Cited by: 36

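One-Sentence Summary

This paper examines the lottery ticket hypothesis for fine-tuned BERT, showing that pruning can uncover "good" subnetworks whose performance is comparable to that of the full model, while many other subnetworks remain trainable even after pruning. It also finds that the good subnetworks are unstable and may not reflect any clear linguistic specialization.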

ABSTRACT

Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.

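Research Motivation and Objectives

Assess whether fine-tuned BERT contains trainable subnetworks (winning tickets) under magnitude pruning.
Compare magnitude pruning and structured pruning of BERT's self-attention heads and MLPs, using the GLUE tasks as the benchmark.
Determine whether the best subnetworks correspond to linguistically interpretable patterns, or whether they are task-specific and unstable.
Assess whether "bad" subnetworks can be retrained to reach strong performance.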

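Proposed Method

Fine-tune BERT-base (uncased) on 9 GLUE tasks.
Apply iterative magnitude pruning, removing the 10% lowest-magnitude weights in each round, until dev-set performance falls below 90% of the full model.
Perform structured pruning by masking attention heads and MLP modules according to importance scores obtained through backpropagation.
Measure the performance of the pruned subnetworks both directly after pruning and after resetting them to the pre-trained weights and fine-tuning again.
Compare the pruned subnetworks against random subnetworks of equal size and against baseline architectures.
Analyze how stable the "good" subnetworks are across random seeds, and examine the distribution of attention patterns among the surviving heads.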
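The iterative magnitude pruning step above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function names, the flat list of weights, and the `evaluate` callback are all hypothetical, and the sketch assumes that "10%" means 10% of the weights still surviving in each round.

```python
from typing import Callable, List


def magnitude_prune_step(weights: List[float], mask: List[bool],
                         fraction: float = 0.10) -> List[bool]:
    """Return a new mask with the lowest-magnitude `fraction` of surviving weights removed."""
    alive = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(alive) * fraction)
    # Sort surviving indices by absolute value, smallest first.
    to_prune = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = False
    return new_mask


def iterative_magnitude_pruning(
    weights: List[float],
    evaluate: Callable[[List[float], List[bool]], float],
    threshold: float = 0.90,
    fraction: float = 0.10,
) -> List[bool]:
    """Prune round by round while the score stays at or above `threshold` of the full model.

    `evaluate` is a hypothetical stand-in for a dev-set evaluation of the masked model.
    Returns the last mask that still met the threshold.
    """
    mask = [True] * len(weights)
    full_score = evaluate(weights, mask)
    while True:
        candidate = magnitude_prune_step(weights, mask, fraction)
        if candidate == mask:
            # Too few weights remain for another pruning round.
            break
        if evaluate(weights, candidate) < threshold * full_score:
            # This round would drop below the threshold, so keep the previous mask.
            break
        mask = candidate
    return mask
```

In a real setting `evaluate` would run the masked model on the dev set of the task; here it can be any function that maps weights and a mask to a score.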

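Experimental Results

Research Questions

RQ1: Can subnetworks (winning tickets) match the performance of the full model after pruning?
RQ2: How do magnitude pruning and structured pruning differ in how well they preserve performance and how much compression they achieve?
RQ3: Do the best subnetworks correspond to interpretable linguistic knowledge, or to task-specific heuristics?
RQ4: Are the identified good subnetworks stable across different random initializations during fine-tuning?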

Main Findings

Model	CoLA	SST-2	MRPC	QQP	STS-B	MNLI	QNLI	RTE	WNLI	Average
Majority class baseline	0.00	0.51	0.68	0.63	0.02	0.35	0.51	0.53	0.56	0.42
CBOW	0.46	0.79	0.75	0.75	0.70	0.57	0.62	0.71	0.56	0.61
BILSTM + GloVe	0.17	0.87	0.77	0.85	0.71	0.66	0.77	0.58	0.56	0.68
BILSTM + ELMO	0.44	0.91	0.70	0.88	0.70	0.68	0.71	0.53	0.56	0.68
‘Bad’ subnetwork (s-pruning)	0.40	0.85	0.67	0.81	0.60	0.80	0.76	0.58	0.53	0.67
‘Bad’ subnetwork (m-pruning)	0.24	0.81	0.67	0.77	0.08	0.61	0.60	0.49	0.49	0.51
Random init + random s-pruning	0.00	0.78	0.67	0.78	0.14	0.63	0.59	0.53	0.50	0.52
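The "bad" and random rows above contrast different ways of picking a subnetwork of a given size. The sketch below shows one way such subsets could be drawn from per-head importance scores. It is an illustration under stated assumptions, not the paper's procedure: the function name `select_subnetworks`, the `(layer, head)` keys, and the rule that a "bad" subnetwork keeps the lowest-scoring heads are all hypothetical.

```python
import random
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def select_subnetworks(importance: Dict[Head, float], keep: int,
                       seed: int = 0) -> Tuple[Set[Head], Set[Head], Set[Head]]:
    """Return (good, bad, random) head subsets, each containing `keep` heads."""
    if not 0 <= keep <= len(importance):
        raise ValueError("keep must be between 0 and the number of heads")
    # Rank heads from most to least important.
    ranked = sorted(importance, key=importance.get, reverse=True)
    good = set(ranked[:keep])
    # Assumption: the "bad" subnetwork keeps the least important heads.
    bad = set(ranked[len(ranked) - keep:])
    # Random subnetwork of the same size, reproducible via the seed.
    rng = random.Random(seed)
    rand = set(rng.sample(ranked, keep))
    return good, bad, rand
```

Keeping all three subsets the same size is what makes the comparison between them meaningful.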

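Good subnetworks (from both pruning methods) reach about 90% of full-model performance on the GLUE tasks.
Structured pruning generally yields greater compression than magnitude pruning, but both methods preserve comparable performance.
Even the worst subnetworks (under structured pruning) can be fine-tuned to strong performance, indicating that many of the pre-trained weights are broadly useful.
Good subnetworks are not stable across random seeds, and which heads are selected does not line up with any clear linguistic role of the individual attention heads.
Randomly sampled s-pruned subnetworks perform almost as well as the good subnetworks on several tasks, suggesting that the usefulness of many weights extends beyond interpretable linguistic patterns.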
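The stability finding above amounts to asking how much the pruning masks from different fine-tuning seeds agree. A minimal sketch of such a check follows, assuming each seed yields a binary survival mask over the same list of heads. Cohen's kappa is used here as one common chance-corrected agreement statistic; this is an assumption for illustration, and both function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa between two binary masks (1 = head survived, 0 = pruned)."""
    if len(a) != len(b) or not a:
        raise ValueError("masks must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance, given each mask's survival rate.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both masks are constant and identical, so agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(masks: List[Sequence[int]]) -> float:
    """Average kappa over all pairs of seeds; needs at least two masks."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two masks")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

A value near 1 would mean the same heads survive regardless of seed, while a value near 0 would mean the overlap is no better than chance.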

This review was generated by AI and checked by a human editor.