QUICK REVIEW

[Paper Review] ConvBERT: Improving BERT with Span-based Dynamic Convolution

Zihang Jiang, Weihao Yu | arXiv (Cornell University) | Aug 6, 2020

Topic: Topic Modeling | References: 69 | Cited by: 118

One-sentence summary

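ConvBERT replaces redundant attention heads with span-based dynamic convolution, forming a mixed attention block that is combined with a bottleneck structure and a grouped feed-forward module; it performs better on GLUE and SQuAD at a lower pre-training cost than BERT.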

ABSTRACT

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for generating the attention map from a global perspective, we observe some heads only need to learn local dependencies, which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while using less than 1/4 training cost. Code and pre-trained models will be released.

Motivation and objectives

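Reduce the redundancy in BERT's self-attention heads, motivated by the observation that some heads only need to learn local dependencies.
Introduce span-based dynamic convolution to capture local context efficiently.
Build ConvBERT from a mixed attention block, bottleneck attention, and a grouped feed-forward module to improve both efficiency and performance.
Evaluate ConvBERT on GLUE and SQuAD to demonstrate accuracy gains at a lower training cost.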

Proposed method

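Propose span-based dynamic convolution, which generates its convolution kernel from a local input span, conditioned on the query Q and the span-aware key K_s.
Combine self-attention and span-based dynamic convolution into a mixed attention block; both branches share the same Q but use different keys (the standard key for self-attention, the span-aware key K_s for the convolution).
Introduce a bottleneck structure that reduces the dimensionality of the self-attention path and the number of attention heads.
Apply a grouped linear operator in the feed-forward module to reduce parameters and computation.
Pre-train ConvBERT with an ELECTRA-style replaced token detection objective, then evaluate on GLUE and SQuAD.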
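
The first step of the method can be illustrated with a small sketch of a single span-based dynamic convolution head in plain Python. It is a simplified reading of the description above, not the authors' implementation: the names W_q, W_v, W_f and dw_kernel are hypothetical, the span-aware key K_s is computed with a depthwise convolution only (a pointwise mixing step is omitted), and positions outside the sequence are treated as zeros. The kernel at each position comes from the elementwise product of Q and K_s, is normalized with a softmax, and is then applied to the values in the local window.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W has shape (rows, cols); x has length cols.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def span_aware_key(X, dw_kernel):
    # Depthwise convolution over a centered span of k tokens (zero padding).
    # dw_kernel[j][c] weights channel c of the j-th token in the span.
    n, d, k = len(X), len(X[0]), len(dw_kernel)
    half = k // 2
    K_s = []
    for i in range(n):
        row = [0.0] * d
        for j in range(k):
            pos = i + j - half
            if 0 <= pos < n:
                for c in range(d):
                    row[c] += dw_kernel[j][c] * X[pos][c]
        K_s.append(row)
    return K_s

def span_dynamic_conv(X, W_q, W_v, dw_kernel, W_f):
    # X: n tokens of width d. W_q, W_v: d x d. dw_kernel, W_f: k x d.
    n, k = len(X), len(W_f)
    half = k // 2
    Q = [matvec(W_q, x) for x in X]
    V = [matvec(W_v, x) for x in X]
    K_s = span_aware_key(X, dw_kernel)
    out, kernels = [], []
    for i in range(n):
        # The kernel depends on the query and on the local span around i.
        gate = [q * ks for q, ks in zip(Q[i], K_s[i])]
        kernel = softmax(matvec(W_f, gate))  # k weights summing to 1
        kernels.append(kernel)
        row = [0.0] * len(V[i])
        for j, w in enumerate(kernel):
            pos = i + j - half
            if 0 <= pos < n:
                for c in range(len(row)):
                    row[c] += w * V[pos][c]
        out.append(row)
    return out, kernels

# Illustrative sizes only: 6 tokens, width 4, span of 3 tokens.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n, d, k = 6, 4, 3
X = rand_matrix(n, d)
out, kernels = span_dynamic_conv(
    X, rand_matrix(d, d), rand_matrix(d, d), rand_matrix(k, d), rand_matrix(k, d)
)
print(len(out), len(out[0]), [round(w, 3) for w in kernels[0]])
```

In a mixed attention block, the output of heads like this one would be combined with the outputs of the remaining self-attention heads, which use the same Q. Each convolution head only touches k neighbouring tokens per position, whereas a self-attention head compares every position with the whole sequence.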

Experimental results

Research questions

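RQ1: Does span-based dynamic convolution capture local dependencies more efficiently than standard self-attention?
RQ2: Does combining span-based dynamic convolution with self-attention reduce redundancy and improve downstream task performance?
RQ3: At similar or lower training cost, how much does ConvBERT gain over BERT and ELECTRA on the GLUE and SQuAD benchmarks?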

Key findings

Model	Train FLOPs	Params	MNLI	QNLI	QQP	RTE	SST-2	MRPC	CoLA	STS-B	Avg.
ConvBERTbase	1.9e19 (15x)	106M	85.3	92.4	89.6	74.7	95.0	88.2	66.0	88.2	84.9
ConvBERTbase (train longer)	7.6e19 (59x)	106M	88.3	93.2	90.0	77.9	95.7	88.3	67.8	89.7	86.4
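
As a quick sanity check on the table, the Avg. column appears to be the unweighted mean of the eight task scores, rounded to one decimal. The sketch below recomputes it from the numbers in the two rows above and also compares the two Train FLOPs entries. The dictionary layout and variable names are illustrative only.

```python
GLUE_TASKS = ["MNLI", "QNLI", "QQP", "RTE", "SST-2", "MRPC", "CoLA", "STS-B"]

# Values copied from the two table rows above.
rows = {
    "ConvBERTbase": {
        "train_flops": 1.9e19,
        "scores": [85.3, 92.4, 89.6, 74.7, 95.0, 88.2, 66.0, 88.2],
        "reported_avg": 84.9,
    },
    "ConvBERTbase (train longer)": {
        "train_flops": 7.6e19,
        "scores": [88.3, 93.2, 90.0, 77.9, 95.7, 88.3, 67.8, 89.7],
        "reported_avg": 86.4,
    },
}

def mean(values):
    return sum(values) / len(values)

computed = {name: mean(row["scores"]) for name, row in rows.items()}
for name, row in rows.items():
    # Print the recomputed mean next to the reported Avg. value.
    print(f"{name}: computed {computed[name]:.3f}, reported {row['reported_avg']}")

# Ratio of pre-training compute between the longer and the shorter run.
flops_ratio = (
    rows["ConvBERTbase (train longer)"]["train_flops"]
    / rows["ConvBERTbase"]["train_flops"]
)
print(f"train-longer / base FLOPs ratio: {flops_ratio:.2f}")
```

Both recomputed means agree with the reported averages to within rounding, and the longer run uses about four times the pre-training FLOPs of the shorter one.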

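ConvBERT outperforms BERT and ELECTRA baselines of the same size on GLUE, at a lower pre-training cost.
The base-sized ConvBERT reaches a GLUE score of 86.4, 0.7 higher than ELECTRAbase, with less than 1/4 of the training cost.
Span-based dynamic convolution yields a notable improvement over plain dynamic convolution and over a parallel regular convolution.
Bottleneck attention and the grouped feed-forward module help reduce the parameter count while maintaining or improving performance.
The small and base ConvBERT models compare favorably with baseline models in FLOPs and parameter count while maintaining or improving task performance.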
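
The parameter saving from a grouped feed-forward layer can be made concrete with a small sketch. A grouped linear operator splits the input into g equal chunks and maps each chunk with its own smaller matrix, so the weight count drops by a factor of g. The layer sizes 768 and 3072 are the standard BERT-base feed-forward dimensions and g = 2 is an illustrative choice, not a setting taken from this review; biases are ignored and the function names are hypothetical.

```python
def dense_params(d_in, d_out):
    # Weight count of an ordinary linear layer (bias ignored).
    return d_in * d_out

def grouped_params(d_in, d_out, groups):
    # Each group maps d_in/groups inputs to d_out/groups outputs.
    if d_in % groups or d_out % groups:
        raise ValueError("dimensions must be divisible by the number of groups")
    return groups * (d_in // groups) * (d_out // groups)

def grouped_linear(x, weights):
    # weights: one matrix per group, each of shape (d_out/g, d_in/g).
    g = len(weights)
    chunk = len(x) // g
    out = []
    for i, W in enumerate(weights):
        part = x[i * chunk:(i + 1) * chunk]
        out.extend(sum(w * v for w, v in zip(row, part)) for row in W)
    return out

# Illustrative sizes: BERT-base feed-forward dimensions with two groups.
d_model, d_ff, g = 768, 3072, 2
print(dense_params(d_model, d_ff), grouped_params(d_model, d_ff, g))

# Tiny forward pass: four inputs, two groups of two, four outputs.
y = grouped_linear(
    [1.0, 2.0, 3.0, 4.0],
    [
        [[1.0, 0.0], [0.0, 1.0]],   # group 0 sees inputs 1.0 and 2.0
        [[1.0, 1.0], [1.0, -1.0]],  # group 1 sees inputs 3.0 and 4.0
    ],
)
print(y)
```

The trade-off is that channels in different groups no longer interact within the layer, which is why the finding above is phrased as reducing parameters while maintaining or improving performance rather than as a free gain.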


This review was generated by AI and reviewed by a human editor.