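[Paper Explained] A Mutual Information Maximization Perspective of Language Representation Learning
This paper reframes word representation learning as mutual information maximization via InfoNCE, which unifies Skip-gram, BERT, and XLNet under one view. It also introduces InfoWord, a self-supervised objective that combines Deep InfoMax (DIM) with masked language modeling (MLM) to improve performance on downstream tasks such as GLUE and SQuAD.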
We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).
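Motivation and Objectives
- Motivate a unified, information-theoretic view of word representation learning.
- Show how Skip-gram, BERT, and XLNet approximately maximize a mutual information objective.
- Provide a general, extensible framework for constructing new self-supervised tasks.
- Demonstrate a new objective that maximizes mutual information between a global sentence representation and local n-grams.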
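Proposed Method
- Frame Skip-gram, BERT, and XLNet as instances of the InfoNCE lower bound on the mutual information I(A, B).
- Score representations across the two views with f_theta(a, b) = g_psi(b)ᵀ g_omega(a).
- Derive how MLM and permutation-based objectives fit the InfoNCE view.
- Propose a new self-supervised objective, based on DIM, that maximizes mutual information between the global sentence representation and local n-grams.
- Introduce InfoWord as a weighted combination of the DIM term and the masked language modeling term: I_InfoWord = lambda_MLM * I_MLM + lambda_DIM * I_DIM.
- Demonstrate how negative sampling in InfoNCE serves as an efficient approximation to a softmax over a large vocabulary.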
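As a rough illustration of the scoring function and the InfoNCE bound listed above, the following Python sketch computes a one-sample estimate of the form f_theta(a, b) - log(sum over candidates of exp f_theta(a, b~)) + log |B~|. The candidate set holds the true b plus a few sampled negatives, standing in for the full vocabulary. The function names, the toy vectors, and the default weights of 1.0 are assumptions made for this example and are not taken from the paper; real encoders g_psi and g_omega would be neural networks rather than fixed vectors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def score(g_omega_a: Vector, g_psi_b: Vector) -> float:
    """f_theta(a, b) = g_psi(b)^T g_omega(a): dot product of the two encodings."""
    return sum(x * y for x, y in zip(g_psi_b, g_omega_a))


def info_nce_term(g_omega_a: Vector, g_psi_pos: Vector,
                  g_psi_negs: Sequence[Vector]) -> float:
    """One-sample InfoNCE estimate.

    The candidate set contains the positive view plus the sampled
    negatives, so only len(g_psi_negs) + 1 scores are computed instead
    of one score per vocabulary entry.
    """
    scores = [score(g_omega_a, g_psi_pos)]
    scores += [score(g_omega_a, neg) for neg in g_psi_negs]
    # Numerically stable log-sum-exp over the candidate set.
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[0] - log_sum_exp + math.log(len(scores))


def infoword_objective(i_mlm: float, i_dim: float,
                       lambda_mlm: float = 1.0,
                       lambda_dim: float = 1.0) -> float:
    """Weighted sum lambda_MLM * I_MLM + lambda_DIM * I_DIM.

    The default weights are placeholders, not values from the paper.
    """
    return lambda_mlm * i_mlm + lambda_dim * i_dim


# Toy encodings (hypothetical values): one context, its true target,
# and two sampled negatives.
context = [0.5, -1.0, 2.0]
target = [0.4, -0.8, 1.5]
negatives = [[-0.3, 0.9, -1.2], [1.0, 1.0, -0.5]]

estimate = info_nce_term(context, target, negatives)
print("InfoNCE estimate:", estimate)
print("Upper limit log|B~|:", math.log(len(negatives) + 1))
```

The estimate can never exceed the log of the candidate-set size, which is one reason the number of negatives matters in practice: a small candidate set caps how much mutual information the bound can certify.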
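Experimental Results
Research Questions
- RQ1: Can the mutual information maximization perspective unify classical and modern language representation learning methods?
- RQ2: Which new self-supervised tasks can be constructed within this framework to improve language representations?
- RQ3: Does combining the global-sentence DIM objective with MLM improve downstream NLP task performance over standard BERT-style pretraining?
- RQ4: How does the proposed InfoWord method perform on GLUE and SQuAD compared with BERT variants?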
Key Findings
| Model | CoLA | SST-2 | MRPC | QQP | MNLI (m/mm) | QNLI | RTE | GLUE |
|---|---|---|---|---|---|---|---|---|
| Base BERT | 52.1 | 93.5 | 88.9 | 71.2 | 84.6/83.4 | 90.5 | 66.4 | 78.8 |
| Base BERT-NCE | 50.8 | 93.0 | 88.6 | 70.5 | 83.2/83.0 | 90.9 | 65.9 | 78.2 |
| Base InfoWord | 53.3 | 92.5 | 88.7 | 71.0 | 83.7/82.4 | 91.4 | 68.3 | 78.9 |
| Large BERT | 60.5 | 94.9 | 89.3 | 72.1 | 86.7/85.9 | 92.7 | 70.1 | 81.5 |
| Large BERT-NCE | 54.7 | 93.1 | 89.5 | 71.2 | 85.8/85.0 | 92.7 | 72.5 | 80.6 |
| Large InfoWord | 57.5 | 94.2 | 90.2 | 71.3 | 85.8/84.8 | 92.6 | 72.0 | 81.1 |
- InfoNCE-based framing unifies Skip-gram, BERT, and XLNet as instances of mutual information maximization.
- A simple new objective (DIM) enables learning a global sentence representation that aligns with its local n-gram representations.
- InfoWord, which combines I_MLM and I_DIM, yields better results than BERT-NCE on GLUE and SQuAD, especially for tasks needing longer-phrase understanding.
- Reimplementation variants (BERT-NCE) are competitive with original BERT in some settings and underperform in others due to masking and data presentation differences.
- Experiments indicate InfoWord’s advantage is most pronounced with smaller training sets, highlighting pretraining quality’s role when labeled data is scarce.
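To make the GLUE comparison concrete, the short Python sketch below recomputes the gaps between models using only the GLUE column of the table above. The dictionary layout and the helper name `glue_gap` are choices made for this example. On these numbers, InfoWord is ahead of BERT-NCE at both sizes, while against the original BERT it is marginally ahead at Base size and behind at Large size.

```python
# GLUE column copied from the results table above.
glue = {
    ("Base", "BERT"): 78.8,
    ("Base", "BERT-NCE"): 78.2,
    ("Base", "InfoWord"): 78.9,
    ("Large", "BERT"): 81.5,
    ("Large", "BERT-NCE"): 80.6,
    ("Large", "InfoWord"): 81.1,
}


def glue_gap(size: str, model: str, baseline: str) -> float:
    """Difference in GLUE score (model minus baseline), rounded to one decimal."""
    return round(glue[(size, model)] - glue[(size, baseline)], 1)


for size in ("Base", "Large"):
    for baseline in ("BERT-NCE", "BERT"):
        gap = glue_gap(size, "InfoWord", baseline)
        print(f"{size} InfoWord vs {baseline}: {gap:+.1f}")
```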
This explainer was generated by AI and reviewed by a human editor.