QUICK REVIEW

[Paper Review] A Mutual Information Maximization Perspective of Language Representation Learning

Lingpeng Kong, Cyprien de Masson d’Autume | arXiv (Cornell University) | Oct 18, 2019

Topic Modeling | References: 31 | Cited by: 81

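One-sentence summary

This paper reframes word representation learning as mutual information maximization via InfoNCE, unifying Skip-gram, BERT, and XLNet under that view. It also introduces InfoWord, a self-supervised objective that combines DIM (Deep InfoMax) with MLM (masked language modeling) to improve performance on downstream tasks such as GLUE and SQuAD.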

ABSTRACT

We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).

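Motivation and objectives

Motivate a unified, information-theoretic view of word representation learning.
Show how Skip-gram, BERT, and XLNet approximate a mutual information maximization objective.
Provide a general, extensible framework for creating new self-supervised tasks.
Demonstrate a new objective that maximizes mutual information between a global sentence representation and local n-grams.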

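Proposed method

Frame Skip-gram, BERT, and XLNet as instances of the InfoNCE lower bound on the mutual information I(A, B).
Score cross-view representations with f_theta(a, b) = g_psi(b)ᵀ g_omega(a).
Derive how MLM and permutation-based objectives fit the InfoNCE view.
Propose a new self-supervised objective (based on DIM) that maximizes mutual information between a global sentence representation and local n-grams.
Introduce InfoWord as a weighted combination of the DIM term and the masked language modeling term: I_InfoWord = lambda_MLM * I_MLM + lambda_DIM * I_DIM.
Show how negative sampling in InfoNCE serves as an efficient approximation to a softmax over a large vocabulary.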
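
As a rough illustration of the items above, the sketch below computes one sample's InfoNCE term with a dot-product score and a small set of sampled negatives, then forms the weighted InfoWord combination. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the pure-Python style are all hypothetical, and the encoders g_psi and g_omega are replaced by fixed vectors. The constant log of the candidate-set size that appears in the InfoNCE bound is left out.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def info_nce_term(a_vec, b_pos, b_negs):
    """One sample's InfoNCE term: f(a, b) - log sum over candidates of exp f(a, b~).

    The candidate set is the positive b plus the sampled negatives, and
    f(a, b) is the dot-product score described in the list above.
    """
    scores = [dot(b_pos, a_vec)] + [dot(b_neg, a_vec) for b_neg in b_negs]
    top = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[0] - log_norm  # never positive


def infoword_objective(i_mlm, i_dim, lambda_mlm, lambda_dim):
    """Weighted combination I_InfoWord = lambda_MLM * I_MLM + lambda_DIM * I_DIM."""
    return lambda_mlm * i_mlm + lambda_dim * i_dim


# Toy vectors (hypothetical): the positive aligns with a, the negatives do not.
a = [1.0, 0.0]
b = [2.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
term = info_nce_term(a, b, negatives)
print(term)
```

Scoring only a handful of sampled negatives instead of every vocabulary item is what makes this cheaper than a full softmax; with the whole vocabulary as the candidate set, the same expression becomes an ordinary log-softmax.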

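Experimental results

Research questions

RQ1: Can the mutual information maximization perspective unify classical and modern language representation learning methods?
RQ2: Which new self-supervised tasks can be constructed within this framework to improve language representations?
RQ3: Does combining the global sentence DIM objective with MLM improve downstream NLP performance over standard BERT-style pretraining?
RQ4: How does the proposed InfoWord method perform on GLUE and SQuAD compared with BERT variants?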

Key findings

Model	CoLA	SST-2	MRPC	QQP	MNLI	QNLI	RTE	GLUE
Base BERT	52.1	93.5	88.9	71.2	84.6/83.4	90.5	66.4	78.8
Base BERT-NCE	50.8	93.0	88.6	70.5	83.2/83.0	90.9	65.9	78.2
Base InfoWord	53.3	92.5	88.7	71.0	83.7/82.4	91.4	68.3	78.9
Large BERT	60.5	94.9	89.3	72.1	86.7/85.9	92.7	70.1	81.5
Large BERT-NCE	54.7	93.1	89.5	71.2	85.8/85.0	92.7	72.5	80.6
Large InfoWord	57.5	94.2	90.2	71.3	85.8/84.8	92.6	72.0	81.1
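
The GLUE column appears consistent with an unweighted mean of the eight per-task scores in each row, counting the two MNLI numbers separately. This is an observation from the figures in the table, not something the source states, and the short check below is only a sketch: the scores are copied from the table, while the names `rows`, `mean`, and `max_deviation` are hypothetical.

```python
# Per-task scores copied from the table above, in column order; the two MNLI
# numbers are listed as separate entries. The last value is the reported GLUE score.
rows = {
    "Base BERT": ([52.1, 93.5, 88.9, 71.2, 84.6, 83.4, 90.5, 66.4], 78.8),
    "Base BERT-NCE": ([50.8, 93.0, 88.6, 70.5, 83.2, 83.0, 90.9, 65.9], 78.2),
    "Base InfoWord": ([53.3, 92.5, 88.7, 71.0, 83.7, 82.4, 91.4, 68.3], 78.9),
    "Large BERT": ([60.5, 94.9, 89.3, 72.1, 86.7, 85.9, 92.7, 70.1], 81.5),
    "Large BERT-NCE": ([54.7, 93.1, 89.5, 71.2, 85.8, 85.0, 92.7, 72.5], 80.6),
    "Large InfoWord": ([57.5, 94.2, 90.2, 71.3, 85.8, 84.8, 92.6, 72.0], 81.1),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


def max_deviation(table):
    """Largest gap between a row's mean score and its reported GLUE value."""
    return max(abs(mean(scores) - reported) for scores, reported in table.values())


for name, (scores, reported) in rows.items():
    print(f"{name}: mean={mean(scores):.3f} reported={reported}")
```

If the assumption holds, every gap stays within one-decimal rounding of the reported value.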

InfoNCE-based framing unifies Skip-gram, BERT, and XLNet as instances of mutual information maximization.
A simple new objective (DIM) enables learning a global sentence representation that aligns with its local n-gram representations.
InfoWord, which combines I_MLM and I_DIM, yields better results than BERT-NCE on GLUE and SQuAD, especially for tasks needing longer-phrase understanding.
Reimplementation variants (BERT-NCE) are competitive with original BERT in some settings and underperform in others due to masking and data presentation differences.
Experiments indicate InfoWord’s advantage is most pronounced with smaller training sets, highlighting pretraining quality’s role when labeled data is scarce.


This review was generated by AI and reviewed by a human editor.