QUICK REVIEW

[Paper Review] A Mutual Information Maximization Perspective of Language Representation Learning

Lingpeng Kong, Cyprien de Masson d’Autume | arXiv (Cornell University) | Oct 18, 2019
Topic Modeling | References: 31 | Cited by: 81
One-Sentence Summary

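This paper recasts word representation learning as mutual information maximization via InfoNCE, unifying Skip-gram, BERT, and XLNet under one view, and introduces InfoWord, a self-supervised objective that combines DIM (Deep InfoMax) with MLM (masked language modeling) to improve performance on downstream tasks such as GLUE and SQuAD.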

ABSTRACT

We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).

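Motivation and Objectives

  • Motivate a unified, information-theoretic view of word representation learning.
  • Show how Skip-gram, BERT, and XLNet each approximate a mutual information maximization objective.
  • Provide a general, extensible framework for constructing new self-supervised tasks.
  • Demonstrate a new objective that maximizes mutual information between a global sentence representation and local n-grams.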

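Proposed Method

  • Frame Skip-gram, BERT, and XLNet as instances of the InfoNCE lower bound on the mutual information I(A, B).
  • Score cross-view representations with f_theta(a, b) = g_psi(b)ᵀ g_omega(a).
  • Derive how MLM and permutation-based objectives fit the InfoNCE view.
  • Propose a new self-supervised objective (based on DIM) that maximizes mutual information between a global sentence representation and local n-grams.
  • Introduce InfoWord as a weighted combination of the DIM term and the masked language modeling term: I_InfoWord = lambda_MLM * I_MLM + lambda_DIM * I_DIM.
  • Show how negative sampling in InfoNCE serves as an efficient approximation to a softmax over a large vocabulary.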
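
The scoring function and the weighted objective listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`dot`, `info_nce_term`, `info_word`), the toy random vectors standing in for encoder outputs, and the default weights of 1.0 are all assumptions made for the example. The InfoNCE term is written in its common single-sample form, f(a, b) − log Σ exp f(a, b̃) + log|B̃|, where the candidate set B̃ contains the positive plus the sampled negatives.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def info_nce_term(a_vec, b_pos, b_negs):
    """Single-sample InfoNCE term for one positive pair (a, b).

    Computes f(a, b) - log(sum over candidates of exp f(a, b~)) + log|B~|,
    where the candidate set B~ holds the positive b plus the sampled
    negatives, and f(a, b) = g(b)^T g(a) is a dot product of vectors that
    are assumed to be already encoded.
    """
    candidates = [b_pos] + list(b_negs)
    scores = [dot(b, a_vec) for b in candidates]
    top = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[0] - log_sum_exp + math.log(len(candidates))


def info_word(i_mlm, i_dim, lambda_mlm=1.0, lambda_dim=1.0):
    """Weighted sum I_InfoWord = lambda_MLM * I_MLM + lambda_DIM * I_DIM.

    The default weights are placeholders, not values taken from the paper.
    """
    return lambda_mlm * i_mlm + lambda_dim * i_dim


if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_negatives = 8, 5
    # Toy stand-ins for encoder outputs: a context vector, a correlated
    # positive, and unrelated negatives.
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = [x + 0.1 * rng.gauss(0, 1) for x in a]
    negs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_negatives)]
    i_dim = info_nce_term(a, b, negs)
    i_mlm = info_nce_term(a, b, negs[:3])
    print(info_word(i_mlm, i_dim))
```

Because the positive is always among the candidates, each term is bounded above by log|B̃|, which is one way to see why the number of negatives limits how much mutual information this estimator can report. In practice the term would be averaged over many (a, b) pairs rather than evaluated once.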

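Experimental Results

Research Questions

  • RQ1: Can the mutual information maximization perspective unify classical and modern language representation learning methods?
  • RQ2: Which new self-supervised tasks can be built within this framework to improve language representations?
  • RQ3: Does combining the global-sentence DIM objective with MLM improve on standard BERT-style pretraining for downstream NLP tasks?
  • RQ4: How does the proposed InfoWord method perform on GLUE and SQuAD compared with BERT variants?

Key Findings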

Model             CoLA   SST-2   MRPC   QQP    MNLI        QNLI   RTE    GLUE
Base BERT         52.1   93.5    88.9   71.2   84.6/83.4   90.5   66.4   78.8
Base BERT-NCE     50.8   93.0    88.6   70.5   83.2/83.0   90.9   65.9   78.2
Base InfoWord     53.3   92.5    88.7   71.0   83.7/82.4   91.4   68.3   78.9
Large BERT        60.5   94.9    89.3   72.1   86.7/85.9   92.7   70.1   81.5
Large BERT-NCE    54.7   93.1    89.5   71.2   85.8/85.0   92.7   72.5   80.6
Large InfoWord    57.5   94.2    90.2   71.3   85.8/84.8   92.6   72.0   81.1

  • InfoNCE-based framing unifies Skip-gram, BERT, and XLNet as instances of mutual information maximization.
  • A simple new objective (DIM) enables learning a global sentence representation that aligns with its local n-gram representations.
  • InfoWord, which combines I_MLM and I_DIM, yields better results than BERT-NCE on GLUE and SQuAD, especially for tasks needing longer-phrase understanding.
  • Reimplementation variants (BERT-NCE) are competitive with original BERT in some settings and underperform in others due to masking and data presentation differences.
  • Experiments indicate InfoWord’s advantage is most pronounced with smaller training sets, highlighting pretraining quality’s role when labeled data is scarce.
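
As a quick sanity check on the GLUE table above, the short script below recomputes the last column. It assumes that the GLUE score is the unweighted mean of the eight per-task numbers, with the two MNLI values counted separately; this is an inference from the figures themselves, not something the summary states. The names `ROWS` and `glue_mean` are introduced here for illustration.

```python
# Per-task scores copied from the table: CoLA, SST-2, MRPC, QQP,
# MNLI (two values), QNLI, RTE, followed by the reported GLUE score.
ROWS = {
    "Base BERT":      ([52.1, 93.5, 88.9, 71.2, 84.6, 83.4, 90.5, 66.4], 78.8),
    "Base BERT-NCE":  ([50.8, 93.0, 88.6, 70.5, 83.2, 83.0, 90.9, 65.9], 78.2),
    "Base InfoWord":  ([53.3, 92.5, 88.7, 71.0, 83.7, 82.4, 91.4, 68.3], 78.9),
    "Large BERT":     ([60.5, 94.9, 89.3, 72.1, 86.7, 85.9, 92.7, 70.1], 81.5),
    "Large BERT-NCE": ([54.7, 93.1, 89.5, 71.2, 85.8, 85.0, 92.7, 72.5], 80.6),
    "Large InfoWord": ([57.5, 94.2, 90.2, 71.3, 85.8, 84.8, 92.6, 72.0], 81.1),
}


def glue_mean(scores):
    """Unweighted mean of the per-task scores (assumed aggregation rule)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, (scores, reported) in ROWS.items():
        print(f"{name:15s} mean={glue_mean(scores):.3f} reported={reported}")
    # Gap between InfoWord and the BERT-NCE reimplementation at each size.
    for size in ("Base", "Large"):
        gap = ROWS[f"{size} InfoWord"][1] - ROWS[f"{size} BERT-NCE"][1]
        print(f"{size}: InfoWord minus BERT-NCE = {gap:.1f}")
```

Under that assumption, every recomputed mean agrees with the reported GLUE column to within rounding. InfoWord leads BERT-NCE on the aggregate at both model sizes, while Large BERT remains ahead of both.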

This review was generated by AI and checked by human editors.