QUICK REVIEW

[Paper Review] How Language-Neutral is Multilingual BERT?

Jindřich Libovický, Rudolf Rosa|arXiv (Cornell University)|Nov 8, 2019

Topic Modeling | References: 17 | Cited by: 76

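One-Sentence Summary

The paper shows that mBERT representations contain both a language-specific and a language-neutral component. Centering makes the representations more language-neutral and improves sentence retrieval, a supervised linear projection improves cross-lingual retrieval substantially further, and MT quality estimation remains challenging.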

ABSTRACT

Multilingual BERT (mBERT) provides sentence representations for 104 languages, which are useful for many multi-lingual tasks. Previous work probed the cross-linguality of mBERT using zero-shot transfer learning on morphological and syntactic tasks. We instead focus on the semantic properties of mBERT. We show that mBERT representations can be split into a language-specific component and a language-neutral component, and that the language-neutral component is sufficiently general in terms of modeling semantics to allow high-accuracy word-alignment and sentence retrieval but is not yet good enough for the more difficult task of MT quality estimation. Our work presents interesting challenges which must be solved to build better language-neutral representations, particularly for tasks requiring linguistic transfer of semantics.

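Motivation and Objectives

Evaluate the semantic cross-lingual properties of mBERT, going beyond zero-shot transfer on morphological and syntactic tasks.
Decompose mBERT sentence representations into a language-specific component and a language-neutral component.
Assess language neutrality through sentence retrieval, word alignment, and MT quality estimation.
Study methods for improving language neutrality: centering, projection, targeted fine-tuning, and adversarial removal of language information.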

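Proposed Methods

Center the representations by subtracting the language centroid from each sentence representation, which removes language-specific information.
Probe the representations layer by layer with a set of tasks: language identification, language similarity, parallel sentence retrieval, word alignment, and MT quality estimation.
Evaluate a linear projection into the English representation space, fitted on a small amount of parallel data.
Compare uncentered, centered, and projection-based representations on the retrieval and alignment tasks.
Fine-tune with UDify, also in combination with an adversarial lng-free setup, to test the effect on language neutrality.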
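
The centering step and the retrieval comparison can be illustrated with a minimal sketch. It is not the authors' code: the synthetic vectors stand in for mBERT sentence embeddings, the assumption that each language adds a constant offset to a shared "meaning" vector is a simplification, and the helper names `center_by_language` and `retrieve` are hypothetical.

```python
import numpy as np


def center_by_language(vectors, languages):
    """Subtract each language's centroid (mean vector) from its sentence vectors."""
    vectors = np.asarray(vectors, dtype=float)
    languages = np.asarray(languages)
    centered = np.empty_like(vectors)
    for lang in np.unique(languages):
        mask = languages == lang
        centered[mask] = vectors[mask] - vectors[mask].mean(axis=0)
    return centered


def retrieve(queries, candidates):
    """For each query row, return the index of the most cosine-similar candidate row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)


# Synthetic stand-in for parallel sentence embeddings (assumption, not real mBERT output):
# a shared meaning vector plus a large language-specific offset and a little noise.
rng = np.random.default_rng(0)
n, dim = 50, 16
meaning = rng.normal(size=(n, dim))
en = meaning + 5.0 * rng.normal(size=dim) + 0.05 * rng.normal(size=(n, dim))
de = meaning + 5.0 * rng.normal(size=dim) + 0.05 * rng.normal(size=(n, dim))

vectors = np.vstack([en, de])
languages = np.array(["en"] * n + ["de"] * n)
centered = center_by_language(vectors, languages)
en_c, de_c = centered[:n], centered[n:]

# Sentence i in one language should retrieve sentence i in the other.
gold = np.arange(n)
acc_raw = (retrieve(en, de) == gold).mean()
acc_centered = (retrieve(en_c, de_c) == gold).mean()
print(f"retrieval accuracy, uncentered: {acc_raw:.2f}")
print(f"retrieval accuracy, centered:   {acc_centered:.2f}")
```

With real embeddings, `vectors` would hold one pooled mBERT vector per sentence and the centroids would be estimated from monolingual text in each language; the retrieval step stays the same.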

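Experimental Results

Research Questions

RQ1: To what extent is multilingual BERT language-neutral on semantic tasks across its 104 languages?
RQ2: Can centering or a linear projection yield language-agnostic representations that are useful for cross-lingual retrieval and alignment?
RQ3: How does fine-tuning on multilingual syntax and morphology, or adversarially removing language identity, affect semantic cross-linguality?
RQ4: Which tasks best reflect cross-lingual semantic transfer, and where do current representations fail (e.g., MT quality estimation)?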

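Key Findings

Centering the representations lowers language identification accuracy, indicating that language-specific signal has been removed.
Language centroids largely cluster by language family, so they capture some notion of language similarity.
Centering substantially improves cross-lingual sentence retrieval; a projection fitted on a small amount of parallel data improves accuracy further, to near-perfect retrieval.
Word alignment based on mBERT representations outperforms FastAlign on several language pairs and is largely unaffected by centering.
For MT quality estimation, distances between uncentered or projected representations correlate only weakly with quality scores; supervised regression performs best, and centering alone is not sufficient for QE.
Fine-tuning (UDify) does not remove language identity and may reduce semantic cross-linguality; adversarial language removal (lng-free) can suppress the language signal without hurting the other tasks.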
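
The projection finding can be sketched in the same spirit. This summary does not say how the projection is fitted, so ordinary least squares is used here as an assumption. The synthetic "German" vectors are a rotated and scaled copy of the "English" ones plus noise, which is an illustrative setup rather than a property of mBERT, and `fit_projection` and `retrieve` are hypothetical helper names.

```python
import numpy as np


def fit_projection(source, target):
    """Least-squares matrix W minimizing ||source @ W - target||."""
    W, *_ = np.linalg.lstsq(source, target, rcond=None)
    return W


def retrieve(queries, candidates):
    """For each query row, return the index of the most cosine-similar candidate row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)


rng = np.random.default_rng(1)
n_train, n_test, dim = 200, 50, 16

# Assumed relationship between the two spaces: a fixed rotation and scaling plus noise.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
en_all = rng.normal(size=(n_train + n_test, dim))
de_all = 2.0 * (en_all @ rotation) + 0.01 * rng.normal(size=(n_train + n_test, dim))

# Fit on the small parallel "training" portion, then evaluate on held-out pairs.
W = fit_projection(de_all[:n_train], en_all[:n_train])
projected = de_all[n_train:] @ W
en_test = en_all[n_train:]

accuracy = (retrieve(projected, en_test) == np.arange(n_test)).mean()
print(f"retrieval accuracy after projection into the English space: {accuracy:.2f}")
```

Unlike centering, this approach needs parallel sentence pairs to fit `W`, which is why the review describes it as supervised.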

This review was generated by AI and checked by a human editor.