QUICK REVIEW

[Paper Digest] Pretraining on Non-linguistic Structure as a Tool for Analyzing Learning Bias in Language Models

Isabel Papadimitriou, Dan Jurafsky | arXiv (Cornell University) | Apr 30, 2020

Natural Language Processing Techniques | References: 24 | Cited by: 9

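One-Sentence Summary

This paper proposes a transfer learning method for studying how neural language models encode grammatical structure, pretraining LSTMs on non-linguistic structured data such as music and Java code and testing them on human language; it finds that adding even minimal amounts of structure to the training data makes a large difference in transfer to natural language, and that zero-shot performance between human languages is highly correlated with their syntactic similarity, which suggests that the learned internal representations are typologically coherent.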

ABSTRACT

We propose a novel methodology for analyzing the encoding of grammatical structure in neural language models through transfer learning. We test how a language model can leverage its internal representations to transfer knowledge across languages and symbol systems. We train LSTMs on non-linguistic, structured data and test their performance on human language to assess which kinds of data induce generalizable encodings that LSTMs can use for natural language. We find that models trained on structured data such as music and Java code have internal representations that help in modelling human language, and that, surprisingly, adding minimal amounts of structure to the training data makes a large difference in transfer to natural language. Further experiments on transfer between human languages show that zero-shot performance on a test language is highly correlated with syntactic similarity to the training language, even after removing any vocabulary overlap. This suggests that the internal representations induced from natural languages are typologically coherent: they encode the features and differences outlined in typological studies. Our results provide insights into how neural networks represent linguistic structure, and also about the kinds of structural biases that give learners the ability to model language.

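Motivation and Goals

Study how neural language models encode grammatical structure, using transfer learning.
Evaluate whether pretraining on non-linguistic structured data (such as music and code) improves generalization to human language.
Examine the role that structure in the training data plays in shaping the inductive biases of language models.
Assess whether the internal representations of language models reflect typological features of languages.
Determine how well syntactic similarity between languages predicts zero-shot transfer performance.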

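Proposed Method

Pretrain LSTMs on non-linguistic structured data, including music and Java source code, to induce structural representations.
Fine-tune the pretrained models on human language data to evaluate transfer.
Measure zero-shot transfer between language pairs that differ in syntactic similarity and share no vocabulary.
Quantify the structural similarity between training and test languages with syntactic typological measures.
Compare different types of structured data to assess which ones induce more generalizable representations.
Analyze the internal representations to determine whether they encode typologically meaningful linguistic features.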
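
The vocabulary-overlap control in the steps above can be illustrated with a small sketch. It assumes one simple way to guarantee zero overlap: each language gets its own id range, so identical surface strings in two languages never share an embedding index. The function names `build_vocab` and `encode`, the offset scheme, and the toy sentences are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def build_vocab(corpus: List[List[str]], offset: int = 0) -> Dict[str, int]:
    """Map each distinct token to an integer id, starting at `offset`."""
    vocab: Dict[str, int] = {}
    for sentence in corpus:
        for token in sentence:
            if token not in vocab:
                vocab[token] = offset + len(vocab)
    return vocab


def encode(corpus: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Replace every token with its id from `vocab`."""
    return [[vocab[token] for token in sentence] for sentence in corpus]


# Toy corpora, for illustration only (not data from the paper).
train_corpus = [["la", "casa", "es", "grande"], ["no", "es", "grande"]]
test_corpus = [["la", "maison", "est", "grande"], ["no", "problem"]]

# Assumption: the test language's ids start where the training ids end,
# so the two id ranges are disjoint even when surface strings coincide.
train_vocab = build_vocab(train_corpus)
test_vocab = build_vocab(test_corpus, offset=len(train_vocab))

train_ids = encode(train_corpus, train_vocab)
test_ids = encode(test_corpus, test_vocab)

shared_strings = set(train_vocab) & set(test_vocab)
shared_ids = set(train_vocab.values()) & set(test_vocab.values())
print("shared surface strings:", sorted(shared_strings))
print("shared ids:", sorted(shared_ids))
```

With disjoint id ranges, any transfer that remains cannot come from reusing the embeddings of shared words, which is the point of the control.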

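Experimental Results

Research Questions

RQ1: Does pretraining on non-linguistic structured data improve a language model's generalization to human language?
RQ2: To what extent does the structural content of the training data affect transfer to human language?
RQ3: With no vocabulary overlap, does zero-shot transfer performance between languages correlate with their syntactic similarity?
RQ4: Do language models learn typologically coherent representations that reflect known language typology?
RQ5: Which structural biases in the training data give the most effective inductive bias for language modelling?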

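Key Findings

Models pretrained on structured data such as music and Java code have internal representations that help in modelling human language.
Adding even minimal amounts of structure to the training data makes a large difference in transfer to human language.
Zero-shot performance on a test language is highly correlated with its syntactic similarity to the training language, even after all vocabulary overlap is removed.
The internal representations of the language models are typologically coherent: they encode the features and differences described in typological studies.
The structural biases learned by the networks are consistent with the language typology documented by linguists.
Pretraining on non-linguistic structure is an effective way to probe and analyze the inductive biases of neural language models.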
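
The correlation finding can be made concrete with a short sketch. It assumes that syntactic distance is the fraction of binary typological features on which two languages differ, and that transfer quality is a perplexity-like score where lower is better. Both choices, the function names, the feature vectors, and every number are placeholders for illustration; they are not the paper's measures or results.

```python
import math
from typing import List, Sequence


def typological_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of features on which two languages differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical binary typological features for one training language.
train_features = [1, 0, 1, 1]
# Hypothetical test languages: (features, made-up perplexity-like score).
test_languages = {
    "lang_a": ([1, 0, 1, 1], 10.0),
    "lang_b": ([1, 0, 0, 1], 12.0),
    "lang_c": ([0, 1, 0, 1], 20.0),
    "lang_d": ([0, 1, 0, 0], 19.0),
}

distances = [typological_distance(train_features, f) for f, _ in test_languages.values()]
scores = [s for _, s in test_languages.values()]
rho = spearman(distances, scores)
print(f"Spearman correlation between distance and score: {rho:.2f}")
```

A rank correlation is used here because it only assumes that greater syntactic distance goes with worse transfer, not that the relationship is linear.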


This digest was generated by AI and reviewed by a human editor.