QUICK REVIEW

[Paper Digest] Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis

Kelly Zhang, Samuel R. Bowman|arXiv (Cornell University)|Sep 26, 2018

Topic Modeling | References: 22 | Cited by: 41

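One-sentence summary

The paper compares four pretraining objectives (language modeling, translation, skip-thought, and autoencoding) and shows that bidirectional language models yield the strongest syntactic representations on POS tagging and CCG supertagging, often outperforming translation encoders; it also finds that randomly initialized, frozen LSTMs perform strikingly well when the auxiliary task has ample training data.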

ABSTRACT

Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. We still, though, do not yet have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn. With this in mind, we compare four objectives---language modeling, translation, skip-thought, and autoencoding---on their ability to induce syntactic and part-of-speech information. We make a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggest that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. We also find that the representations from randomly-initialized, frozen LSTMs perform strikingly well on our syntactic auxiliary tasks, but this effect disappears when the amount of training data for the auxiliary tasks is reduced.

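Motivation and goals

Understand how the choice of pretraining objective shapes the linguistic representations a model learns.
Compare pretraining tasks fairly by holding the data source, data quantity, and model architecture constant.
Evaluate the syntactic knowledge in pretrained representations with auxiliary classifiers for POS tagging and CCG supertagging.
Examine how training data quantity and random initialization affect the learned representations.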

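Proposed method

Train multiple LSTM-based models on English-German translation data and monolingual data for four objectives: language modeling (LM), translation, skip-thought, and autoencoding.
Build bidirectional LM (BiLM) representations by concatenating the forward and backward LM hidden states for each token.
Freeze the pretrained encoders and train auxiliary classifiers (multilayer perceptrons, MLPs) for POS tagging and CCG supertagging to probe the hidden states for syntactic information.
Compare against untrained (randomly initialized) LSTMs and a word-conditional most frequent class (WC-MFC) baseline to separate learned information from memorized information.
Vary the pretraining data size (1M, 5M, 15M, and 63M sentences) and the fraction of classifier training data (1%, 10%, and 100%) to study data effects.
Control the evaluation domain by drawing the POS and CCG tagging data from the WSJ/PTB and CCG Bank datasets.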
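
The WC-MFC baseline mentioned above can be illustrated with a minimal sketch. It assumes the usual reading of "word-conditional most frequent class": each word receives the tag it carried most often in the training data. The fallback to the globally most frequent tag for unseen words, the function names, and the toy data are assumptions made for this illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict


def train_wc_mfc(tagged_sentences):
    """Fit a word-conditional most-frequent-class tagger.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns (word_to_tag, fallback_tag).
    """
    per_word = defaultdict(Counter)  # tag counts for each word type
    overall = Counter()              # tag counts over the whole corpus
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    if not overall:
        raise ValueError("training data is empty")
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    # Assumption: unseen words fall back to the globally most frequent tag.
    fallback_tag = overall.most_common(1)[0][0]
    return word_to_tag, fallback_tag


def predict_wc_mfc(words, word_to_tag, fallback_tag):
    """Tag each word independently of its context."""
    return [word_to_tag.get(w, fallback_tag) for w in words]


def accuracy(tagged_sentences, word_to_tag, fallback_tag):
    """Token-level accuracy of the baseline on tagged sentences."""
    correct = total = 0
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        gold = [t for _, t in sentence]
        pred = predict_wc_mfc(words, word_to_tag, fallback_tag)
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total if total else 0.0


# Hypothetical toy corpus, for illustration only.
TOY_TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS"), ("count", "VBP")],
    [("dogs", "NNS"), ("run", "VBP")],
    [("the", "DT"), ("runs", "VBZ")],
]

word_to_tag, fallback_tag = train_wc_mfc(TOY_TRAIN)
print(predict_wc_mfc(["the", "runs", "cat"], word_to_tag, fallback_tag))
```

Because this baseline ignores context entirely, any accuracy a probing classifier gains over it reflects information that the encoder's hidden states carry beyond the identity of the current word.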

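Experimental results

Research questions

RQ1: How does the training task (LM, translation, skip-thought, autoencoding) affect the encoding of syntactic information?
RQ2: Does the amount of training data affect how well pretrained representations support syntactic auxiliary tasks?
RQ3: Can a randomly initialized encoder support syntactic tagging when the auxiliary classifier has ample data?
RQ4: How do the choices of layer and architecture affect the syntactic information captured in the hidden representations?
RQ5: Do bidirectional contexts (BiLMs) offer an advantage for syntactic transfer over unidirectional or translation-based encoders?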

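Key findings

Across data sizes, bidirectional language models (BiLMs) consistently outperform the other objectives (translation, skip-thought, autoencoding) on POS tagging and CCG supertagging.
BiLMs trained on only 1 million sentences outperform translation and skip-thought models trained on larger amounts of data, indicating that syntax is learned data-efficiently.
Even with the same data, BiLMs often beat translation encoders, with a larger margin on CCG supertagging than on POS tagging.
With ample auxiliary classifier data, randomly initialized LSTMs perform surprisingly well, but their performance collapses when classifier data is limited, which points to memorization rather than genuine syntactic encoding.
Word identity probing shows that trained encoders outperform untrained ones on the tagging tasks, confirming that learned representations capture more than simple neighboring-word information.
Lower LSTM layers store more immediate neighborhood information, while higher layers encode more distant context, suggesting that depth extends the receptive field for syntactic structure.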

This digest was generated by AI and reviewed by human editors.