QUICK REVIEW

[Paper Review] How much pretraining data do language models need to learn syntax?

Laura Pérez-Mayos, Miguel Ballesteros|arXiv (Cornell University)|Sep 7, 2021

Topic Modeling | Cited by 4

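One-Sentence Summary

This study uses the MiniBERTa models, pretrained on corpora of 1 million to 1 billion words, to examine how pretraining data size affects syntax learning in RoBERTa. More data improves syntactic encoding and downstream performance, but the gains are incremental and come at a high financial and environmental cost; on some specific syntactic phenomena, models pretrained on less data even outperform those pretrained on more.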

ABSTRACT

Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not always offer a better performance across the different syntactic phenomena and come at a higher financial and environmental cost.

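Motivation and Objectives

Assess how increasing pretraining data size affects the syntactic knowledge that RoBERTa models acquire.
Assess whether models pretrained on more data generalize better across diverse syntactic phenomena.
Compare models pretrained on different data sizes on three downstream tasks: part-of-speech tagging, dependency parsing, and paraphrase identification.
Analyze the financial and environmental cost-benefit trade-off of pretraining on more data.
Determine whether perplexity correlates with improved syntactic generalization.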

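Proposed Method

Study 12 RoBERTa models (the MiniBERTas) pretrained on incremental data sizes from 1 million to 1 billion words.
Apply the syntactic structural probes of Hewitt and Manning (2019) to measure how much syntactic information the models encode.
Use SyntaxGym and the syntactic test suites of Hu et al. (2020) to evaluate syntactic generalization across six test circuits.
Fine-tune the models on three downstream tasks: part-of-speech tagging, dependency parsing (LAS), and paraphrase identification (F1).
Estimate training cost and CO2 emissions from compute usage and the number of pretraining runs per model.
Conduct a cost-benefit analysis that compares performance gains against financial and environmental costs.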
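
To make the structural-probe step above more concrete, here is a minimal sketch of the distance that such a probe computes: a linear map B is applied to the difference of two word vectors, and the squared length of the result is compared against the distance between the two words in the parse tree. The function names, the identity matrix, and the toy vectors are illustrative assumptions, not values or code from the paper, and the training of B is not shown.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project(B: Matrix, v: Vector) -> List[float]:
    """Apply the linear map B (k x d) to a d-dimensional vector."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]


def probe_squared_distance(B: Matrix, h_i: Vector, h_j: Vector) -> float:
    """Squared distance between two word vectors after projecting with B."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(p * p for p in project(B, diff))


def probe_distance_matrix(B: Matrix, H: Sequence[Vector]) -> List[List[float]]:
    """Pairwise squared probe distances for all words in one sentence."""
    return [[probe_squared_distance(B, h_i, h_j) for h_j in H] for h_i in H]


# Toy inputs (hypothetical, not taken from the paper).
toy_B = [[1.0, 0.0], [0.0, 1.0]]  # identity map, for illustration only
toy_H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up word vectors
toy_distances = probe_distance_matrix(toy_B, toy_H)
```

In an actual probing setup, B would be learned so that these squared distances approximate tree distances, and the rows of H would be contextual representations taken from one layer of the model.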

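Experimental Results

Research Questions

RQ1: Does increasing pretraining data size lead RoBERTa models to encode more syntactic information?
RQ2: Do models pretrained on more data generalize better across diverse syntactic phenomena?
RQ3: Do downstream performance gains grow in proportion to the increase in pretraining data size?
RQ4: What are the financial and environmental costs of pretraining on more data, and do the performance gains justify them?
RQ5: Is there a correlation between perplexity and syntactic generalization performance?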
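
A question like RQ5 is commonly checked with a correlation coefficient computed over per-model pairs of (perplexity, syntactic score). The sketch below implements the Pearson coefficient in plain Python. It is one possible way to quantify the relationship, not necessarily the analysis the authors ran, and the function name is an assumption made for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near -1 would mean that lower perplexity goes together with higher syntactic scores, while a value near 0 would indicate no clear linear relationship. With only a handful of models, any such coefficient should be read cautiously.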

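Main Findings

As measured by the structural probes of Hewitt and Manning, models pretrained on more data encode significantly more syntactic information.
Despite encoding more syntactic information, the model pretrained on the most data (1 billion words) scores lower than models pretrained on less data on the Gross Syntactic State test circuit.
On the downstream tasks (part-of-speech tagging, dependency parsing, paraphrase identification), gains are incremental: the 1-billion-word model improves on the 100-million-word model by only 0.5% to 2.02%.
Training the 1-billion-word model costs $20,000 and emits roughly 6,990 lbs of CO2, more than a transatlantic flight.
The cost-benefit analysis shows that the small performance gains from more pretraining data come with disproportionately high financial and environmental costs.
No clear correlation was found between perplexity and SyntaxGym scores, which indicates that lower perplexity does not guarantee better syntactic generalization.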
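
The cost-benefit reasoning in the findings above can be sketched as a marginal cost per point of improvement: the extra money spent on the more expensive model divided by the extra score it achieves. The helper names below are hypothetical, and the inputs are left as parameters because this summary does not report the training cost of the cheaper model.

```python
from typing import Optional

LB_TO_KG = 0.45359237  # exact definition of the international pound


def pounds_to_kg(pounds: float) -> float:
    """Convert a mass in pounds (e.g. CO2 emissions) to kilograms."""
    return pounds * LB_TO_KG


def marginal_cost_per_point(
    cost_small: float,
    cost_large: float,
    score_small: float,
    score_large: float,
) -> Optional[float]:
    """Extra cost paid for each extra score point gained.

    Returns None when the more expensive model does not score higher,
    since the ratio is not meaningful in that case.
    """
    gain = score_large - score_small
    if gain <= 0:
        return None
    return (cost_large - cost_small) / gain
```

For example, calling `marginal_cost_per_point` with the training costs and task scores of two models gives the price of each additional point, and a `None` result flags the cases where the cheaper model is at least as good.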


This review was generated by AI and reviewed by human editors.