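[Paper Walkthrough] Preferences for Idiomatic Language are Acquired Slowly -- and Forgotten Quickly: A Case Study on Swedish
The paper shows that language models' competence in idiomatic Swedish develops slowly during pretraining and is lost quickly after instruction tuning on translated data.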
In this study, we investigate how language models develop preferences for idiomatic as compared to linguistically acceptable Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train models on Swedish from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with Translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns for most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English -- the common approach for languages with little or no native instruction data -- causes models to rapidly lose their preference for idiomatic language.
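Motivation and Goals
- Examine how the acquisition of idiomaticity differs from that of linguistic acceptability, both during pretraining and when adapting a model from English to Swedish.
- Assess how continued pretraining, compared with training from scratch, affects idiomatic competence.
- Evaluate how instruction tuning on machine-translated data affects the preference for idiomatic Swedish.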
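Proposed Method
- Train 135M-parameter SmolLM models on Swedish data, both from scratch and through continued pretraining of English-pretrained models.
- Probe idiomaticity and linguistic acceptability with minimal-pair benchmarks, including new Swedish datasets for idioms and for the contrast with Translationese.
- Evaluate the models at multiple checkpoints, using per-token perplexity to determine which sentence in each minimal pair the model prefers.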
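The per-token perplexity comparison in the last step can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes token-level log-probabilities have already been obtained from some model, the function names are hypothetical, the tie-breaking rule (a tie counts as no preference) is an assumption, and the numbers in the example are made up.

```python
import math
from typing import Iterable, Sequence, Tuple


def per_token_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity normalized by sentence length: exp(-mean log-probability)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prefers_first(first: Sequence[float], second: Sequence[float]) -> bool:
    """True if the first sentence has strictly lower per-token perplexity.

    Assumption: a tie counts as 'no preference' and returns False.
    """
    return per_token_perplexity(first) < per_token_perplexity(second)


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Fraction of (preferred, dispreferred) pairs where the model agrees."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(prefers_first(good, bad) for good, bad in pairs)
    return correct / len(pairs)


# Made-up log-probabilities for two minimal pairs (idiomatic, variant).
example_pairs = [
    ([-1.0, -0.5, -0.8], [-1.0, -2.5, -0.8, -0.9]),  # idiomatic sentence wins
    ([-2.0, -2.0], [-1.0, -1.0, -1.0]),              # variant wins
]
accuracy = minimal_pair_accuracy(example_pairs)
```

Normalizing by the number of tokens matters because the two sentences in a pair may differ in length; comparing raw sentence log-probabilities would favor the shorter one. Running the same scoring at each checkpoint gives one accuracy value per benchmark per checkpoint.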
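Experimental Results
Research Questions
- RQ1: How quickly, and to what extent, do language models acquire a preference for idiomatic language compared with general linguistic acceptability?
- RQ2: How does instruction tuning on machine-translated data affect a model's preference for idiomatic language?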
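Key Findings
- Idiomatic competence is acquired more slowly than other abilities such as lexical and grammatical correctness.
- English pretraining improves overall performance and supports the gradual acquisition of idiomatic language.
- Instruction tuning on translated data markedly reduces the preference for idiomatic language, even while general linguistic acceptability stays relatively stable.
- The larger, stronger model (AI Sweden LLaMA 8B) shows greater gains in idiomatic competence under continued pretraining.
- The Translationese contrast is learned poorly and can degrade further under translation-based fine-tuning, which highlights how fragile idiomaticity is under this kind of tuning.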
This walkthrough was generated by AI and reviewed by a human editor.