QUICK REVIEW

[Paper Review] Variation is the Norm: Embracing Sociolinguistics in NLP

Anne-Marie Lutgen, Alistair Plum | arXiv (Cornell University) | Mar 25, 2026

Natural Language Processing Techniques | Cited by 0

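One-Sentence Summary

The paper proposes a sociolinguistic framework for embracing language variation in NLP and, by comparing standard, non-standard, and combined training data for Luxembourgish BERT models, shows that including orthographic variation in training improves fine-tuning performance on Luxembourgish NLP tasks.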

ABSTRACT

In Natural Language Processing (NLP), variation is typically seen as noise and "normalised away" before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore, we provide a possible solution to improve the performance by including variation in the fine-tuning process. This case study highlights the importance of including variation in the research setup, as models are currently not robust to occurring variation. Our framework facilitates the inclusion of variation in the thought-process while also being grounded in the theoretical framework of sociolinguistics.

Research Motivation and Objectives

Argue that language variation is a fundamental aspect of language and should inform NLP research.
Provide a framework combining sociolinguistic criteria with NLP modeling steps.
Demonstrate, via a Luxembourgish case study, how orthographic variation impacts model performance and how to leverage variation during fine-tuning.

Proposed Method

Propose a sociolinguistic NLP framework linking variation space with NLP modelling domains.
Define nine sociolinguistic criteria to describe linguistic entities (varieties/languages) in a sociolinguistic context.
Map five NLP modelling stages to sociolinguistic dimensions to analyze variation impact.
Conduct a case study using Luxembourgish with standard, non-standard, and combined training data.
Manipulate data through normalisation (standardising) and destandardisation (injecting variation) to assess effects on downstream tasks.
Evaluate using Luxembourgish-specific classification tasks and compare LuxemBERT and mBERT performance across variants.
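
The two data manipulations listed above, normalisation and destandardisation, together with the combined training setup, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `VARIANTS` lexicon is a toy example in English rather than Luxembourgish, and the function names, the `rate` parameter, and the token-level substitution strategy are hypothetical and not taken from the paper.

```python
import random

# Hypothetical toy lexicon: standard form -> known non-standard spellings.
# Placeholder entries only; not Luxembourgish data and not from the paper.
VARIANTS = {
    "colour": ["color", "culler"],
    "through": ["thru"],
}

# Reverse lookup used for normalisation: variant spelling -> standard form.
TO_STANDARD = {v: std for std, vs in VARIANTS.items() for v in vs}


def normalise(tokens):
    """Map every known variant spelling back to its standard form."""
    return [TO_STANDARD.get(t, t) for t in tokens]


def destandardise(tokens, rate, rng):
    """Replace standard forms with a random variant with probability `rate`."""
    out = []
    for t in tokens:
        if t in VARIANTS and rng.random() < rate:
            out.append(rng.choice(VARIANTS[t]))
        else:
            out.append(t)
    return out


def build_combined(standard_examples, rate=0.5, seed=0):
    """Pair each standard example with a destandardised copy (same label)."""
    rng = random.Random(seed)
    combined = []
    for tokens, label in standard_examples:
        combined.append((tokens, label))
        combined.append((destandardise(tokens, rate, rng), label))
    return combined
```

In this sketch the combined set simply doubles the standard data with a variant-injected copy; how the paper actually balances standard and non-standard examples is not stated in this summary.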

Figure 1: Illustration of the container metaphor for language and variety.

Experimental Results

Research Questions

RQ1: How does orthographic variation affect downstream NLP task performance for Luxembourgish?
RQ2: Can incorporating variation into training data (via combined standard and non-standard data) improve model robustness and accuracy?
RQ3: What are the effects of normalisation versus destandardisation on model fine-tuning for non-standard varieties?

Key Findings

Models trained on non-standard data often perform worst on standard or non-standard test sets.
A combined training setup that includes both standard and non-standard variation typically yields the best performance across standard, non-standard, and combined test sets, especially for sequence classification tasks.
Normalisation (standardising) shows limited impact on improving downstream performance compared to incorporating variation in training data.
Luxembourgish pre-trained models (LuxemBERT) generally outperform multilingual models (mBERT) on the evaluated tasks, highlighting benefits of in-language pre-training.
Incorporating variation in training data can yield improvements beyond normalisation, capturing social meaning embedded in variants.
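
The findings above compare each training condition against each test condition. A minimal sketch of such a train/test grid follows, assuming a generic `fit` callable that returns a prediction function. The names `cross_evaluate`, `accuracy`, and `fit_majority` are hypothetical, and the majority-label baseline is only a stand-in for a fine-tuned model such as LuxemBERT or mBERT.

```python
from collections import Counter


def accuracy(predict, examples):
    """Fraction of (text, label) examples that `predict` labels correctly."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def cross_evaluate(train_sets, test_sets, fit):
    """Fit one model per training condition and score it on every test set."""
    grid = {}
    for train_name, train_data in train_sets.items():
        predict = fit(train_data)
        for test_name, test_data in test_sets.items():
            grid[(train_name, test_name)] = accuracy(predict, test_data)
    return grid


def fit_majority(train_data):
    """Placeholder model: always predicts the most frequent training label."""
    majority = Counter(label for _, label in train_data).most_common(1)[0][0]
    return lambda text: majority
```

Calling `cross_evaluate` with training sets keyed as standard, non-standard, and combined, and test sets keyed the same way, would yield one accuracy value per pair, which is the shape of comparison the findings describe.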

This review was generated by AI and reviewed by a human editor.