QUICK REVIEW

[Paper Review] Variation is the Norm: Embracing Sociolinguistics in NLP

Anne-Marie Lutgen, Alistair Plum | arXiv (Cornell University) | 2026-03-25

Natural Language Processing Techniques | Citations: 0

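One-Line Summary

This paper presents a sociolinguistic framework for embracing language variation in NLP and shows that including orthographic variation improves fine-tuning performance on Luxembourgish NLP tasks. It compares standard, non-standard, and combined training data using a Luxembourgish BERT model.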

ABSTRACT

In Natural Language Processing (NLP), variation is typically seen as noise and "normalised away" before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore, we provide a possible solution to improve the performance by including variation in the fine-tuning process. This case study highlights the importance of including variation in the research setup, as models are currently not robust to occurring variation. Our framework facilitates the inclusion of variation in the thought-process while also being grounded in the theoretical framework of sociolinguistics.

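Motivation and Objectives

Argues that language variation is a fundamental property of language and should be reflected in NLP research.
Provides a framework that combines sociolinguistic criteria with the stages of NLP modelling.
Uses a Luxembourgish case study to show how orthographic variation affects model performance and how variation can be used during fine-tuning.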

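Proposed Method

Proposes a sociolinguistic NLP framework that links the space of variation to the NLP modelling domain.
Defines nine sociolinguistic criteria for describing a linguistic entity (a variety or a language) in its sociolinguistic context.
Maps five NLP modelling stages onto the sociolinguistic dimensions to analyse the impact of variation.
Conducts a case study on Luxembourgish that includes both standard and non-standard training data.
Manipulates the data through normalisation (standardisation) and de-standardisation (variation injection), then evaluates the effect on downstream tasks.
Evaluates on Luxembourgish-specific classification tasks and compares the performance of LuxemBERT and mBERT across the variants.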
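
The data manipulation described above (normalising variants to the standard, injecting variants into standard text, and combining both) can be illustrated with a minimal sketch. This is not the authors' pipeline: the lexicon, the function names, and the `rate` parameter are all hypothetical, and the toy entries are English stand-ins rather than Luxembourgish data from the paper.

```python
import random

# Hypothetical toy lexicon: standard spelling -> non-standard variants.
# English stand-ins for illustration only, not data from the paper.
STANDARD_TO_VARIANTS = {
    "tonight": ["tonite"],
    "though": ["tho"],
    "because": ["cuz", "bc"],
}

# Reverse lookup used for normalisation: variant -> standard spelling.
VARIANT_TO_STANDARD = {
    variant: standard
    for standard, variants in STANDARD_TO_VARIANTS.items()
    for variant in variants
}


def normalise(tokens):
    """Map every known variant back to its standard spelling."""
    return [VARIANT_TO_STANDARD.get(tok, tok) for tok in tokens]


def inject_variation(tokens, rate, rng):
    """Swap each standard token that has variants with probability `rate`."""
    out = []
    for tok in tokens:
        variants = STANDARD_TO_VARIANTS.get(tok)
        if variants and rng.random() < rate:
            out.append(rng.choice(variants))
        else:
            out.append(tok)
    return out


def build_training_sets(examples, rate=1.0, seed=0):
    """Return standard, non-standard, and combined versions of labelled examples.

    `examples` is a list of (tokens, label) pairs; labels are left untouched.
    """
    rng = random.Random(seed)
    standard = [(normalise(toks), label) for toks, label in examples]
    non_standard = [
        (inject_variation(toks, rate, rng), label) for toks, label in standard
    ]
    return {
        "standard": standard,
        "non_standard": non_standard,
        "combined": standard + non_standard,
    }
```

Calling `build_training_sets([(["cuz", "tonight"], 1)])` would yield one normalised example, one variant-injected example, and a combined set containing both. A lexicon lookup is only one possible way to model orthographic variation; the review does not say how the paper's variants were produced.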

Figure 1: Illustration of the container metaphor for language and variety.

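Experimental Results

Research Questions

RQ1: How does orthographic variation affect performance on downstream NLP tasks in Luxembourgish?
RQ2: Can introducing variation by combining standard and non-standard training data improve model robustness and accuracy?
RQ3: What difference do normalisation and de-normalisation of non-standard variants make to model fine-tuning?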

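Key Findings

Models trained on non-standard data often perform worst on the standard or non-standard test sets.
A combined training configuration that includes both standard and non-standard variation generally performs best across the standard, non-standard, and combined test sets, most notably on sequence classification tasks.
Normalisation (standardisation) has a limited effect on downstream performance compared with introducing variation into the training data.
The Luxembourgish pre-trained model (LuxemBERT) generally outperforms the multilingual model (mBERT) on the evaluated tasks, underlining the benefit of in-language pre-training.
Introducing variation into the training data can yield improvements beyond those of normalisation and captures the social meaning inherent in the variants.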
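
The findings above compare every training configuration against every test set. A minimal sketch of such a cross-evaluation grid follows. The token-vote classifier is a deliberately simple stand-in for fine-tuning LuxemBERT or mBERT, and every function name here is hypothetical; the numbers it produces on toy data say nothing about the paper's results.

```python
from collections import Counter, defaultdict


def train_token_vote(examples):
    """Toy stand-in for fine-tuning: count token/label co-occurrences."""
    votes = defaultdict(Counter)
    label_counts = Counter()
    for tokens, label in examples:
        label_counts[label] += 1
        for tok in tokens:
            votes[tok][label] += 1
    # Fallback label when no token of the input was seen in training.
    default = label_counts.most_common(1)[0][0]

    def predict(tokens):
        tally = Counter()
        for tok in tokens:
            tally.update(votes.get(tok, {}))
        return tally.most_common(1)[0][0] if tally else default

    return predict


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(predict(tokens) == label for tokens, label in examples)
    return correct / len(examples)


def evaluate_grid(train_sets, test_sets, train_fn=train_token_vote):
    """Train once per training configuration and score on every test set."""
    grid = {}
    for train_name, train_examples in train_sets.items():
        model = train_fn(train_examples)
        for test_name, test_examples in test_sets.items():
            grid[(train_name, test_name)] = accuracy(model, test_examples)
    return grid
```

The returned dictionary is keyed by `(train_name, test_name)`, so a 3 x 3 table (standard, non-standard, combined) can be read off directly. Swapping `train_fn` for a real fine-tuning routine would keep the grid logic unchanged.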

This review was generated by AI and checked by a human editor.