QUICK REVIEW

[Paper Review] From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

Abdulmuizz Khalak, Abderrahmane Issam | arXiv (Cornell University) | 2026. 02. 10.

Natural Language Processing Techniques | Citations: 0

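One-Line Summary

This paper investigates cross-lingual transfer between Modern Standard Arabic (MSA) and its dialects using probing together with Representational Similarity Analysis (RSA) and Centered Kernel Alignment (CKA). It finds that transfer is possible but uneven across dialects, and that it is influenced by geographic proximity and pretraining data size.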

ABSTRACT

Arabic Language Models (LMs) are pretrained predominately on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since its dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on 3 Natural Language Processing (NLP) Tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This questions their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.

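Motivation and Objectives

Diglossia and the diversity of dialects motivate the study of cross-lingual transfer between MSA and Arabic dialects.
Evaluate transfer using probing on three NLP tasks (sentiment analysis (SA), named entity recognition (NER), and part-of-speech (POS) tagging) and measure representational similarity with CKA.
Assess how MSA-centric, mixed-dialect, and dialect-specific models perform across dialectal variation.
Investigate the causes of differences in transfer, including geographic proximity and the amount of pretraining data.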

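Proposed Method

Probe frozen layer-wise embeddings with a linear classifier to assess which linguistic features they encode.
Apply Centered Kernel Alignment (CKA) to quantify layer-wise representational similarity between MSA and dialect models.
Use parallel MADAR data to compute CKA between MSA and dialect encoders under several scenarios.
Introduce a geographic proximity proxy (using Yemen as the MSA anchor) and relate transfer to the dialect continuum.
Evaluate models on the POS, NER, and SA tasks using dialectal and MSA datasets.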
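
The geographic proximity proxy can be illustrated with a great-circle distance from an anchor location to each dialect's city. This is only a sketch under assumptions: the review does not say how the paper computes proximity, so the haversine formula, the choice of Sana'a as the point representing the Yemen anchor, the example cities, and the names `haversine_km` and `rank_by_proximity` are all hypothetical. Coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Assumption: Sana'a stands in for the Yemen anchor mentioned in the review.
ANCHOR = (15.37, 44.19)

# Assumption: illustrative cities only, with approximate coordinates.
CITIES = {
    "Cairo": (30.04, 31.24),
    "Rabat": (34.02, -6.84),
    "Doha": (25.29, 51.53),
}


def rank_by_proximity(anchor, cities):
    """Return (city, distance_km) pairs sorted from nearest to farthest."""
    dists = {
        name: haversine_km(anchor[0], anchor[1], lat, lon)
        for name, (lat, lon) in cities.items()
    }
    return sorted(dists.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    for city, km in rank_by_proximity(ANCHOR, CITIES):
        print(f"{city}: {km:.0f} km from anchor")
```

A ranking like this could then be compared against per-dialect probing scores or CKA values to check whether nearer dialects transfer better.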

Figure 1: Architecture of the probing classifier for the example sentence “The boy is eating the apple now.” Sentence representations pass through N layers, and each layer is probed using the classifier in Eq. 1.
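
The layer-wise probing setup in Figure 1 can be sketched as follows: one independent linear classifier is fit on the frozen representations of each layer, and its held-out accuracy indicates how much task information that layer encodes. This is a minimal sketch, not the authors' code. It assumes the classifier in Eq. 1 is a softmax linear layer (Eq. 1 is not reproduced in this review), uses synthetic arrays in place of real encoder outputs, and the names `train_linear_probe`, `probe_accuracy`, and `probe_all_layers` are hypothetical.

```python
import numpy as np


def train_linear_probe(feats, labels, n_classes, lr=0.1, epochs=300):
    """Fit softmax regression on frozen features; returns weights and bias."""
    n, d = feats.shape
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = feats @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b


def probe_accuracy(feats, labels, w, b):
    """Accuracy of a trained probe on the given features."""
    preds = (feats @ w + b).argmax(axis=1)
    return float((preds == labels).mean())


def probe_all_layers(train_layers, train_labels, test_layers, test_labels, n_classes):
    """Train one probe per layer; the encoder itself is never updated."""
    scores = []
    for tr, te in zip(train_layers, test_layers):
        w, b = train_linear_probe(tr, train_labels, n_classes)
        scores.append(probe_accuracy(te, test_labels, w, b))
    return scores


if __name__ == "__main__":
    # Synthetic stand-in for encoder outputs: layer 0 is pure noise,
    # layer 1 carries a class signal. Real use would extract hidden states.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=300)
    centers = rng.normal(size=(3, 8))
    noise_layer = rng.normal(size=(300, 8))
    signal_layer = centers[labels] + 0.5 * rng.normal(size=(300, 8))
    layers = [noise_layer, signal_layer]
    scores = probe_all_layers(
        [l[:200] for l in layers], labels[:200],
        [l[200:] for l in layers], labels[200:],
        n_classes=3,
    )
    print(scores)
```

Keeping the probe linear is the usual design choice here: a low-capacity classifier makes it more plausible that high accuracy reflects information already present in the representation rather than something the probe learned on its own.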

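Experimental Results

Research Questions

RQ1: To what extent do representations learned on MSA transfer to dialectal Arabic across the POS, NER, and SA tasks?
RQ2: Do dialect-specific models outperform general MSA-based models on their own dialect, and under what data conditions?
RQ3: How does representational similarity (CKA) between MSA and dialect models relate to transfer effectiveness?
RQ4: Does geographic proximity to MSA predict transferability and representational similarity?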

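Key Findings

MSA-centric models generally transfer well to dialects and, on some tasks, outperform dialect-specific models.
Dialect-specific models tend to outperform general models when sufficient dialect-specific pretraining data is available.
Transfer and representational similarity follow a dialect continuum aligned with geographic proximity, although data size moderates this effect.
Negative interference can arise in multi-dialect models, especially for high-resource dialects, which points to the limits of broad multi-dialect pretraining.
High CKA similarity does not guarantee functional transfer, highlighting a gap between structural similarity and task performance.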

Figure 2: Architecture of CKA for representation similarity. MADAR parallel sentences are encoded by MSA and DA encoders through N layers, and the resulting representations are compared using linear CKA (Eq. 2).
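
The comparison in Figure 2 can be sketched with the standard linear CKA formulation, CKA(X, Y) = ||YᵀX||²_F / (||XᵀX||_F · ||YᵀY||_F) on column-centered matrices. This is a sketch under assumptions: Eq. 2 is not reproduced in this review, so the code assumes it matches this common definition. Rows of the two matrices are assumed to be the same parallel sentences encoded by each model, random arrays stand in for real encoder outputs, and the names `linear_cka` and `layerwise_cka` are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_sentences, dim) matrices with aligned rows."""
    # Center each feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layerwise_cka(msa_layers, da_layers):
    """Compare layer i of the MSA encoder with layer i of the dialect encoder."""
    return [linear_cka(m, d) for m, d in zip(msa_layers, da_layers)]


if __name__ == "__main__":
    # Random stand-ins for per-layer sentence representations.
    rng = np.random.default_rng(0)
    msa = [rng.normal(size=(200, 16)) for _ in range(3)]
    # Layer 0: identical copy, layer 1: noisy copy, layer 2: unrelated.
    da = [
        msa[0].copy(),
        msa[1] + 0.5 * rng.normal(size=(200, 16)),
        rng.normal(size=(200, 16)),
    ]
    print(layerwise_cka(msa, da))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is why it can compare encoders whose hidden dimensions are not aligned with each other.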


This review was generated by AI and reviewed by a human editor.