[Paper Review] Polyglot: Distributed Word Representations for Multilingual NLP
This paper introduces Polyglot, a multilingual word embedding resource that trains distributed representations for 117 languages on their Wikipedia corpora. Trained without supervision, the embeddings are competitive with near state-of-the-art part-of-speech taggers in English, Danish, and Swedish when used as the sole features, and they preserve language-specific features such as letter case.
Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part of speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.
Motivation & Objective
- To develop a scalable, unsupervised method for learning multilingual word representations that require no expert linguistic knowledge.
- To address the bottleneck in multilingual NLP caused by the need for language-specific feature engineering and manual tuning.
- To create a publicly available, high-quality multilingual embedding resource that supports cross-lingual research and system development.
- To evaluate the utility of these embeddings on a standard NLP task (part-of-speech tagging) across diverse languages with varying resource levels.
- To investigate the linguistic and semantic properties captured by the embeddings, including syntactic and semantic analogies across languages.
Proposed method
- Training continuous distributed word embeddings on monolingual Wikipedia corpora for 117 languages, each with more than 10,000 articles, using a neural language model with a ranking objective in the style of Collobert et al.
- Preserving case for European languages (i.e., not lowercasing) to retain linguistic features, unlike prior English-focused approaches.
- Using a neural network that scores a fixed-size window of words and learns to rank each observed window above a corrupted copy whose middle word has been replaced at random, yielding a dense vector for each word.
- Leveraging optimizations in Theano to enable efficient training on large-scale corpora across multiple languages.
- Initializing a part-of-speech tagger with the pre-trained embeddings and fine-tuning on labeled data to evaluate feature utility.
- Evaluating performance on out-of-vocabulary (OOV) words by substituting them with a single <UNK> token, assessing robustness to OOV handling.
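The OOV substitution and embedding lookup described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `build_index`, `window_features`, and `half_window`, the `<PAD>` boundary token, and the toy two-dimensional vectors are all assumptions made for the example. It shows how a tagger input might be assembled from case-preserved tokens, with every unknown word mapped to the single `<UNK>` row.

```python
from typing import Dict, List

UNK = "<UNK>"  # single token standing in for every out-of-vocabulary word
PAD = "<PAD>"  # hypothetical padding token for sentence boundaries


def build_index(vocab: List[str]) -> Dict[str, int]:
    """Map each vocabulary word to a row index, reserving rows for UNK and PAD."""
    index = {UNK: 0, PAD: 1}
    for word in vocab:
        index.setdefault(word, len(index))
    return index


def window_features(
    tokens: List[str],
    index: Dict[str, int],
    embeddings: List[List[float]],
    half_window: int = 2,
) -> List[List[float]]:
    """For each token, concatenate the embeddings of the tokens in its window."""
    # Tokens are looked up as-is (no lowercasing); unknown words fall back to UNK.
    ids = [index.get(tok, index[UNK]) for tok in tokens]
    padded = [index[PAD]] * half_window + ids + [index[PAD]] * half_window
    features = []
    for i in range(len(tokens)):
        window = padded[i : i + 2 * half_window + 1]
        vector: List[float] = []
        for word_id in window:
            vector.extend(embeddings[word_id])
        features.append(vector)
    return features


index = build_index(["The", "the", "cat", "sat"])
# Toy 2-dimensional embeddings: row i is [i, -i] so lookups are easy to check.
embeddings = [[float(i), -float(i)] for i in range(len(index))]
features = window_features(["The", "cat", "purred"], index, embeddings, half_window=1)
```

Here "The" and "the" receive different rows because case is preserved, while "purred" is absent from the toy vocabulary and is looked up as `<UNK>`. In a real tagger these concatenated vectors would feed a classifier whose embedding rows are fine-tuned on labeled data.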
Experimental results
Research questions
- RQ1: Can unsupervised word embeddings trained on Wikipedia monolingual corpora achieve competitive performance on part-of-speech tagging across diverse languages without language-specific feature engineering?
- RQ2: To what extent do the learned embeddings capture meaningful semantic and syntactic relationships across multiple languages?
- RQ3: How does the performance of the embeddings vary with the size of the training corpus, particularly for low-resource languages?
- RQ4: What is the impact of preserving case sensitivity in embeddings for European languages compared to lowercasing strategies used in English-only models?
- RQ5: How effective are the embeddings as initialization features for downstream NLP tasks, especially in low-resource settings?
Key findings
- The Polyglot embeddings achieved competitive part-of-speech tagging accuracy, close to state-of-the-art systems in English, Danish, and Swedish, without language-specific tuning.
- In English, the model is reported to outperform the TnT tagger. The embedding vocabulary covered 98.06% of tokens and 79.73% of distinct words in the tagging data, and the pre-trained embeddings gave a 0.25% improvement over a randomly initialized tagger.
- For lower-resource languages such as Bulgarian and Slovene, the embeddings remained useful: the vocabulary covered 94.58% of tokens and 77.70% of distinct words in the Bulgarian data, and the randomly initialized baseline was 2.01% less accurate.
- German and Czech, despite lower Wikipedia article counts, achieved over 98.5% accuracy on known words, indicating robustness of the learned features even with limited data.
- The embeddings improved tagging performance across all languages, with the greatest gains observed in low-resource settings such as Slovene, where random initialization lowered accuracy by 2.68%.
- The vocabulary coverage of the embeddings on out-of-domain part-of-speech datasets varied by language, with English showing 98.06% token coverage and Slovene 95.33%, reflecting differences in domain shift and vocabulary overlap.
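The two coverage figures quoted above can be distinguished with a short sketch. It assumes that token coverage counts every running token and that word coverage counts each distinct word once; the function name `coverage` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Set, Tuple


def coverage(tokens: Iterable[str], vocab: Set[str]) -> Tuple[float, float]:
    """Return (token coverage, word coverage) of `vocab` over `tokens`.

    Token coverage counts every occurrence; word coverage counts each
    distinct word once, so frequent known words inflate only the former.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0, 0.0
    known_tokens = sum(1 for tok in tokens if tok in vocab)
    types = set(tokens)
    known_types = sum(1 for word in types if word in vocab)
    return known_tokens / len(tokens), known_types / len(types)


corpus = ["the", "cat", "saw", "the", "dog", "the"]
embedding_vocab = {"the", "cat"}
token_cov, word_cov = coverage(corpus, embedding_vocab)
```

In this toy corpus four of six tokens are known but only two of four distinct words are, which illustrates why token coverage can sit well above word coverage, as in the figures reported for English and Bulgarian.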
This review was created by AI and reviewed by human editors.