[Paper Review] Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing

Milan Straka, Jana Straková | arXiv (Cornell University) | Aug 20, 2019
Natural Language Processing Techniques | 14 references | 38 citations
TL;DR

This work compares BERT, Flair, and ELMo contextualized embeddings across 54 languages (89 UD 2.3 corpora) as inputs to UDPipe 2.0, achieving state-of-the-art results and detailing how embeddings complement traditional word and character features.

ABSTRACT

We present an extensive evaluation of three recently proposed methods for contextualized embeddings on 89 corpora in 54 languages of the Universal Dependencies 2.3 in three tasks: POS tagging, lemmatization, and dependency parsing. Employing the BERT, Flair and ELMo as pretrained embedding inputs in a strong baseline of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task and an overall winner of the EPE 2018, we present a one-to-one comparison of the three contextualized word embedding methods, as well as a comparison with word2vec-like pretrained embeddings and with end-to-end character-level word embeddings. We report state-of-the-art results in all three tasks as compared to results on UD 2.2 in the CoNLL 2018 Shared Task.

Motivation & Objective

  • Assess the effectiveness of three contextualized embedding methods (BERT, Flair, ELMo) as additional inputs to a strong multilingual parsing system.
  • Perform a one-to-one comparison of the three embedding approaches across 89 UD 2.3 treebanks in 54 languages.
  • Compare contextualized embeddings with traditional word2vec-like embeddings and end-to-end character-level word embeddings.
  • Examine how the gains vary with the resources available for each language, and analyze whether multilingual or language-specific BERT models yield better performance.
  • Report state-of-the-art results relative to UD 2.2 and document performance on UD 2.3.

Proposed method

  • Use UDPipe 2.0 as a strong baseline system for POS tagging, lemmatization, and dependency parsing.
  • Supply three contextualized representations (BERT, Flair, ELMo) as additional word-level inputs; where a model operates on subwords or exposes several layers, average the subword and final-layer outputs to obtain one embedding per word.
  • Compare against baselines of FastText word embeddings (WE) and character-level word embeddings (CLE).
  • Experiment with multilingual and language-specific BERT models, and with Flair and ELMo where available.
  • Evaluate on UD 2.3 treebanks (89 corpora, 54 languages) and report macro-averaged results where multiple treebanks exist.
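The subword-averaging and feature-combination steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `average_subwords` and `concatenate_features`, the `word_ids` alignment format, and the toy vectors are all hypothetical.

```python
from collections import defaultdict


def average_subwords(subword_vectors, word_ids):
    """Average the vectors of all subwords that belong to the same word.

    subword_vectors: list of equal-length lists of floats, one per subword.
    word_ids: word_ids[i] is the index of the word that subword i belongs to
              (a hypothetical alignment format chosen for this sketch).
    Returns one averaged vector per word, ordered by word index.
    """
    if len(subword_vectors) != len(word_ids):
        raise ValueError("need exactly one word id per subword vector")

    # Group subword vectors by the word they came from.
    groups = defaultdict(list)
    for vector, word_id in zip(subword_vectors, word_ids):
        groups[word_id].append(vector)

    word_vectors = []
    for word_id in sorted(groups):
        vectors = groups[word_id]
        dim = len(vectors[0])
        # Component-wise mean over the subwords of this word.
        word_vectors.append(
            [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        )
    return word_vectors


def concatenate_features(*feature_vectors):
    """Concatenate per-word feature vectors (e.g. WE, CLE, BERT) into one input."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined


# Toy example: the first word is split into two subwords, the second is not.
subwords = [[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]]
words = average_subwords(subwords, word_ids=[0, 0, 1])
```

Averaging keeps the embedding dimension fixed no matter how many subwords a word is split into, so the result can be concatenated with the other per-word features before it reaches the tagger or parser.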

Experimental results

Research questions

  • RQ1: How do BERT, Flair, and ELMo contextualized embeddings compare when used as inputs to UDPipe 2.0 across many languages and tasks (POS tagging, lemmatization, dependency parsing)?
  • RQ2: Do contextualized embeddings provide complementary information to word embeddings and character-level features, and how does combining them affect performance?
  • RQ3: Are multilingual BERT models nearly as effective as language-specific ones, and how does performance vary by language and data availability?
  • RQ4: What is the relative impact of contextualized embeddings on UPOS, XPOS, morphological features, lemmas, UAS, LAS, MLAS, and BLEX across UD 2.3?
  • RQ5: What are the best-performing configurations (embedding combinations) for achieving state-of-the-art results on UD 2.3 tasks?
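For readers unfamiliar with the parsing metrics named in RQ4, the sketch below computes UAS (unlabeled attachment score: the share of words with the correct head) and LAS (labeled attachment score: correct head and correct relation label). It is a simplified illustration that assumes gold and predicted trees share the same tokenization; the official shared-task scorer additionally aligns system and gold tokens and reports F1. The function name `attachment_scores` and the toy sentence are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for one or more sentences flattened into word lists.

    gold, predicted: lists of (head_index, relation_label) tuples, one per word.
    Assumes both lists cover exactly the same words in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of words")
    if not gold:
        raise ValueError("cannot score an empty word list")

    # UAS counts a word as correct when its head matches.
    uas_hits = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # LAS additionally requires the dependency label to match.
    las_hits = sum(1 for g, p in zip(gold, predicted) if g == p)

    total = len(gold)
    return uas_hits / total, las_hits / total


# Toy four-word sentence: one wrong label, one wrong head.
gold_tree = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred_tree = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold_tree, pred_tree)
```

A word with a wrong label but the right head still counts toward UAS, which is why LAS can never exceed UAS.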

Key findings

  • Adding contextualized embeddings as inputs to UDPipe 2.0 yields substantial performance gains across languages and tasks.
  • BERT embeddings provide the largest improvements, achieving state-of-the-art results in UD Shared Task-style evaluation and offering the most complementary information to WE and CLE.
  • Flair embeddings capture morphological and orthographical information, performing well on POS tagging and lemmatization but less so on dependency parsing compared to BERT.
  • ELMo embeddings (English only) perform strongly on English treebanks, especially for morphology, but generally lag behind BERT in parsing; combining ELMo with WE/CLE can still be beneficial for certain metrics.
  • Combining WE+CLE+BERT (and Flair where available) yields the best overall results, with relative error reductions of up to 16.9% for UPOS and 14.5% for parsing, and smaller gains for the other metrics.
  • Multilingual BERT often matches language-specific BERT performance, particularly for English, and benefits from larger pretraining data.
  • On UD 2.3, BERT+Flair+WE+CLE achieves the strongest results in many settings, with language-specific nuances: some languages absent from BERT's training data still benefit from multilingual BERT.
  • Averaged over the 89 UD 2.3 treebanks, the results show notable gains for UPOS, UAS, and LAS, while lemmatization results are mixed and depend on the language and the embeddings used.
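The findings above are stated as relative error reductions and as averages over treebanks. The sketch below shows how such figures are conventionally computed, assuming scores are percentages in the range 0 to 100. The function names and the example scores are hypothetical and are not results from the paper.

```python
def relative_error_reduction(baseline_score, new_score):
    """Relative error reduction in percent, for scores given in percent.

    The error is taken as 100 minus the score, so moving from 90.0 to 92.0
    removes 2 of the 10 remaining error points.
    """
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return 100.0 * (baseline_error - new_error) / baseline_error


def macro_average(scores):
    """Unweighted mean over treebanks: each treebank counts equally,
    regardless of how many words it contains."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# Hypothetical scores, for illustration only.
reduction = relative_error_reduction(90.0, 92.0)
average = macro_average([80.0, 90.0, 100.0])
```

Relative error reduction makes gains comparable across metrics with very different baselines: a two-point gain from 90 is a larger share of the remaining error than a two-point gain from 70.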

This review was created by AI and reviewed by human editors.