QUICK REVIEW

[Paper Review] When Does Unsupervised Machine Translation Work?

Kelly Marchisio, Kevin Duh | arXiv (Cornell University) | Apr 12, 2020

Natural Language Processing Techniques | 46 references | 33 citations

TL;DR

The paper empirically evaluates unsupervised MT on dissimilar languages, mismatched domains, and diverse datasets. It finds strong results only when the languages and domains are closely related, and it documents training instability and other failure points.

ABSTRACT

Despite the reported success of unsupervised machine translation (MT), the field has yet to examine the conditions under which these methods succeed, and where they fail. We conduct an extensive empirical evaluation of unsupervised MT using dissimilar language pairs, dissimilar domains, diverse datasets, and authentic low-resource languages. We find that performance rapidly deteriorates when source and target corpora are from different domains, and that random word embedding initialization can dramatically affect downstream translation performance. We additionally find that unsupervised MT performance declines when source and target languages use different scripts, and observe very poor performance on authentic low-resource language pairs. We advocate for extensive empirical evaluation of unsupervised MT systems to highlight failure points and encourage continued research on the most promising paradigms.

Motivation & Objective

Assess how unsupervised MT performs when source and target languages are dissimilar.
Evaluate the impact of domain mismatch between monolingual corpora on unsupervised MT.
Test robustness across diverse datasets and low-resource language scenarios.
Highlight failure modes and provide data for stress-testing unsupervised MT systems.

Proposed method

Replicate the Artetxe et al. unsupervised MT pipeline, which starts from monolingual corpora and builds cross-lingual embeddings.
Map the monolingual embedding spaces into a shared space with VecMap and induce a bilingual lexicon using cross-domain similarity local scaling (CSLS).
Build an initial phrase-based SMT system from the embedding-derived translations and refine it with backtranslation.
Integrate an NMT hybridization step with iterative backtranslation to combine SMT and NMT benefits.
Evaluate systems under varied data conditions including Parallel, Disjoint, and Different Domain setups across multiple language pairs and datasets.
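
The lexicon-induction step above relies on a cross-domain similarity measure. One widely used measure of this kind is cross-domain similarity local scaling (CSLS), which penalizes "hub" words that are close to everything. The following is a minimal NumPy sketch, assuming the two embedding matrices have already been mapped into a shared space. It is an illustration only, not VecMap's actual code; the function name csls_lexicon and the parameter k are hypothetical.

```python
import numpy as np

def csls_lexicon(src_emb, tgt_emb, k=10):
    """Return, for each source word, the index of its best target word under CSLS.

    src_emb: (n_src, d) array of source embeddings in the shared space.
    tgt_emb: (n_tgt, d) array of target embeddings in the shared space.
    k: neighbourhood size used to estimate local density (illustrative default).
    """
    # Length-normalize rows so that dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape (n_src, n_tgt)

    # Clamp k so the sketch also works on tiny vocabularies.
    k_src = min(k, sim.shape[1])
    k_tgt = min(k, sim.shape[0])

    # Mean similarity of each source word to its k nearest target words.
    r_src = np.sort(sim, axis=1)[:, -k_src:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = np.sort(sim, axis=0)[-k_tgt:, :].mean(axis=0)

    # CSLS(x, y) = 2 * cos(x, y) - r_src(x) - r_tgt(y)
    csls = 2 * sim - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)
```

Calling csls_lexicon(src, tgt) on two aligned embedding matrices returns one target index per source word. A real pipeline would typically keep several candidates with scores rather than a single argmax.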

Experimental results

Research questions

RQ1: Can unsupervised MT work for dissimilar languages (different scripts and language families)?
RQ2: How does domain mismatch between source and target monolingual corpora affect translation quality?
RQ3: Does performance hold across diverse datasets and authentic low-resource language pairs?
RQ4: What are the stability and reliability issues in training unsupervised MT systems under realistic data conditions?

Key findings

Unsupervised MT performance deteriorates rapidly when source and target corpora come from different domains.
Stochasticity in embedding training can dramatically affect bilingual lexicon induction and downstream translation performance.
Unsupervised MT is more challenging for dissimilar language pairs, with larger BLEU gaps observed for Ru-En compared to Fr-En.
Domain mismatch between training corpora and test data can yield very low BLEU scores (e.g., 0.7 for Ru-En in Diff. Dom. conditions).
Authentic low-resource pairs (Sinhala-English, Nepali-English) show extremely poor unsupervised MT performance without supplemental data.
Training stability is variable across runs, with significant downstream impact from initial embedding space configurations.

This review was created by AI and reviewed by human editors.