QUICK REVIEW

[Paper Review] End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages

João Paulo de Souza Aires, Dušan Variš | arXiv (Cornell University) | Jan 1, 2021

Natural Language Processing Techniques | 32 references | 11 citations

TL;DR

This paper proposes a neural machine translation approach that trains on lemmatized target constraints so that the model can produce correctly inflected forms in morphologically rich languages such as Czech. By conditioning the Transformer on lemmatized terms alongside the source sentence, the method reduces agreement errors: it reaches 77.1% constraint surface-form coverage and, in the manual analysis, eliminates the inflection errors seen in the baseline models.

ABSTRACT

Lexically constrained machine translation allows the user to manipulate the output sentence by enforcing the presence or absence of certain words and phrases. Although current approaches can enforce terms to appear in the translation, they often struggle to make the constraint word form agree with the rest of the generated output. Our manual analysis shows that 46% of the errors in the output of a baseline constrained model for English to Czech translation are related to agreement. We investigate mechanisms to allow neural machine translation to infer the correct word inflection given lemmatized constraints. In particular, we focus on methods based on training the model with constraints provided as part of the input sequence. Our experiments on the English-Czech language pair show that this approach improves the translation of constrained terms in both automatic and manual evaluation by reducing errors in agreement. Our approach thus eliminates inflection errors, without introducing new errors or decreasing the overall quality of the translation.

Motivation & Objective

To address the problem of incorrect word inflection in lexically constrained neural machine translation for morphologically rich languages.
To improve constraint agreement with context by training models on lemmatized rather than surface-form constraints.
To eliminate inflection errors in constrained translation without sacrificing overall translation quality or introducing inference overhead.
To evaluate the effectiveness of lemmatized constraints in both synthetic and real-world terminology integration scenarios.

Proposed method

Train a Transformer-based NMT model using lemmatized target constraints concatenated to the source sentence during training.
Integrate lemmatized constraints as part of the input sequence to guide the model in generating contextually appropriate inflected forms.
Use the standard cross-entropy NMT training objective, allowing the model to learn inflection patterns end-to-end.
Compare integration methods: concatenating lemmatized constraints to the source sentence versus using input factors to annotate source tokens.
Evaluate on both synthetic test sets and real-world terminology integration tasks using the Europarl-Czech test set.
Leverage the model’s intrinsic language modeling capacity to infer correct surface forms without additional decoding mechanisms.
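
The concatenation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing code: the separator tokens `<sep>` and `<c>`, the function names, and the toy lemma table are all assumptions made for the example, and a real setup would use a proper Czech lemmatizer and whatever markers the training pipeline defines.

```python
from typing import Callable, Dict, List

# Hypothetical marker tokens; the markers used in the paper may differ.
SEP = "<sep>"           # separates the source sentence from the constraints
CONSTRAINT_SEP = "<c>"  # separates individual constraints from each other


def lemmatize_constraints(
    surface_forms: List[str], lemmatizer: Callable[[str], str]
) -> List[str]:
    """Lemmatize each (possibly multi-word) target constraint token by token."""
    return [
        " ".join(lemmatizer(token) for token in form.split())
        for form in surface_forms
    ]


def build_constrained_source(source: str, constraint_lemmas: List[str]) -> str:
    """Append lemmatized target constraints to the source sentence."""
    if not constraint_lemmas:
        return source
    return f"{source} {SEP} " + f" {CONSTRAINT_SEP} ".join(constraint_lemmas)


# Toy lemma table standing in for a real Czech lemmatizer (assumption).
TOY_LEMMAS: Dict[str, str] = {"kočku": "kočka", "černou": "černý"}


def toy_lemmatizer(token: str) -> str:
    return TOY_LEMMAS.get(token, token)


lemmas = lemmatize_constraints(["černou kočku"], toy_lemmatizer)
example = build_constrained_source("I saw a black cat .", lemmas)
print(example)  # I saw a black cat . <sep> černý kočka
```

The model only ever sees the lemmas on the input side, so producing the inflected form ("černou kočku" in this toy case) is left to the decoder, which is the behavior the training setup is meant to encourage.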

Experimental results

Research questions

RQ1: Can training with lemmatized constraints improve the accuracy of word inflection in constrained neural machine translation for morphologically rich languages?
RQ2: Does using lemmatized constraints reduce agreement errors compared to surface-form constraints?
RQ3: How does the performance of lemmatized constraint training compare to baseline constrained decoding methods in terms of fluency and constraint coverage?
RQ4: To what extent does the model’s ability to generate correct inflections depend on the integration method (concatenation vs. input factors)?
RQ5: Can lemmatized constraints improve translation of rare or domain-specific terms without introducing new errors?

Key findings

The lemmatized constraint model achieved 77.1% surface-form coverage on the Europarl test set, outperforming both the baseline (69.9%) and the surface-form model (44%).
Only 8% of examples marked as errors by automatic evaluation were actual errors in the lemmatized model, compared to 66% in the surface-form model, indicating that most errors were artifacts of reference-based evaluation.
The lemmatized model eliminated all inflection errors (0% of its errors were due to incorrect agreement), while the surface-form model had 46% of errors related to agreement.
The model reduced the number of incorrect word choices from 28 to 4 in the manual analysis, showing improved lexical accuracy.
The approach achieved high constraint coverage and correct inflection without introducing new errors or increasing inference cost.
The method was effective even in rare word translation using a bilingual dictionary, demonstrating robustness in low-resource terminology scenarios.
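
To make the coverage figures above more concrete, here is a rough sketch of how a surface-form coverage score could be computed. It assumes whitespace tokenization and exact token matching against the expected surface form of each constraint; the function names are hypothetical and the paper's actual evaluation script may differ.

```python
from typing import List, Sequence


def contains_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """Return True if `phrase` occurs as a contiguous token span in `tokens`."""
    n = len(phrase)
    if n == 0:
        return True
    return any(
        list(tokens[i:i + n]) == list(phrase)
        for i in range(len(tokens) - n + 1)
    )


def surface_form_coverage(
    hypotheses: List[str], constraints: List[List[str]]
) -> float:
    """Fraction of constraints whose expected surface form appears in the output."""
    total = 0
    matched = 0
    for hypothesis, sentence_constraints in zip(hypotheses, constraints):
        hyp_tokens = hypothesis.split()
        for constraint in sentence_constraints:
            total += 1
            if contains_phrase(hyp_tokens, constraint.split()):
                matched += 1
    return matched / total if total else 0.0


# Toy data: the second output uses a different inflection of the same lemma.
hyps = ["viděl jsem černou kočku .", "kočka spí ."]
cons = [["černou kočku"], ["kočku"]]
coverage = surface_form_coverage(hyps, cons)
print(coverage)  # 0.5
```

The second toy sentence counts as a miss even though it contains the right lemma in a different inflection. An exact-match metric of this kind cannot tell a genuine error from a valid alternative form, which is one way the reference-based evaluation artifacts mentioned in the findings can arise.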

This review was created by AI and reviewed by human editors.