[Paper Review] Unbabel's Participation in the WMT20 Metrics Shared Task
This paper presents Unbabel's participation in the WMT20 Metrics Shared Task using an enhanced COMET framework that leverages XLM-RoBERTa for cross-lingual sentence encoding. The authors introduce a multi-reference inference technique and a weighted averaging method for document-level scoring, achieving state-of-the-art or competitive performance across segment-level, document-level, system-level, and QE-as-a-metric tracks on multiple language pairs.
We present the contribution of the Unbabel team to the WMT 2020 Shared Task on Metrics. We intend to participate on the segment-level, document-level and system-level tracks on all language pairs, as well as the 'QE as a Metric' track. Accordingly, we illustrate results of our models in these tracks with reference to test sets from the previous year. Our submissions build upon the recently proposed COMET framework: We train several estimator models to regress on different human-generated quality scores and a novel ranking model trained on relative ranks obtained from Direct Assessments. We also propose a simple technique for converting segment-level predictions into a document-level score. Overall, our systems achieve strong results for all language pairs on previous test sets and in many cases set a new state-of-the-art.
Motivation & Objective
- To improve automatic machine translation evaluation by enhancing the COMET framework for segment-, document-, and system-level scoring.
- To investigate the impact of reference quality versus quantity in multi-reference MT evaluation.
- To develop a robust method for aggregating segment-level scores into a document-level metric.
- To optimize the use of pre-trained cross-lingual models for improved correlation with human judgments.
- To evaluate the effectiveness of ranking models and regressor models in diverse MT evaluation settings.
Proposed method
- Fine-tune XLM-RoBERTa-large as a cross-lingual encoder to generate contextual embeddings for source, hypothesis, and reference texts.
- Train estimator models to regress directly on human quality scores (e.g., Direct Assessment, HTER, MQM) using feed-forward regressors on pooled representations.
- Develop a novel ranking model (COMET-rank) trained on relative ranks from Direct Assessment data to compare MT outputs.
- Implement a multi-reference inference strategy that combines multiple references during inference to improve prediction robustness.
- Propose a weighted average technique to aggregate segment-level scores into a single document-level score.
- Apply layer-wise learning rate decay and freeze embedding layers to improve generalization across language pairs.
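The document-level aggregation step listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `document_score` and `length_weights` are hypothetical, and weighting each segment by its whitespace token count is an assumption, since this summary does not state how the weights are chosen.

```python
from typing import List, Sequence


def length_weights(hypotheses: Sequence[str]) -> List[int]:
    """Assumed weighting scheme: one weight per segment, equal to its
    whitespace-separated token count (not confirmed by the summary)."""
    return [len(h.split()) for h in hypotheses]


def document_score(segment_scores: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Weighted average of segment-level scores for one document."""
    if len(segment_scores) != len(weights):
        raise ValueError("need exactly one weight per segment score")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each segment contributes in proportion to its weight.
    return sum(s * w for s, w in zip(segment_scores, weights)) / total


if __name__ == "__main__":
    # Hypothetical segment-level predictions for a three-segment document.
    hyps = ["A short one .", "A much longer translated segment here .", "Ok ."]
    scores = [0.2, 0.8, 0.5]
    print(document_score(scores, length_weights(hyps)))
```

With uniform weights this reduces to a plain mean, so the weighting only matters when segments differ noticeably in length.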
Experimental results
Research questions
- RQ1: How does the inclusion of multiple references affect the performance of automatic MT evaluation metrics?
- RQ2: Does the quality of additional references matter more than their number in improving model correlation with human judgments?
- RQ3: Can a unified COMET framework effectively support segment-level, document-level, and system-level MT evaluation?
- RQ4: How do different pre-trained models and fine-tuning strategies impact the correlation with human quality scores?
- RQ5: What is the optimal method for combining segment-level predictions into a document-level score?
Key findings
- The proposed multi-reference inference technique improves Pearson correlation (r) to 0.455 on the en-de language pair when using a high-quality alternative reference.
- Using a single high-quality reference outperforms using multiple lower-quality references, suggesting reference quality is more critical than quantity.
- The Kendall’s Tau (τ) ranking correlation remained stable across different reference combinations, indicating that segment-level ranking performance is less sensitive to reference quality than regression performance.
- The system achieved state-of-the-art or competitive results across all tracks (segment-level, document-level, system-level, QE-as-a-metric) on multiple language pairs.
- The document-level scoring method based on weighted averaging of segment-level predictions proved effective and consistent across test sets.
- The COMET framework with XLM-RoBERTa-large and fine-tuned regressors outperformed existing metrics such as BERTScore, BLEURT, and Prism on the WMT19 test sets.
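The two correlation measures cited in these findings can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the shared task's official scoring script: the function names are hypothetical, and the Kendall's Tau-like variant assumes the input is a list of relative-ranking pairs, with metric ties counted as discordant.

```python
import math
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between metric scores and human scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_like(pairs: Sequence[Tuple[float, float]]) -> float:
    """Tau-like score: (concordant - discordant) / (concordant + discordant).

    Each pair is (metric score of the translation humans preferred,
    metric score of the other translation). Assumption: a metric tie
    counts as discordant.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(pearson_r([0.1, 0.4, 0.8], [20.0, 55.0, 90.0]))
    print(kendall_tau_like([(0.9, 0.1), (0.4, 0.6), (0.7, 0.2)]))
```

Pearson r depends on the actual score values, while the Tau-like score depends only on which translation in each pair is scored higher.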
This review was created by AI and reviewed by human editors.