[Paper Review] Large Language Models Are State-of-the-Art Evaluators of Translation Quality
The paper introduces GEMBA, a GPT-based metric for translation quality assessment that works with and without references, showing state-of-the-art system-level accuracy on WMT22 MQM data across three language pairs, using zero-shot prompts and various GPT models. It releases code and prompts for reproducibility.
We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate nine versions of GPT models, including ChatGPT and GPT-4. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.
Motivation & Objective
- Demonstrate that GPT-based prompts can accurately assess translation quality at the system level.
- Evaluate multiple GPT models and four prompt variants in reference-based and no-reference modes.
- Compare GEMBA against WMT22 metrics to establish state-of-the-art performance.
- Analyze segment- vs. system-level performance and model behavior across language pairs.
Proposed method
- Define GEMBA as a per-segment scoring mechanism that aggregates to a system-level score.
- Experiment with four prompt templates (DA, SQM, Stars, Classes) in two modes (with and without reference).
- Use nine GPT models, with GPT-4 as default, to produce zero-shot segment scores.
- Aggregate scores across segments to obtain system-level metrics.
- Evaluate against MQM-based human labels from WMT22 and compare to leading automatic metrics (e.g., COMET, BLEURT).
- Assess robustness, failure rates, and segment-level correlations (Kendall’s Tau).
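The per-segment scoring and aggregation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released GEMBA code: the prompt wording is a paraphrase of a DA-style instruction, the helper names (`build_da_prompt`, `parse_score`, `system_score`) are hypothetical, the model call itself is left out, and averaging segment scores is an assumed aggregation.

```python
import re
from statistics import mean
from typing import Iterable, Optional


def build_da_prompt(src_lang: str, tgt_lang: str, source: str,
                    hypothesis: str, reference: Optional[str] = None) -> str:
    """Build a DA-style zero-shot prompt.

    The wording is illustrative only; it is not the paper's exact template.
    Passing reference=None yields the no-reference variant.
    """
    if reference is None:
        task = f"Score the following translation from {src_lang} to {tgt_lang}"
    else:
        task = (f"Score the following translation from {src_lang} to {tgt_lang} "
                "with respect to the human reference")
    lines = [
        f"{task} on a continuous scale from 0 to 100, where 0 means "
        '"no meaning preserved" and 100 means "perfect meaning and grammar".',
        "",
        f'{src_lang} source: "{source}"',
    ]
    if reference is not None:
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines += [f'{tgt_lang} translation: "{hypothesis}"', "Score:"]
    return "\n".join(lines)


def parse_score(answer: str) -> Optional[float]:
    """Take the first number in the model's answer; None marks an invalid answer."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None


def system_score(answers: Iterable[str]) -> Optional[float]:
    """Aggregate valid segment scores into one system score (assumed: plain mean)."""
    scores = [s for s in (parse_score(a) for a in answers) if s is not None]
    return mean(scores) if scores else None
```

In this sketch an unparseable or out-of-range answer is dropped before averaging, which is one simple way to handle the invalid answers the review mentions; other handling (for example, re-querying the model) is equally plausible.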
Experimental results
Research questions
- RQ1: Can LLMs, via prompting, reliably assess translation quality without fine-tuning?
- RQ2: Which prompt templates and GPT models yield the best correlation with human MQM judgments?
- RQ3: Do reference-based and no-reference GEMBA variants achieve state-of-the-art performance on WMT22 data?
- RQ4: How do GEMBA's system-level results compare to existing metrics across language pairs?
- RQ5: What are the limitations and variability at segment-level versus system-level?
Key findings
- GEMBA with GPT-4 in the reference-based setting achieves state-of-the-art system-level accuracy on MQM 2022 data across en-de, en-ru, and zh-en.
- GEMBA with GPT-4 in the no-reference setting (quality estimation) yields the highest system-level performance among no-reference metrics, closely approaching reference-based GEMBA.
- Among four prompt variants, the least constrained Direct Assessment (DA) template performed best.
- GPT-3.5 and larger models are necessary for translation quality assessment; GPT-2 and Ada perform poorly or not at all.
- Segment-level correlations (Kendall’s Tau) are high for GPT-4 and Davinci-003 but still behind the top metrics; because GEMBA outputs discrete scores, ties between segments may affect Tau.
- GEMBA-DA and related prompts are robust, returning fewer than 1% invalid answers across prompts and models.
- The study provides publicly available code, prompts, and results for external validation and reproducibility.
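To make the two evaluation views in these findings concrete, the sketch below computes a system-level pairwise accuracy (the share of system pairs that a metric orders the same way as the human scores) and a simple Kendall-style correlation at segment level. It is an illustration under assumptions: the function names are hypothetical, human scores are assumed to be oriented so that higher is better, and the tau shown is a plain tau-a count in which tied pairs contribute nothing; the exact variant used in the paper may treat ties differently.

```python
from itertools import combinations
from typing import Dict, Sequence


def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of system pairs ranked in the same order by metric and humans."""
    pairs = list(combinations(sorted(metric), 2))
    agree = sum(
        1 for a, b in pairs
        if (metric[a] - metric[b]) * (human[a] - human[b]) > 0
    )
    return agree / len(pairs)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant - discordant) / all pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data, invented for illustration only.
human_segments = [1.0, 2.0, 3.0, 4.0]
fine_grained = [10.0, 20.0, 30.0, 40.0]   # same order, no ties
coarse = [50.0, 50.0, 90.0, 90.0]         # same order, but two tied pairs

tau_fine = kendall_tau_a(fine_grained, human_segments)
tau_coarse = kendall_tau_a(coarse, human_segments)
```

On this toy data the coarse scores never contradict the human ordering, yet their tau is lower than that of the fine-grained scores because the tied pairs cannot count as concordant. That is the kind of effect the note on discrete scoring refers to.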
This review was created by AI and reviewed by human editors.