QUICK REVIEW

[Paper Review] Large Language Models Are State-of-the-Art Evaluators of Translation Quality

Tom Kocmi, Christian Federmann|arXiv (Cornell University)|Feb 28, 2023

Topic Modeling | Cited by 108

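One-Sentence Summary

GEMBA is a GPT-based, zero-shot-prompted metric for translation quality that works with or without a reference translation, reaches state-of-the-art system-level accuracy on WMT22 MQM data for three language pairs across several GPT models, and comes with released code and prompts for reproducibility.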

ABSTRACT

We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate nine versions of GPT models, including ChatGPT and GPT-4. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22's Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

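Motivation and Goals

Show that GPT-based prompting can accurately assess translation quality at the system level.
Evaluate several GPT models and four prompt variants in reference-based and reference-free modes.
Compare GEMBA against the WMT22 metrics to establish state-of-the-art performance.
Analyze segment-level versus system-level performance, and model behavior across language pairs.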

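Proposed Method

Define GEMBA as a segment-by-segment scoring mechanism whose scores are aggregated into a system-level score.
Experiment with four prompt templates (DA, SQM, Stars, Classes) in two modes (with and without a reference).
Use nine GPT models, with GPT-4 as the default, to produce zero-shot segment scores.
Aggregate scores across segments to obtain the system-level metric.
Evaluate against WMT22 MQM human labels and compare with leading automatic metrics such as COMET and BLEURT.
Assess robustness, failure rate, and segment-level correlation (Kendall's Tau).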
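
The scoring pipeline listed above can be sketched in a few lines of Python. This is a minimal illustration, not the released GEMBA code: the template wording is a paraphrase written for this review, `ask_model` is a hypothetical stand-in for whatever GPT client is used, and averaging the segment scores is an assumption, since the summary only says that scores are aggregated.

```python
import re
from statistics import mean
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical DA-style template: an illustrative paraphrase,
# not the prompt text released by the authors.
DA_TEMPLATE = (
    'Score the following translation from {src_lang} to {tgt_lang} '
    'on a continuous scale from 0 to 100, where 0 means no meaning is '
    'preserved and 100 means perfect meaning and grammar.\n\n'
    '{src_lang} source: "{source}"\n'
    '{reference_block}'
    '{tgt_lang} translation: "{hypothesis}"\n'
    'Score: '
)


def build_prompt(source: str, hypothesis: str, src_lang: str, tgt_lang: str,
                 reference: Optional[str] = None) -> str:
    """Fill the template; the reference line is omitted in reference-free mode."""
    reference_block = ""
    if reference is not None:
        reference_block = f'{tgt_lang} human reference: "{reference}"\n'
    return DA_TEMPLATE.format(
        src_lang=src_lang, tgt_lang=tgt_lang, source=source,
        hypothesis=hypothesis, reference_block=reference_block,
    )


def parse_score(answer: str) -> Optional[float]:
    """Return the first number in the answer if it lies in [0, 100], else None."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    if match is None:
        return None
    score = float(match.group())
    return score if 0 <= score <= 100 else None


def system_score(segments: Iterable[Tuple[str, str, Optional[str]]],
                 ask_model: Callable[[str], str],
                 src_lang: str, tgt_lang: str) -> Optional[float]:
    """Score each (source, hypothesis, reference) segment and average the valid scores.

    Assumption: invalid answers are skipped and the system score is the mean.
    """
    scores = []
    for source, hypothesis, reference in segments:
        answer = ask_model(build_prompt(source, hypothesis, src_lang, tgt_lang, reference))
        score = parse_score(answer)
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else None
```

Passing `reference=None` for a segment switches that prompt to the reference-free mode, so the same function covers both settings. Counting how often `parse_score` returns `None` would give a rough failure rate of the kind the last step mentions.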

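Experimental Results

Research Questions

RQ1: Can LLMs, through prompting alone and without fine-tuning, reliably assess translation quality?
RQ2: Which prompt templates and GPT models correlate best with human MQM judgments?
RQ3: Do the reference-based and reference-free GEMBA variants reach state-of-the-art performance on WMT22 data?
RQ4: How do GEMBA's system-level results compare with existing metrics across language pairs?
RQ5: What are the limitations and the variability between the segment level and the system level?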
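
RQ3 and RQ4 hinge on how system-level accuracy is measured, which this summary does not define. One common choice, assumed here for illustration, is pairwise ranking accuracy: the fraction of system pairs that the metric orders the same way as the human scores. The sketch below assumes both score sets are oriented so that higher is better, and the function name and example systems are hypothetical.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric_scores: Dict[str, float],
                      human_scores: Dict[str, float]) -> float:
    """Fraction of system pairs ranked in the same order by metric and humans.

    Assumption: higher is better for both score sets; a tie on either side
    counts as a disagreement.
    """
    systems = sorted(set(metric_scores) & set(human_scores))
    agree = 0
    total = 0
    for a, b in combinations(systems, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        total += 1
        if metric_delta * human_delta > 0:  # same sign: same ordering
            agree += 1
    if total == 0:
        raise ValueError("need at least two systems scored by both sides")
    return agree / total


# Made-up scores for three hypothetical systems.
human = {"sysA": 0.9, "sysB": 0.5, "sysC": 0.1}
metric = {"sysA": 80.0, "sysB": 85.0, "sysC": 40.0}
accuracy = pairwise_accuracy(metric, human)  # the metric flips sysA and sysB
```

In this toy case the metric agrees with the human ordering on two of the three pairs.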

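Main Findings

In the reference-based setting, GEMBA with GPT-4 achieves state-of-the-art system-level accuracy on the MQM 2022 data across en-de, en-ru, and zh-en.
In the reference-free setting (quality estimation), GEMBA with GPT-4 achieves the highest system-level performance among reference-free metrics, close to that of reference-based GEMBA.
Of the four prompt variants, the least constrained template, Direct Assessment (DA), performs best.
Translation quality assessment requires GPT-3.5 or larger models; GPT-2 and Ada perform poorly or are nearly useless.
Segment-level correlation (Kendall's Tau) is comparatively high for GPT-4 and Davinci-003 but still trails the top metrics; discrete scores produce ties, which may depress Tau.
GEMBA-DA and the related prompts prove robust across the prompts and models tested, with an invalid-answer rate below 1%.
The study makes its code, prompts, and results publicly available for external validation and reproducibility.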
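
The point about ties and Kendall's Tau can be made concrete with a small sketch. The variant below is a simplified one chosen for illustration (it may differ from the exact formulation used in WMT22 or in the paper): pairs that the metric ties stay in the denominator but earn no credit, so coarse, discrete scores lower Tau even when the ordering is otherwise perfect. All data are made up.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(metric: Sequence[float], human: Sequence[float]) -> float:
    """Simplified Kendall's Tau: (concordant - discordant) / comparable pairs.

    Pairs tied in the human scores are skipped; pairs tied only in the
    metric count in the denominator but are neither concordant nor discordant.
    """
    concordant = 0
    discordant = 0
    pairs = 0
    for i, j in combinations(range(len(metric)), 2):
        human_delta = human[i] - human[j]
        if human_delta == 0:
            continue  # humans do not distinguish this pair
        pairs += 1
        product = (metric[i] - metric[j]) * human_delta
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if pairs == 0:
        raise ValueError("no comparable pairs")
    return (concordant - discordant) / pairs


human = [1, 2, 3, 4, 5, 6]
fine_grained = [10, 22, 31, 45, 52, 68]  # continuous-looking scores, same order
bucketed = [0, 25, 25, 50, 50, 75]       # same order, but with two tied pairs

tau_fine = kendall_tau(fine_grained, human)
tau_bucketed = kendall_tau(bucketed, human)
```

Here `tau_fine` is 1.0, while `tau_bucketed` drops to 13/15 purely because two of the fifteen pairs are tied, not because any pair is ordered wrongly.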

This review was generated by AI and checked by human editors.