QUICK REVIEW

[Paper Review] Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality

Carolina Scarton, Mikel L. Forcada|arXiv (Cornell University)|Nov 2, 2019

Natural Language Processing Techniques|References: 30|Cited by: 2

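One-sentence summary

This study evaluates metrics for estimating machine translation (MT) post-editing effort, comparing task-based metrics, human judgements (direct assessment, DA), and reference-based metrics. It finds that task-based metrics, which compare the machine-translated text with its post-edited version, track post-editing effort most accurately, followed by DA and then by reference-based metrics.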

ABSTRACT

Devising metrics to assess translation quality has always been at the core of machine translation (MT) research. Traditional automatic reference-based metrics, such as BLEU, have shown correlations with human judgements of adequacy and fluency and have been paramount for the advancement of MT system development. Crowd-sourcing has popularised and enabled the scalability of metrics based on human judgments, such as subjective direct assessments (DA) of adequacy, that are believed to be more reliable than reference-based automatic metrics. Finally, task-based measurements, such as post-editing time, are expected to provide a more detailed evaluation of the usefulness of translations for a specific task. Therefore, while DA averages adequacy judgements to obtain an appraisal of (perceived) quality independently of the task, and reference-based automatic metrics try to objectively estimate quality also in a task-independent way, task-based metrics are measurements obtained either during or after performing a specific task. In this paper we argue that, although expensive, task-based measurements are the most reliable when estimating MT quality in a specific task; in our case, this task is post-editing. To that end, we report experiments on a dataset with newly-collected post-editing indicators and show their usefulness when estimating post-editing effort. Our results show that task-based metrics comparing machine-translated and post-edited versions are the best at tracking post-editing effort, as expected. These metrics are followed by DA, and then by metrics comparing the machine-translated version and independent references. We suggest that MT practitioners should be aware of these differences and acknowledge their implications when deciding how to evaluate MT for post-editing purposes.

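Research motivation and objectives

Assess how reliably different MT quality metrics estimate post-editing effort.
Compare task-based metrics (post-editing time and effort), human judgements (direct assessment of translation adequacy), and reference-based automatic metrics (such as BLEU).
Determine which type of metric correlates best with the post-editing effort actually observed in a real-world translation task.
Give MT practitioners practical guidance on choosing an evaluation method for post-editing scenarios.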

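Proposed method

A new dataset was collected containing indicators recorded during post-editing, including time and effort measurements.
Task-based metrics were computed by comparing the machine-translated version with its post-edited version, quantifying the changes made.
Human judgements of MT adequacy were collected through direct assessment (DA).
Reference-based automatic metrics (such as BLEU) were computed against independent reference translations.
Each type of metric was correlated with the post-editing effort actually measured in the dataset.
The predictive power of each type of metric was evaluated statistically in order to rank the metrics by effectiveness.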
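
The difference between a task-based comparison (MT output against its own post-edited version) and a reference-based comparison (MT output against an independent reference) can be illustrated with a simple word-level edit rate. This is only a minimal sketch: the summary does not say which edit-based metric the authors used, and the function names and example sentences below are invented for illustration. The function is a plain token-level Levenshtein rate; metrics such as TER additionally allow block shifts, which are not modelled here.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Token-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis token
                            curr[j - 1] + 1,      # insert a target token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_rate(mt: str, target: str) -> float:
    """Edits needed to turn `mt` into `target`, per target token.

    Simplified stand-in for an edit-rate metric (no shift operation).
    """
    mt_tokens, target_tokens = mt.split(), target.split()
    if not target_tokens:
        raise ValueError("target must contain at least one token")
    return word_edit_distance(mt_tokens, target_tokens) / len(target_tokens)


# Hypothetical example sentences, not taken from the paper's dataset.
mt = "the cat sat in mat"
post_edited = "the cat sat on the mat"          # what the post-editor produced
reference = "a cat was sitting on the mat"      # independent human translation

pe_rate = edit_rate(mt, post_edited)   # task-based: 2 edits / 6 tokens
ref_rate = edit_rate(mt, reference)    # reference-based: 5 edits / 7 tokens
print(f"vs post-edit: {pe_rate:.3f}, vs reference: {ref_rate:.3f}")
# prints "vs post-edit: 0.333, vs reference: 0.714"
```

In this toy case the independent reference penalises wording choices that the post-editor left untouched, which is one intuition for why the two comparisons can diverge.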

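Experimental results

Research questions

RQ1: How well do task-based metrics correlate with actual post-editing effort?
RQ2: How do human direct assessment (DA) scores compare with task-based metrics when estimating post-editing effort?
RQ3: How do reference-based automatic metrics (such as BLEU) compare with task-based metrics and DA when predicting post-editing effort?
RQ4: Which type of metric gives the most reliable estimate of post-editing effort in a real translation scenario?

Main findings

Task-based metrics, which measure the difference between the machine-translated text and its post-edited version, correlate most strongly with actual post-editing effort.
Direct assessment (DA) of adequacy is the second-best predictor of post-editing effort, which suggests that human judgements remain valuable.
Despite their widespread use, reference-based automatic metrics (such as BLEU) perform worst at estimating post-editing effort.
The results support the view that task-based metrics, being tied to the task itself, are the most reliable way to assess MT quality in a post-editing setting.
The findings highlight the limitations of relying solely on reference-based metrics when evaluating MT for post-editing.
MT practitioners evaluating systems for post-editing workflows should prefer task-based metrics to obtain accurate effort estimates.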
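
An ordering of this kind comes from correlating each metric with measured effort and comparing the strength of the correlations. The sketch below shows one way to do that, assuming Pearson correlation and ranking by absolute value (since quality scores typically fall as effort rises). The summary does not state which correlation coefficient the authors used, and the numbers are invented toy values with neutral names, not results from the paper.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (norm_x * norm_y)


def rank_by_correlation(
    metrics: Dict[str, List[float]], effort: List[float]
) -> List[Tuple[str, float]]:
    """Sort metrics by |r| with effort, strongest first."""
    scores = {name: pearson(values, effort) for name, values in metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy per-segment values (hypothetical; NOT data from the paper).
effort_seconds = [10.0, 20.0, 30.0, 40.0, 50.0]
toy_metrics = {
    "metric_c": [2.0, 1.0, 3.0, 2.0, 3.0],
    "metric_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "metric_b": [5.0, 3.0, 4.0, 2.0, 1.0],  # a score that drops as effort rises
}

ranking = rank_by_correlation(toy_metrics, effort_seconds)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Ranking by absolute value keeps a strongly negative correlate (a quality score) on an equal footing with a strongly positive one (an edit rate); a rank-based coefficient such as Spearman's could be swapped in if the relationship is not expected to be linear.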

This review was generated by AI and checked by a human editor.