QUICK REVIEW

[Paper Review] GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels

Jianhao Yan, Pingchuan Yan|arXiv (Cornell University)|Jul 4, 2024

Natural Language Processing Techniques|Cited by 5

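One-Sentence Summary

GPT-4 matches junior translators in total error count but lags behind medium and senior translators; its performance varies by language and domain, and it tends toward literal translation.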

ABSTRACT

This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We also observe the imbalanced performance across different languages and domains, with GPT-4's translation capability gradually weakening from resource-rich to resource-poor directions. In addition, we qualitatively study the translation given by GPT-4 and human translators, and find that GPT-4 translator suffers from literal translations, but human translators sometimes overthink the background information. To our knowledge, this study is the first to evaluate LLMs against human translators and analyze the systematic differences between their outputs, providing valuable insights into the current state of LLM-based translation and its potential limitations.

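Research Motivation and Objectives

Evaluate GPT-4's translation quality against human translators of varying expertise levels across multiple language pairs and domains.
Calibrate translation performance across the range from resource-rich to resource-poor languages.
Identify the systematic differences and qualitative characteristics that separate LLM translation from human translation.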

Proposed Method

Use the MQM framework, with expert annotators labeling translation errors under blind evaluation.
Evaluate six language directions (Chinese↔English, Russian↔English, Chinese↔Hindi), and examine the effect of two domains (biomedical and technical) on Chinese↔English translation.
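Prompt GPT-4 with three candidate prompts and select the best one by COMET-QE score.
Include junior, medium, and senior human translators in the comparison; restrict their tools so that no machine-translation assistance is used.
Compute inter-annotator agreement with Cohen's Kappa and Krippendorff's Alpha to ensure annotation reliability.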
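
The agreement check in the last step can be illustrated with a minimal sketch of Cohen's Kappa for two annotators. This is not the authors' code: the function name `cohens_kappa` and the toy label lists are assumptions made up for illustration, and the labels are not data from the paper. Kappa compares the observed agreement p_o with the agreement p_e expected by chance from each annotator's label frequencies, as (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (one per translated segment), not from the paper.
annotator_1 = ["accuracy", "fluency", "none", "none", "style", "accuracy"]
annotator_2 = ["accuracy", "none", "none", "none", "style", "fluency"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa handles exactly two annotators with categorical labels; Krippendorff's Alpha, which the study also reports, generalizes to more annotators and missing labels and is not covered by this sketch.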


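Experimental Results

Research Questions

RQ1: How does GPT-4's translation quality compare with that of human translators of different expertise levels across multiple languages and domains?
RQ2: Are there systematic differences in error types and linguistic behavior between LLM translation and human translation?
RQ3: Does GPT-4's performance degrade from resource-rich to resource-poor language directions?
RQ4: Which qualitative characteristics distinguish GPT-4's translations from human translations (for example, literalness, over-reasoning, or hallucination)?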

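Key Findings

GPT-4 is comparable to junior translators in total error count but lags behind medium and senior translators.
GPT-4's performance declines from resource-rich to resource-poor directions: it does relatively well on Chinese↔English but poorly on Chinese↔Hindi.
Compared with human translators, GPT-4 translates more literally, with fewer additions and omissions, but shows inaccuracies in vocabulary, style, and grammar.
In the domain analysis, GPT-4 approaches medium-level translators in the technical and biomedical domains but is weaker in the general news domain because it lacks up-to-date knowledge of entities.
Qualitative case studies show that GPT-4 is better than human translators at avoiding invented content, while human translators sometimes over-interpret missing information.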
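
The "total error count" comparison in the findings above can be pictured as a simple tally over MQM-style annotations. The sketch below is only an assumption about how such a tally might be organized: the record layout, the `tally_errors` helper, and the system and category names are hypothetical, and the records are illustrative placeholders rather than results from the paper.

```python
from collections import defaultdict

# Hypothetical annotation records: (system, error_category, severity).
# Placeholders for illustration only, not data from the paper.
annotations = [
    ("gpt4", "accuracy/mistranslation", "major"),
    ("gpt4", "style/awkward", "minor"),
    ("junior", "accuracy/omission", "major"),
    ("junior", "fluency/grammar", "minor"),
    ("senior", "style/awkward", "minor"),
]


def tally_errors(records):
    """Count errors per system, in total and per top-level category."""
    totals = defaultdict(int)
    by_category = defaultdict(lambda: defaultdict(int))
    for system, category, _severity in records:
        totals[system] += 1
        # Keep only the top-level category, e.g. "accuracy".
        top_level = category.split("/")[0]
        by_category[system][top_level] += 1
    return dict(totals), {s: dict(c) for s, c in by_category.items()}


totals, by_category = tally_errors(annotations)
# List systems from fewest to most errors.
for system in sorted(totals, key=totals.get):
    print(system, totals[system], by_category[system])
```

Severity is carried in each record but ignored here, since the summary compares raw error counts; a severity-weighted score would be a separate design choice.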


This review was generated by AI and reviewed by human editors.