QUICK REVIEW

[Paper Review] GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels

Jianhao Yan, Pingchuan Yan|arXiv (Cornell University)|Jul 4, 2024

Natural Language Processing Techniques|Cited by 5

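One-Sentence Summary

GPT-4 matches junior translators in total error count but falls short of medium and senior translators; its performance varies by language and domain, and it tends toward literal translation.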

ABSTRACT

This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We also observe the imbalanced performance across different languages and domains, with GPT-4's translation capability gradually weakening from resource-rich to resource-poor directions. In addition, we qualitatively study the translation given by GPT-4 and human translators, and find that GPT-4 translator suffers from literal translations, but human translators sometimes overthink the background information. To our knowledge, this study is the first to evaluate LLMs against human translators and analyze the systematic differences between their outputs, providing valuable insights into the current state of LLM-based translation and its potential limitations.

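Motivation and Objectives

Compare GPT-4's translation quality with that of human translators at different expertise levels across multiple language pairs and domains.
Examine how translation performance changes from resource-rich to resource-poor language directions.
Identify the systematic differences and qualitative characteristics that distinguish LLM translation from human translation.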

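Method

Translation errors are annotated by expert annotators under blind conditions using the MQM (Multidimensional Quality Metrics) framework.
The evaluation covers six language directions (Chinese↔English, Russian↔English, and Chinese↔Hindi) and two domains (biomedical and technology) for Chinese↔English.
GPT-4 is given three candidate prompts, and the best one is selected by COMET-QE score.
Junior, medium, and senior human translators are included for comparison, and the aids available to them are restricted so that they do not rely on machine translation.
Annotation reliability is checked with Cohen's Kappa and Krippendorff's Alpha.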
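
As a rough illustration of the reliability check named above, the following is a minimal sketch of Cohen's Kappa for two annotators who label the same items. The function name `cohens_kappa` and the label lists are hypothetical and made up for this example; they are not the paper's code or data, and the paper's exact agreement protocol is not described in this review.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical error-category labels for six segments (illustration only).
annotator_1 = ["accuracy", "fluency", "none", "none", "accuracy", "style"]
annotator_2 = ["accuracy", "none", "none", "none", "accuracy", "style"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.76 for these made-up labels
```

Kappa corrects raw agreement (5 of 6 items here) for the agreement expected by chance, which is why it is lower than the raw rate.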

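Experimental Results

Research Questions

RQ1: How does GPT-4's translation quality compare with that of human translators at different expertise levels across multiple languages and domains?
RQ2: Are there systematic differences between LLM and human translation in error types and linguistic behavior?
RQ3: Does GPT-4's performance decline from resource-rich to resource-poor language directions?
RQ4: What qualitative characteristics (such as literalness, over-speculation, or hallucination) does GPT-4's translation show compared with human translation?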

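Key Findings

GPT-4 makes about as many total errors as junior translators but falls short of medium and senior translators.
GPT-4's performance declines from resource-rich to resource-poor language directions: it does relatively well on Chinese↔English but worse on Chinese↔Hindi.
GPT-4 translates more literally than humans; it makes fewer addition and omission errors, but inaccuracies in vocabulary, style, and grammar stand out.
In the domain analysis, GPT-4 is close to medium-level translators in the technology and biomedical domains but lags further behind in the general news domain because it lacks knowledge of recent named entities.
The qualitative case studies show that GPT-4 more often avoids imagined content than human translators do, whereas human translators sometimes over-interpret missing information.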
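
The total-error comparison in the findings above comes down to counting annotated errors per translator and per error category. The sketch below shows one way such a tally could be done. The record layout, the system names, and every record in `annotations` are toy values invented for illustration; they are not data or results from the paper.

```python
from collections import Counter

# Toy annotation records: (system, error_category, severity).
# All values are invented for illustration; none come from the paper.
annotations = [
    ("gpt4", "accuracy/mistranslation", "major"),
    ("gpt4", "style/awkward", "minor"),
    ("junior", "accuracy/addition", "minor"),
    ("junior", "fluency/grammar", "minor"),
    ("senior", "style/awkward", "minor"),
]


def total_errors(records):
    """Count annotated errors per system."""
    return Counter(system for system, _, _ in records)


def errors_by_category(records, system):
    """Count one system's errors by top-level category (text before '/')."""
    return Counter(
        category.split("/")[0]
        for record_system, category, _ in records
        if record_system == system
    )


totals = total_errors(annotations)
gpt4_categories = errors_by_category(annotations, "gpt4")
print(dict(totals))
print(dict(gpt4_categories))
```

A real MQM analysis would typically also weight errors by severity and normalize by the amount of text annotated; this sketch only counts.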

This review was generated by AI and checked by a human editor.