QUICK REVIEW

[Paper Review] Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality

Carolina Scarton, Mikel L. Forcada|arXiv (Cornell University)|Nov 2, 2019

Natural Language Processing Techniques | References: 30 | Citations: 2

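One-sentence summary

This study evaluates metrics for estimating post-editing effort in machine translation, comparing task-based metrics, human judgements (direct assessment, DA), and reference-based metrics. Task-based metrics that measure the difference between the machine translation and its post-edited version track post-editing effort most accurately, followed by direct assessment and then by reference-based metrics.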

ABSTRACT

Devising metrics to assess translation quality has always been at the core of machine translation (MT) research. Traditional automatic reference-based metrics, such as BLEU, have shown correlations with human judgements of adequacy and fluency and have been paramount for the advancement of MT system development. Crowd-sourcing has popularised and enabled the scalability of metrics based on human judgments, such as subjective direct assessments (DA) of adequacy, that are believed to be more reliable than reference-based automatic metrics. Finally, task-based measurements, such as post-editing time, are expected to provide a more detailed evaluation of the usefulness of translations for a specific task. Therefore, while DA averages adequacy judgements to obtain an appraisal of (perceived) quality independently of the task, and reference-based automatic metrics try to objectively estimate quality also in a task-independent way, task-based metrics are measurements obtained either during or after performing a specific task. In this paper we argue that, although expensive, task-based measurements are the most reliable when estimating MT quality in a specific task; in our case, this task is post-editing. To that end, we report experiments on a dataset with newly-collected post-editing indicators and show their usefulness when estimating post-editing effort. Our results show that task-based metrics comparing machine-translated and post-edited versions are the best at tracking post-editing effort, as expected. These metrics are followed by DA, and then by metrics comparing the machine-translated version and independent references. We suggest that MT practitioners should be aware of these differences and acknowledge their implications when deciding how to evaluate MT for post-editing purposes.

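Motivation and objectives

To assess how reliably different MT quality metrics estimate post-editing effort.
To compare task-based metrics (post-editing time and effort), human judgements (direct assessment of adequacy), and reference-based automatic metrics (e.g. BLEU).
To identify the type of metric that correlates best with the actual post-editing workload in real translation tasks.
To give MT practitioners practical guidance on choosing an appropriate evaluation method for post-editing scenarios.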

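Proposed method

Collected a new dataset of post-editing indicators, including time and effort measurements.
Applied task-based metrics that compare the machine-translated and post-edited versions to quantify the changes made during post-editing.
Used direct assessment (DA) to obtain human judgements of the adequacy of the machine translation.
Computed reference-based automatic metrics (e.g. BLEU) against independent reference translations.
Correlated each type of metric with the actual post-editing effort measured in the dataset.
Used statistical analysis to assess the predictive power of each metric type and to rank the types by effectiveness.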
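
To make the idea of a task-based metric concrete, the following is a minimal sketch of a word-level edit rate between a machine-translated sentence and its post-edited version. This review does not name the exact metrics the authors computed, so the sketch is an assumption: it is a simplified, TER-like measure that counts insertions, deletions, and substitutions but no block shifts, and it tokenises on whitespace only. The function names `word_edit_distance` and `edit_rate` are hypothetical.

```python
def word_edit_distance(source_tokens, target_tokens):
    """Levenshtein distance over tokens (insert, delete, substitute; no shifts)."""
    # prev[j] = edits needed to turn the tokens seen so far into target_tokens[:j]
    prev = list(range(len(target_tokens) + 1))
    for i, src in enumerate(source_tokens, start=1):
        curr = [i]
        for j, tgt in enumerate(target_tokens, start=1):
            cost = 0 if src == tgt else 1
            curr.append(min(
                prev[j] + 1,         # delete the source token
                curr[j - 1] + 1,     # insert the target token
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def edit_rate(mt, post_edited):
    """Edits needed to turn the MT output into the post-edited text,
    normalised by the length of the post-edited text (assumed convention)."""
    mt_tokens = mt.split()
    pe_tokens = post_edited.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else float("inf")
    return word_edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)
```

A score of 0.0 means the post-editor changed nothing; larger values mean more of the sentence was rewritten. Replacing `post_edited` with an independent reference translation turns the same computation into a reference-based score, which is the contrast the study draws.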

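Experimental results

Research questions

RQ1: How strongly do task-based metrics correlate with actual post-editing effort?
RQ2: How effective are human direct assessment (DA) scores at estimating post-editing effort, compared with task-based metrics?
RQ3: How effective are reference-based automatic metrics (e.g. BLEU) at predicting post-editing effort, compared with task-based metrics and DA?
RQ4: Which type of metric gives the most reliable estimate of post-editing workload in real translation settings?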

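Key findings

Task-based metrics that measure the difference between the machine translation and the post-edited text showed the strongest correlation with actual post-editing effort.
Direct assessment (DA) of adequacy was the second-best predictor of post-editing effort, which indicates that human judgement remains valuable.
Reference-based automatic metrics (e.g. BLEU), although widely used, were the least effective at estimating post-editing effort.
The study confirms that task-based metrics, being task-specific, are the most reliable way to assess MT quality in a post-editing context.
The results highlight the limitations of relying solely on reference-based metrics to evaluate MT for post-editing.
When evaluating systems for post-editing workflows, MT practitioners should prioritise task-based metrics to obtain accurate effort estimates.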
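
The ranking reported above comes from correlating each metric with measured post-editing effort. The sketch below shows one way such a comparison could be set up. It assumes Pearson's r as the correlation coefficient, which this review does not confirm, and the names `pearson_r` and `rank_metrics` are hypothetical. Metrics are ranked by absolute correlation because quality scores such as DA or BLEU are expected to fall as effort rises, while edit-based scores rise with it.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rank_metrics(metric_scores, effort):
    """Return (metric name, r) pairs sorted by |r| against effort, strongest first.

    metric_scores: dict mapping a metric name to its per-segment scores
    effort: per-segment effort measurements (e.g. post-editing time)
    """
    correlations = {
        name: pearson_r(scores, effort)
        for name, scores in metric_scores.items()
    }
    return sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
```

Calling `rank_metrics` with one score list per metric and a matching list of effort measurements returns the metrics in order of how closely they track effort. Any numbers passed in are the caller's own data; nothing here reproduces the paper's results.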

This review was generated by AI and checked by a human editor.