QUICK REVIEW

[Paper Review] Translationese in Machine Translation Evaluation

Yvette Graham, Barry Haddow|arXiv (Cornell University)|Jun 24, 2019

Natural Language Processing Techniques | References: 13 | Citations: 57

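One-Line Summary

This paper analyzes how translationese (the unusual features of translated text) affects machine translation evaluation and shows that reverse-created test data can bias results. It also re-evaluates a past claim of human-machine translation parity, analyzes the statistical power of the significance tests involved, and proposes a practical evaluation checklist.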

ABSTRACT

The term translationese has been used to describe the presence of unusual features of translated text. In this paper, we provide a detailed analysis of the adverse effects of translationese on machine translation evaluation results. Our analysis shows evidence to support differences in text originally written in a given language relative to translated text and this can potentially negatively impact the accuracy of machine translation evaluations. For this reason we recommend that reverse-created test data be omitted from future machine translation test sets. In addition, we provide a re-evaluation of a past high-profile machine translation evaluation claiming human-parity of MT, as well as analysis of the since re-evaluations of it. We find potential ways of improving the reliability of all three past evaluations. One important issue not previously considered is the statistical power of significance tests applied in past evaluations that aim to investigate human-parity of MT. Since the very aim of such evaluations is to reveal legitimate ties between human and MT systems, power analysis is of particular importance, where low power could result in claims of human parity that in fact simply correspond to Type II error. We therefore provide a detailed power analysis of tests used in such evaluations to provide an indication of a suitable minimum sample size of translations for such studies. Subsequently, since no past evaluation that aimed to investigate claims of human parity ticks all boxes in terms of accuracy and reliability, we rerun the evaluation of the systems claiming human parity. Finally, we provide a comprehensive check-list for future machine translation evaluation.

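Motivation and Objectives

Assess how translationese affects the results of both human and automatic machine translation evaluation.
Quantify the differences between forward test data (source text originally written in the source language) and reverse test data (source text that is itself a translation).
Re-evaluate past human-parity evaluations and analyze their sources of inaccuracy.
Analyze the statistical power of the significance tests used in human-parity evaluations of machine translation.
Provide a practical checklist to improve the reliability of future machine translation evaluations.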

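Proposed Method

Compare forward and reverse test data from past WMT evaluations using human evaluation (Direct Assessment, DA) and BLEU.
Compute forward and reverse scores for each system and analyze the differences between them.
Run a power analysis to estimate a suitable sample size for machine translation evaluation.
Rerun the human-parity evaluation of Hassan et al. (2018) with an updated methodology and a larger n.
Analyze sentence length and unigram precision, taking sentence-level variation into account, to examine how BLEU scores should be interpreted.
Present a checklist of considerations for designing future machine translation evaluations.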
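The forward-versus-reverse comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the record keys `system`, `orig_lang`, and `score`, and the function name `direction_means`, are hypothetical, and it assumes each test segment is tagged with the language in which it was originally written.

```python
from collections import defaultdict
from statistics import mean


def direction_means(records, source_lang):
    """Average per-system scores separately for forward and reverse segments.

    `records` is an iterable of dicts with the hypothetical keys
    'system', 'orig_lang' and 'score'. A segment counts as "forward"
    when it was originally written in `source_lang`, and as "reverse"
    when the source side is itself a translation.
    """
    buckets = defaultdict(lambda: {"forward": [], "reverse": []})
    for r in records:
        direction = "forward" if r["orig_lang"] == source_lang else "reverse"
        buckets[r["system"]][direction].append(r["score"])

    summary = {}
    for system, b in buckets.items():
        fwd = mean(b["forward"]) if b["forward"] else None
        rev = mean(b["reverse"]) if b["reverse"] else None
        # The difference is only defined when both directions have data.
        diff = rev - fwd if fwd is not None and rev is not None else None
        summary[system] = {
            "forward": fwd,
            "reverse": rev,
            "reverse_minus_forward": diff,
        }
    return summary
```

A positive `reverse_minus_forward` value would indicate that a system scores higher on reverse-created data than on forward data. The same grouping could be applied to segment-level human scores or to any other per-segment metric.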

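Experimental Results

Research Questions

RQ1: How does translationese in test data affect machine translation evaluation results for both human and automatic metrics?
RQ2: To what extent does forward versus reverse test data affect system rankings and perceptions of human parity?
RQ3: What is the statistical power of the significance tests used in human-parity evaluations of machine translation, and what sample size is appropriate?
RQ4: Can high-profile past evaluations be made more reliable by re-examining their methods and data handling?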

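Key Findings

Reverse-created test data often yields higher human evaluation scores than forward data across many language pairs.
BLEU score differences between forward and reverse test data are mixed: reverse scores are higher in some cases and lower in others, and they are affected by sentence-level variation.
Relative system-to-system differences in BLEU and DA scores are more stable than absolute scores, which supports caution when interpreting rankings.
Power analysis of the significance tests suggests that past human-parity evaluations may have been underpowered, which carries a risk of Type II error.
A re-evaluation using the updated WMT methodology and a larger set of distinct translations provides a more reliable assessment of human parity and highlights the sources of inaccuracy that remain.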
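To make the link between sample size and Type II error concrete, here is a minimal sketch of a power calculation. It uses a normal approximation for a two-sided, two-sample comparison of means, which is a simplifying assumption chosen for illustration and not necessarily the test or procedure used in the paper. The function names and the standardized `effect_size` parameter are likewise hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    `effect_size` is the difference in means divided by the common
    standard deviation (an assumption of this sketch).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


def min_sample_size(effect_size, power=0.8, alpha=0.05):
    """Smallest per-group n reaching the target power (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
```

Under this approximation, smaller true differences between human and machine translation require many more judged translations before a non-significant result can be read as evidence of a tie rather than as a consequence of low power.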

This review was generated by AI and checked by a human editor.