QUICK REVIEW

[Paper Review] To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

Tom Kocmi, Christian Federmann|arXiv (Cornell University)|Jul 22, 2021

Natural Language Processing Techniques | References: 44 | Citations: 82

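One-Sentence Summary

This paper evaluates automatic MT metrics at large scale against human judgements for pairwise system ranking, shows that pretrained metrics (especially COMET and COMET-src) outperform string-based metrics, and proposes best practices for metric use.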

ABSTRACT

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on -- to the best of our knowledge -- the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.

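Motivation and Objectives

Evaluate how reliably automatic MT metrics predict human judgements in pairwise system ranking.
Compare performance across language pairs, domains, and translation directions to assess metric robustness.
Determine whether pretrained metrics outperform traditional string-based metrics across settings.
Provide practical best practices for using automatic MT metrics in research and deployment.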

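Proposed Method

Collect the largest released collection of human judgements (2.3M sentence-level judgements across 4380 systems).
Define binary pairwise accuracy against human judgements as the primary evaluation measure.
Evaluate a range of automatic metrics, both string-based and pretrained, on pairwise system differences.
Use the Wilcoxon signed-rank test and bootstrap resampling to assess significance and reliability.
Analyze performance across language directions, non-English cases, and domains to verify robustness.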
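
The pairwise accuracy measure listed above can be illustrated with a minimal sketch. It assumes that, for every system pair, we already have the difference in metric score and the difference in human score; a pair counts as correct when the two differences have the same sign, and accuracy is the share of correct pairs. The function names and the sample numbers are hypothetical and are not taken from the paper's released code or data.

```python
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(
    metric_deltas: Sequence[float], human_deltas: Sequence[float]
) -> float:
    """Fraction of system pairs where the metric and humans agree on the winner.

    Each position holds the score difference (system A minus system B)
    for one system pair. Hypothetical helper for illustration only.
    """
    if len(metric_deltas) != len(human_deltas):
        raise ValueError("both inputs must cover the same system pairs")
    if not metric_deltas:
        raise ValueError("at least one system pair is required")
    agreements = sum(
        sign(m) == sign(h) for m, h in zip(metric_deltas, human_deltas)
    )
    return agreements / len(metric_deltas)


# Made-up deltas for four system pairs (not real data).
example_metric = [1.2, -0.4, 0.3, -2.0]
example_human = [0.5, 0.1, 0.2, -0.7]
print(pairwise_accuracy(example_metric, example_human))  # 0.75
```

In this toy input the metric disagrees with the human ranking on the second pair only, so three of four pairs are ranked correctly.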

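Experimental Results

Research Questions

RQ1: Which automatic MT metrics best predict human pairwise rankings of MT systems?
RQ2: How do metrics perform across language directions, non-English languages, and different domains?
RQ3: How does statistical significance testing affect the reliability of metrics for pairwise decisions?
RQ4: Does reliance on BLEU bias research and development, and can pretrained metrics mitigate this bias?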

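Key Findings

Pretrained metrics generally outperform string-based metrics in pairwise system ranking, with COMET achieving the highest accuracy.
COMET-src also performs well, even though it uses no human references.
Among string-based metrics, ChrF outperforms BLEU in pairwise ranking accuracy.
Applying paired significance testing (bootstrap resampling) substantially improves the reliability of rankings across metrics.
BLEU often leads to poor decisions and can harm model development, whereas pretrained metrics are robust across languages and domains.
Accuracy stays below 100% even for strongly differing systems, showing that automatic metrics cannot fully replace human evaluation.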
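
The paired bootstrap test mentioned in the findings can be sketched generically as follows. This is a simplified illustration and not the authors' exact procedure: it assumes the system-level score is the mean of per-segment scores (which fits segment-averaged metrics but not corpus-level BLEU, which would have to be recomputed on each resample). The function name, the default number of resamples, and any decision threshold are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_win_rate(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Share of resampled test sets on which system A's mean beats system B's.

    scores_a[i] and scores_b[i] must be metric scores for the same segment.
    Hypothetical helper for illustration only.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    if not scores_a:
        raise ValueError("at least one segment is required")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw segment indices with replacement and reuse them for both
        # systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

A common convention, assumed here rather than taken from the review, is to treat A as significantly better than B only when the win rate exceeds a chosen level such as 0.95; pairs that fall below it are left undecided rather than ranked.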

This review was generated by AI and checked by a human editor.