[Paper Review] Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models
The paper presents Error Analysis Prompting (EAPrompt), a two-stage prompting method that combines Chain-of-Thought and error analysis to enable LLMs to evaluate translation quality in a human-like, explainable way, achieving state-of-the-art system-level and competitive segment-level results across multiple language pairs.
Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that utilizing LLMs for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we investigate several prompting designs and propose a new prompting method called Error Analysis Prompting (EAPrompt), which combines Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2023). This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM; Freitag et al., 2021), and produces explainable and reliable MT evaluations at both the system and segment level. Experimental results from the WMT22 metrics shared task validate the effectiveness of EAPrompt on various LLMs with different structures. Further analysis confirms that EAPrompt effectively distinguishes major errors from minor ones, while also sharing a similar distribution of the number of errors with MQM. These findings highlight the potential of EAPrompt as a human-like evaluator prompting technique for MT evaluation.
Research Motivation and Objectives
- Motivate the need for human-like, explainable MT evaluation beyond traditional metrics.
- Propose a prompting method (EAPrompt) that imitates MQM-style error analysis.
- Demonstrate that EAPrompt improves system-level and segment-level MT evaluation across multiple language pairs.
- Show that EAPrompt works in both reference-based and reference-less settings.
- Provide guidance on prompt design and cost-saving strategies for LLM-based evaluation.
Proposed Method
- Define MT evaluation using MQM-inspired error identification (major/minor errors).
- Combine Chain-of-Thought prompting with explicit Error Analysis to form EAPrompt.
- Use a two-step prompting process: (1) identify errors; (2) count errors and compute scores.
- Adopt one-shot in-context learning with language-pair-specific examples and itemized error demonstrations.
- Experiment with multiple LLMs and prompts on WMT22 data across En-De, En-Ru, Zh-En.
- Evaluate using system-level accuracy and segment-level accuracy with tie calibration (acc*).
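
The two-step process in the list above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `ErrorCounts` type, and the function names are hypothetical, and the weights (5 per major error, 1 per minor error) are the conventional MQM penalties assumed here, which may differ from the values the authors used.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: conventional MQM penalties; the paper's exact weights may differ.
MAJOR_WEIGHT = 5
MINOR_WEIGHT = 1


@dataclass
class ErrorCounts:
    major: int
    minor: int


def build_identify_prompt(source: str, translation: str,
                          reference: Optional[str] = None) -> str:
    """Step 1: ask the model to itemize major and minor errors.

    The wording is hypothetical, not the paper's prompt. Omitting the
    reference gives the reference-less variant.
    """
    lines = [f"Source: {source}"]
    if reference is not None:
        lines.append(f"Reference: {reference}")
    lines.append(f"Translation: {translation}")
    lines.append(
        "List the major errors and the minor errors in the translation, "
        "one item per error."
    )
    return "\n".join(lines)


def score_from_counts(counts: ErrorCounts) -> int:
    """Step 2: turn error counts into a penalty score (0 is best)."""
    return -(MAJOR_WEIGHT * counts.major + MINOR_WEIGHT * counts.minor)
```

Under these assumed weights, a translation with one major and two minor errors scores -7, and an error-free translation scores 0.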

Experimental Results
Research Questions
- RQ1: Can LLMs provide human-like translation quality evaluation with explicit error analysis?
- RQ2: Does EAPrompt improve MT evaluation performance at system and segment levels across multiple language pairs?
- RQ3: How does EAPrompt compare to GEMBA and other baselines in both reference-based and reference-less settings?
- RQ4: What prompt designs (2-step vs 1-step, itemized errors) yield the best performance?
- RQ5: Can inference costs be reduced without significant performance loss (e.g., via regex-based counting)?
Key Findings
| Model / Metric | Prompt | Ref? | System acc., All (3 LPs) | Segment acc*, En-De | Segment acc*, En-Ru | Segment acc*, Zh-En |
|---|---|---|---|---|---|---|
| MetricsX-XXL | - | ✓ | 85.0 | 60.4 | 60.6 | 54.4 |
| BLEURT20 | - | ✓ | 84.7 | 56.8 | 54.0 | 48.9 |
| COMET22 | - | ✓ | 83.9 | 59.4 | 57.7 | 53.6 |
| UniTE | - | ✓ | 82.8 | 59.8 | 57.7 | 51.7 |
| COMET-QE | - | ✗ | 78.1 | 55.5 | 53.4 | 48.3 |
| UniTE-src | - | ✗ | 75.9 | 58.2 | 55.4 | 50.8 |
| MaTESe-QE | - | ✗ | 74.8 | 57.2 | 49.9 | 49.4 |
| Llama2-70b-Chat | GEMBA | ✓ | 74.1 | 53.7 | 48.8 | 45.4 |
| Llama2-70b-Chat | EAPrompt | ✓ | 85.4 (+11.3) | 55.2 (+1.5) | 51.4 (+2.6) | 50.2 (+4.8) |
| Llama2-70b-Chat | GEMBA | ✗ | 72.6 | 54.1 | 47.8 | 45.0 |
| Llama2-70b-Chat | EAPrompt | ✗ | 85.8 (+13.2) | 55.0 (+0.9) | 51.6 (+3.8) | 49.3 (+4.3) |
| GPT-3.5-Turbo | GEMBA | ✓ | 86.5 | 55.2 | 49.5 | 48.2 |
| GPT-3.5-Turbo | EAPrompt | ✓ | 91.2 (+4.7) | 56.7 (+1.5) | 53.3 (+3.8) | 50.0 (+1.8) |
| GPT-3.5-Turbo | GEMBA | ✗ | 86.9 | 54.7 | 50.0 | 47.6 |
| GPT-3.5-Turbo | EAPrompt | ✗ | 89.4 (+2.5) | 55.7 (+1.0) | 53.4 (+3.4) | 48.8 (+1.2) |
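
The segment-level figures rely on accuracy with tie calibration (acc*). The sketch below shows one way such a metric can be computed. It rests on stated assumptions: a metric pair counts as a tie when the score gap is at most a threshold `epsilon`, and `epsilon` is chosen to maximize agreement with the human ranking. The function names are hypothetical, and for brevity the code compares all segments in one pool, whereas a full evaluation would group translations by source segment.

```python
from itertools import combinations


def pairwise_accuracy(human, metric, epsilon=0.0):
    """Share of pairs where the metric's ordering (better, worse, or tie)
    matches the human ordering. Metric gaps within epsilon count as ties."""
    correct = 0
    total = 0
    for (h1, m1), (h2, m2) in combinations(zip(human, metric), 2):
        total += 1
        human_sign = (h1 > h2) - (h1 < h2)
        diff = m1 - m2
        metric_sign = 0 if abs(diff) <= epsilon else (1 if diff > 0 else -1)
        if human_sign == metric_sign:
            correct += 1
    return correct / total if total else 0.0


def tie_calibrated_accuracy(human, metric):
    """Try every observed metric gap as epsilon and keep the best accuracy.
    On equal accuracy the smaller epsilon is preferred."""
    candidates = {0.0}
    for m1, m2 in combinations(metric, 2):
        candidates.add(abs(m1 - m2))
    best_eps = max(
        candidates,
        key=lambda eps: (pairwise_accuracy(human, metric, eps), -eps),
    )
    return pairwise_accuracy(human, metric, best_eps), best_eps
```

With human scores `[0, -1, -5, -5]` and metric scores `[0.0, -2.0, -6.0, -5.5]`, the uncalibrated accuracy is 5/6 because the metric separates the two segments that the human scores tie. Calibration selects `epsilon = 0.5`, which treats that pair as a tie and raises accuracy to 1.0.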
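- EAPrompt with GPT-3.5-Turbo achieves state-of-the-art system-level accuracy over the three language pairs: 91.2 in the reference-based setting, compared with 85.0 for MetricsX-XXL, the strongest baseline metric in the table.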
- EAPrompt outperforms GEMBA at the system level and surpasses GEMBA in 8 of 9 segment-level scenarios across tested LLMs and language pairs.
- EAPrompt remains effective in reference-less settings, maintaining strong performance without references.
- A 2-step prompt with itemized error demonstrations provides the best results among tested variants.
- Error distributions produced by EAPrompt align with MQM for major/minor errors, supporting human-like evaluation characteristics.
- Regular expression-based counting can maintain performance while reducing inference costs.
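
The regex-based counting mentioned in the last finding can be illustrated as follows: the second LLM query, which counts the errors, is replaced by a pattern match over the itemized response from the first query. The response layout assumed here (a "Major errors:" section followed by a "Minor errors:" section, with items numbered `(1)`, `(2)`, ...) and the name `count_errors` are assumptions for illustration. The actual output format depends on the prompt and the model.

```python
import re

# Assumed layout: "Major errors:" section, then "Minor errors:" section.
SECTION_RE = re.compile(
    r"major errors?:(?P<major>.*?)minor errors?:(?P<minor>.*)",
    re.IGNORECASE | re.DOTALL,
)
# Assumed item marker: a line starting with "(1)", "(2)", ...
ITEM_RE = re.compile(r"^\s*\(\d+\)", re.MULTILINE)


def count_errors(response):
    """Return (major_count, minor_count) parsed from an itemized response."""
    match = SECTION_RE.search(response)
    if match is None:
        raise ValueError("response does not follow the assumed itemized format")
    major = len(ITEM_RE.findall(match["major"]))
    minor = len(ITEM_RE.findall(match["minor"]))
    return major, minor


example = (
    "Major errors:\n"
    "(1) 'bank' - mistranslation\n"
    "Minor errors:\n"
    "(1) 'the' - grammar\n"
    "(2) 'colour' - spelling\n"
)
print(count_errors(example))
```

A section that lists no numbered items, such as "Major errors: None", yields a count of zero. Responses that deviate from the assumed layout raise an error, so they can be routed back to an LLM-based counting step.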

This review was generated by AI and checked by a human editor.