QUICK REVIEW

[Paper Review] Findings of the WMT 2024 Shared Task on Chat Translation

Mohammed, Wafaa, António V. Lopes | arXiv (Cornell University) | Oct 15, 2024

Natural Language Processing Techniques | References: 37 | Cited by: 20

One-Sentence Summary

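This paper reports on the third edition of the Chat Translation Shared Task, which evaluates context-aware translation of bilingual customer support chats across five language pairs (en-de, en-fr, en-pt, en-ko, en-nl). Both human and automatic evaluation indicate that context helps, but conversation-level quality remains challenging.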

ABSTRACT

This paper presents the findings from the third edition of the Chat Translation Shared Task. As with previous editions, the task involved translating bilingual customer support conversations, specifically focusing on the impact of conversation context in translation quality and evaluation. We also include two new language pairs: English-Korean and English-Dutch, in addition to the set of language pairs from previous editions: English-German, English-French, and English-Brazilian Portuguese. We received 22 primary submissions and 32 contrastive submissions from eight teams, with each language pair having participation from at least three teams. We evaluated the systems comprehensively using both automatic metrics and human judgments via a direct assessment framework. The official rankings for each language pair were determined based on human evaluation scores, considering performance in both translation directions--agent and customer. Our analysis shows that while the systems excelled at translating individual turns, there is room for improvement in overall conversation-level translation quality.

Motivation and Objectives

Promote research on MT for conversational customer support chats and assess the impact of conversation context on translation quality.
Expand language coverage by adding en-ko and en-nl to the existing en-de, en-fr, and en-pt (Brazilian Portuguese) pairs, and provide curated evaluation sets that emphasize context usage.
Evaluate translation quality using both automatic metrics and human judgments, including discourse-aware analysis and LLM-based error assessments.
Analyze how context integration methods (summaries, graphs, raw context) affect translation, and identify strengths/limitations of current approaches in dialogue scenarios.

Proposed Method

Provide MAIA 2.0 corpus data for training, development, and test splits with context-annotated conversations.
Use automatic metrics (COMET, BLEU, chrF, ContextCometQE) and MuDA discourse tagging to evaluate context handling.
Conduct human evaluation with Direct Assessment and Scalar Quality Metrics (DA+SQM) via Appraise, assessing turn- and conversation-level quality.
Perform LLM-based ContextMQM error analyses on en-de to categorize minor/major/critical errors.
Compare primary and contrastive systems from eight teams, many leveraging LLM-based finetuning, RAG-like context usage, and context-aware decoding (MBR/quality-aware decoding).
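
The MBR (minimum Bayes risk) decoding mentioned above can be illustrated with a minimal sketch: among several candidate translations, pick the one that agrees most with the others under some utility metric. The utility here, `ngram_f1`, is a hypothetical toy character n-gram F1 that stands in for metrics such as chrF or COMET. It is not either metric, and the function names are illustrative assumptions rather than any participant's implementation.

```python
from collections import Counter
from typing import Sequence


def char_ngrams(text: str, n: int) -> Counter:
    """Count the character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f1(hyp: str, ref: str, max_n: int = 3) -> float:
    """Toy utility (assumption): mean character n-gram F1 for n = 1..max_n.

    A simplified stand-in for chrF/COMET, not either metric itself.
    """
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / sum(h.values())
        recall = overlap / sum(r.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


def mbr_select(candidates: Sequence[str]) -> str:
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # Treat every other candidate as a pseudo-reference.
        total = sum(
            ngram_f1(candidates[i], c)
            for j, c in enumerate(candidates)
            if j != i
        )
        return total / (len(candidates) - 1)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]
```

In practice the candidate pool would come from sampling or beam search, and the utility would be a learned metric; swapping `ngram_f1` for such a metric leaves the selection logic unchanged.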

Experimental Results

Research Questions

RQ1: How does incorporating previous turns and different context representations affect translation quality in chat conversations across multiple language pairs?
RQ2: Which system architectures and decoding strategies yield the best performance for agent and customer translations in chat settings?
RQ3: What are the strengths and limitations of automatic metrics versus human judgments in capturing conversation-level translation quality?
RQ4: How do discourse phenomena (pronoun resolution, formality, lexical cohesion, verb form consistency) correlate with evaluated quality across language pairs?
RQ5: What is the impact of using context-aware evaluation methods (ContextCometQE, ContextMQM) on understanding system performance?
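
RQ1 turns on how previous turns are handed to the translation system. As a minimal sketch of the raw-context option, the snippet below joins the last few turns with the current one into a single input string. The `Turn` class, the `[speaker]` tags, and the ` <sep> ` separator are all hypothetical choices made for illustration; the shared task does not prescribe this format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # e.g. "agent" or "customer"
    text: str


def build_context_input(
    history: List[Turn],
    current: Turn,
    window: int = 3,
    sep: str = " <sep> ",  # hypothetical separator token
) -> str:
    """Prefix the current turn with up to `window` previous turns."""
    if window < 0:
        raise ValueError("window must be non-negative")
    # history[-0:] would return the whole list, so handle window == 0 explicitly.
    recent = history[-window:] if window > 0 else []
    parts = [f"[{t.speaker}] {t.text}" for t in recent]
    parts.append(f"[{current.speaker}] {current.text}")
    return sep.join(parts)
```

Setting `window=0` yields a context-free baseline input, which makes it straightforward to compare the same system with and without conversation history.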

Key Findings

Contextual information from prior turns generally improves translation quality across language pairs.
Human evaluations show high turn-level quality but more variability at the conversation level, indicating room for dialogue-level improvements.
Unbabel-IT achieved strong performance across most pairs and criteria, with HW-TSC leading on en-de according to automatic metrics.
Context-aware decoding and MBR-based strategies correlate with higher automatic metrics, but do not always align with human judgments.
Pronoun and formality handling can differ by language, influencing discourse accuracy in evaluations.
LLM-based ContextMQM analysis indicates Unbabel-IT often yields fewer errors, while some teams exhibit higher minor/major/critical error counts.
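
Counts of minor, major, and critical errors like those in the ContextMQM finding above are often rolled up into a single weighted penalty so that systems can be compared on one number. The sketch below assumes weights of 1, 5, and 10; these are an illustrative assumption, since this summary does not state the weights used in the paper, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence

# Assumed severity weights; not taken from the paper.
SEVERITY_WEIGHTS: Dict[str, float] = {"minor": 1.0, "major": 5.0, "critical": 10.0}


def mqm_penalty(
    severities: Iterable[str],
    weights: Dict[str, float] = SEVERITY_WEIGHTS,
) -> float:
    """Sum the weighted error penalties for one translated segment."""
    counts = Counter(severities)
    unknown = set(counts) - set(weights)
    if unknown:
        raise ValueError(f"unknown severities: {sorted(unknown)}")
    return sum(weights[s] * n for s, n in counts.items())


def penalty_per_turn(errors_by_turn: Sequence[Sequence[str]]) -> float:
    """Average weighted penalty over the turns of a conversation (lower is better)."""
    if not errors_by_turn:
        raise ValueError("conversation has no turns")
    return sum(mqm_penalty(errs) for errs in errors_by_turn) / len(errors_by_turn)
```

A per-turn average like this hides whether errors cluster in a few turns, which is one reason turn-level and conversation-level judgments can diverge.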

This review was generated by AI and reviewed by human editors.