QUICK REVIEW

[Paper Review] CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task

Ricardo Rei, Marcos Treviso|arXiv (Cornell University)|Sep 13, 2022

Topic Modeling | Cited by 33

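One-Sentence Summary

CometKiwi combines the COMET and OpenKiwi architectures to tackle the WMT 2022 QE shared task, showing strong multilingual generalization, effective few-shot adaptation, and a novel explainability method that fuses attention with gradient information.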

ABSTRACT

We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated on all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it with a word-level sequence tagger and an explanation extractor. Our results suggest that incorporating references during pretraining improves performance across several language pairs on downstream tasks, and that jointly training with sentence and word-level objectives yields a further boost. Furthermore, combining attention and gradient information proved to be the top strategy for extracting good explanations of sentence-level QE models. Overall, our submissions achieved the best results for all three tasks for almost all language pairs by a considerable margin.

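Motivation and Objectives

Advance multilingual quality estimation (QE) through the joint IST and Unbabel submission to the WMT 2022 QE shared task.
Use the COMET framework together with OpenKiwi's predictor-estimator architecture for sentence-level and word-level QE.
Study pretraining on reference-rich data and few-shot adaptation to unseen languages.
Develop explainable QE based on attention and gradients, using Head Mix to weight attention heads for better explanations.
Demonstrate strong, consistent improvements across three QE settings (sentence-level, word-level, and explainable QE).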

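Proposed Method

Extend the predictor-estimator architecture within the COMET framework and add a word-level sequence tagger.
Pretrain with a reference-augmented objective on Direct Assessments (DAs) from the Metrics shared task.
Fine-tune on MLQE-PE and MQM data; experiment with several multilingual backbones (XLM-R, InfoXLM, RemBERT).
Train jointly on sentence-level and word-level objectives with a combined loss to improve multilingual generalization.
Build explainable QE by combining attention with GradNorm, and introduce a Head Mix module that weights attention heads for better explanations.
Use language prefix tokens for the word-level task to aid zero-shot adaptation, and ensemble multiple models for robustness.
Evaluate on language pairs with Direct Assessment and MQM annotations, and compare constrained and unconstrained settings for explainable QE and critical error detection.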
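
The joint sentence-level and word-level objective listed above can be pictured as a weighted sum of a regression loss and a token-tagging loss. The sketch below is only an illustration under stated assumptions: the squared-error and negative log-likelihood choices, the `word_weight` hyperparameter, and every function name are hypothetical and are not taken from the paper.

```python
import math

def sentence_loss(pred_score, gold_score):
    """Squared error between predicted and gold sentence-level quality scores."""
    return (pred_score - gold_score) ** 2

def word_loss(bad_probs, gold_tags):
    """Mean negative log-likelihood of the gold OK/BAD tags.

    bad_probs: per-token probability of the BAD tag (hypothetical model output).
    gold_tags: per-token gold labels, 1 for BAD and 0 for OK.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p_bad, tag in zip(bad_probs, gold_tags):
        p_gold = p_bad if tag == 1 else 1.0 - p_bad
        total += -math.log(max(p_gold, eps))
    return total / len(gold_tags)

def joint_loss(pred_score, gold_score, bad_probs, gold_tags, word_weight=0.5):
    """Combine both objectives; word_weight is an assumed hyperparameter."""
    return (sentence_loss(pred_score, gold_score)
            + word_weight * word_loss(bad_probs, gold_tags))

# Toy example: one translation with three target tokens, the first tagged BAD.
loss = joint_loss(0.7, 0.5, [0.9, 0.2, 0.1], [1, 0, 0])
print(loss)
```

Setting `word_weight` to zero recovers a purely sentence-level objective, which makes it easy to compare joint and single-task training in this toy setup.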

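Experimental Results

Research Questions

RQ1: Does the hybrid COMET-OpenKiwi architecture with word-level tagging improve multilingual QE performance on sentence-level and word-level tasks?
RQ2: Does pretraining on reference-rich metrics data, with references included during pretraining, improve downstream QE across language pairs?
RQ3: Does few-shot adaptation with only 500 samples generalize to unseen language pairs without hurting performance on seen language pairs?
RQ4: Do gradient-enhanced attention explanations with head-aware aggregation improve explainable QE across language pairs and in zero-shot settings?
RQ5: How do ensembling strategies across encoders and supervision signals perform on DA and MQM data?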

Key Findings

Encoder	km-en	ps-en	en-ja	en-cs	en-mr	ru-en	ro-en	en-zh	en-de	et-en	si-en	ne-en	avg.
Final ensemble	0.666	0.669	0.380	0.591	0.593	0.782	0.871	0.597	0.593	0.845	0.588	0.820	0.666
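
As a quick consistency check, the avg. column appears to be the unweighted mean of the twelve language-pair scores in the row. The snippet below recomputes it from the values in the table; the variable names are illustrative only.

```python
# Per-language-pair scores copied from the final ensemble row of the table.
scores = {
    "km-en": 0.666, "ps-en": 0.669, "en-ja": 0.380, "en-cs": 0.591,
    "en-mr": 0.593, "ru-en": 0.782, "ro-en": 0.871, "en-zh": 0.597,
    "en-de": 0.593, "et-en": 0.845, "si-en": 0.588, "ne-en": 0.820,
}

# Unweighted (macro) average across language pairs.
average = sum(scores.values()) / len(scores)

# The recomputed mean should agree with the reported avg. of 0.666
# up to rounding in the third decimal place.
print(abs(average - 0.666) < 0.001)
```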

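An ensemble of six multilingual systems reaches a state-of-the-art Spearman correlation of 0.572 on sentence-level DA (about 7% above the second-place system).
Word-level MCC reaches 0.341, about 2.4% above the second-place system.
Explainable QE reaches an R@K of 0.486, about 10% above the second-place system.
Pretraining on Metrics data with references included during training improves downstream QE correlations across multiple language pairs.
Few-shot adaptation with 500 samples yields gains of 2–3% on unseen language pairs without hurting correlations on seen language pairs.
Attention × GradNorm combined with Head Mix gives better explanations and helps identify useful heads for zero-shot languages.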
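
The idea of scaling attention by gradient norms and then weighting heads can be illustrated with a toy example. This is a rough sketch under stated assumptions, not the paper's implementation: GradNorm is read here as the L2 norm of a gradient vector per head and token (what exactly is differentiated is left abstract), fixed `head_weights` stand in for whatever the Head Mix module learns, and all inputs are made-up values.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def attention_times_gradnorm(attn, grads):
    """Per-head token relevance: attention weight times gradient norm.

    attn[h][t]: attention mass head h assigns to token t.
    grads[h][t]: gradient vector associated with head h and token t.
    """
    return [[a * l2_norm(g) for a, g in zip(attn_h, grads_h)]
            for attn_h, grads_h in zip(attn, grads)]

def mix_heads(per_head_scores, head_weights):
    """Weighted average over heads (assumed stand-in for Head Mix)."""
    total = sum(head_weights)
    n_tokens = len(per_head_scores[0])
    return [sum(w * head[t] for w, head in zip(head_weights, per_head_scores)) / total
            for t in range(n_tokens)]

def top_k_tokens(scores, k):
    """Indices of the k highest-scoring tokens, used as the explanation."""
    return sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]

# Toy inputs: two heads, three tokens, two-dimensional gradient vectors.
attn = [[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]]
grads = [[[3.0, 4.0], [0.0, 1.0], [1.0, 0.0]],
         [[0.0, 0.0], [6.0, 8.0], [0.0, 2.0]]]

token_scores = mix_heads(attention_times_gradnorm(attn, grads), [1.0, 1.0])
print(top_k_tokens(token_scores, 2))
```

Changing `head_weights` shifts which tokens are ranked highest, which is one way to picture how weighting heads could sharpen the resulting explanations.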


This review was generated by AI and reviewed by human editors.