QUICK REVIEW

[Paper Review] MTQE.en-he: Machine Translation Quality Estimation for English-Hebrew

Andy Rosenbaum, Assaf Siani|arXiv (Cornell University)|Feb 6, 2026

Natural Language Processing Techniques|Cited by: 0

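One-Sentence Summary

This paper releases MTQE.en-he, to the authors' knowledge the first publicly available English-Hebrew MTQE benchmark, evaluates ChatGPT prompting, TransQuest, and CometKiwi on it, and shows that ensembling the three outperforms any single model while parameter-efficient fine-tuning improves performance further.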

ABSTRACT

We release MTQE.en-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew, and Direct Assessment scores of the translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.

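Motivation and Objectives

Create and release a publicly available English-Hebrew MTQE dataset in which every translation carries Direct Assessment scores from three human annotators.
Benchmark baseline models (ChatGPT prompting, TransQuest, CometKiwi) on MTQE.en-he.
Explore model ensembling and parameter-efficient fine-tuning to improve MTQE performance on a low-resource language pair.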

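Proposed Method

Build MTQE.en-he from 959 English segments in WMT24++, covering four domains.
Have three experts with native-level proficiency annotate each segment with a Direct Assessment score, and use the average of the three scores as ground truth.
Evaluate the baselines (ChatGPT prompting, TransQuest, CometKiwi) by computing Pearson and Spearman correlation against the ground truth.
Ensemble the model predictions to improve accuracy over the best single model.
Fine-tune TransQuest and CometKiwi with three parameter-efficient methods (LoRA, BitFit, FTHead) and compare them against full fine-tuning.
Provide the seeds for five data splits, together with reproducibility notes.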
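
The evaluation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the toy scores are invented, and the ensembling rule (an unweighted mean of z-scored predictions) is an assumption, since this review does not say how the paper combines the models.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def zscore(values):
    """Standardize scores so models on different scales can be averaged."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def ensemble(*model_scores):
    """ASSUMED ensembling rule: unweighted mean of z-scored predictions."""
    standardized = [zscore(scores) for scores in model_scores]
    return [mean(column) for column in zip(*standardized)]


# Toy data (invented): three annotators' DA scores for each of four segments.
annotations = [(90, 85, 95), (40, 50, 45), (70, 60, 65), (20, 30, 10)]
gold = [mean(a) for a in annotations]  # ground truth = mean of the three scores

# Toy predictions from two hypothetical QE models that use different scales.
model_a = [0.8, 0.5, 0.6, 0.1]
model_b = [75, 30, 80, 25]

combined = ensemble(model_a, model_b)
print(pearson(combined, gold), spearman(combined, gold))
```

Standardizing before averaging keeps a model with a wide score range (here `model_b`) from dominating one that scores in [0, 1]; a weighted or learned combination would be an equally plausible choice.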

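Experimental Results

Research Questions

RQ1: Can standard QE models produce reliable quality estimates on the publicly released English-Hebrew MTQE dataset?
RQ2: How do ChatGPT prompting, TransQuest, and CometKiwi compare with one another on MTQE.en-he?
RQ3: Does ensembling across models improve MTQE accuracy beyond the best single model?
RQ4: Do lightweight, parameter-efficient fine-tuning methods improve MTQE performance for English-Hebrew?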

Key Findings

Model	Pearson All	Spearman All	Pearson Test	Spearman Test
ChatGPT-freestyle	0.4266	0.5018	0.4136	0.5020
ChatGPT-guidelines	0.4256	0.5074	0.4119	0.5087
TransQuest-multilingual	0.3759	0.4303	0.3608	0.4235
TransQuest-en-any	0.4327	0.4501	0.4205	0.4537
CometKiwi	0.4828	0.5456	0.4495	0.5305
Ensemble(GPT-f, TQ)	0.5028	0.5622	0.4876	0.5608
Ensemble(GPT-f, CK)	0.5211	0.5929	0.4992	0.5798
Ensemble(TQ, CK)	0.5081	0.5459	0.4810	0.5390
Ensemble(GPT-f, TQ, CK)	0.5472	0.6014	0.5250	0.5926
TQ+FullFT	-	-	0.4287	0.4608
TQ+LoRA	-	-	0.4445	0.4828
TQ+BitFit	-	-	0.4424	0.4799
TQ+FTHead	-	-	0.4358	0.4718
CK+FullFT	-	-	0.4236	0.5034
CK+LoRA	-	-	0.4670	0.5554
CK+BitFit	-	-	0.4647	0.5551
CK+FTHead	-	-	0.4693	0.5449
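
The abstract's headline gains can be checked directly against the "All" columns of the table. The short sketch below copies the two relevant rows and converts the correlation differences to percentage points; the dictionary keys and the helper name `gain_pp` are chosen here for illustration.

```python
# "All" scores copied from the table above.
scores = {
    "CometKiwi": {"pearson_all": 0.4828, "spearman_all": 0.5456},
    "Ensemble(GPT-f, TQ, CK)": {"pearson_all": 0.5472, "spearman_all": 0.6014},
}


def gain_pp(system, baseline, metric):
    """Difference between two correlation scores, in percentage points."""
    return round((scores[system][metric] - scores[baseline][metric]) * 100, 1)


print(gain_pp("Ensemble(GPT-f, TQ, CK)", "CometKiwi", "pearson_all"))   # 6.4
print(gain_pp("Ensemble(GPT-f, TQ, CK)", "CometKiwi", "spearman_all"))  # 5.6
```

Both values match the 6.4 and 5.6 percentage-point improvements stated in the abstract.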

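The ensemble of ChatGPT-freestyle, TransQuest, and CometKiwi performs best on the full dataset (All), with Pearson 0.5472 and Spearman 0.6014; on the test set (Test) it reaches 0.5250 and 0.5926.
The best single model, CometKiwi, reaches Pearson 0.4828 and Spearman 0.5456 on All, and 0.4495 and 0.5305 on Test.
ChatGPT prompting on its own (freestyle variant) reaches Pearson 0.4266 and Spearman 0.5018 on All.
Full fine-tuning is unstable: in the table, CK+FullFT falls below the CometKiwi baseline on Test (Pearson 0.4236 vs. 0.4495) and TQ+FullFT stays close to the TransQuest-en-any row (0.4287 vs. 0.4205), whereas the parameter-efficient methods (LoRA, BitFit, FTHead) give stable gains of roughly 2-3 percentage points for both models.
Fine-tuning with LoRA, BitFit, or FTHead is reported to improve both the ensemble and the individual models while being less prone to overfitting, in contrast to FullFT, which shows distribution collapse.
The MTQE.en-he baseline results and experimental setup provide a foundation for future research on English-Hebrew QE and on low-resource language pairs.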
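
To make the difference between the fine-tuning strategies concrete, the sketch below lists which parameters each one would update. It is only an illustration under stated assumptions: the parameter names and the `head.` prefix are hypothetical and do not come from TransQuest or CometKiwi, BitFit is simplified here to "bias terms only", and LoRA is left out because it adds new low-rank matrices rather than selecting existing parameters.

```python
# Hypothetical parameter names for a small encoder with a regression head.
PARAM_NAMES = [
    "encoder.layer.0.attention.query.weight",
    "encoder.layer.0.attention.query.bias",
    "encoder.layer.0.output.dense.weight",
    "encoder.layer.0.output.dense.bias",
    "head.dense.weight",
    "head.dense.bias",
]


def trainable(param_names, method, head_prefix="head."):
    """Return the parameter names that would be updated under each strategy."""
    if method == "FullFT":
        # Full fine-tuning: every parameter is updated.
        return list(param_names)
    if method == "BitFit":
        # Simplified BitFit (assumption): update bias terms only.
        return [n for n in param_names if n.endswith(".bias")]
    if method == "FTHead":
        # FTHead: update only the prediction head, freeze the encoder.
        return [n for n in param_names if n.startswith(head_prefix)]
    raise ValueError(f"unknown method: {method}")


for method in ("FullFT", "BitFit", "FTHead"):
    selected = trainable(PARAM_NAMES, method)
    print(method, len(selected), "of", len(PARAM_NAMES), "parameter tensors")
```

In a PyTorch training loop the same selection would typically be applied by setting `requires_grad = False` on every parameter outside the returned list. Updating far fewer parameters is one plausible reason the parameter-efficient runs are less prone to the overfitting seen with FullFT.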

[Figure, panel (b): English source word length distribution.]


This review was generated by AI and checked by a human editor.