[Paper Review] MTQE.en-he: Machine Translation Quality Estimation for English-Hebrew
The paper releases MTQE.en-he, the first public English-Hebrew MTQE benchmark, and evaluates ChatGPT prompting, TransQuest, and CometKiwi on it. Ensembling the three models outperforms the best single model, and parameter-efficient fine-tuning of TransQuest and CometKiwi yields stable gains.
We release MTQE.en-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew, and Direct Assessment scores of the translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.
Research Motivation and Objectives
- Create and release a publicly available English-Hebrew MTQE dataset in which every translation carries Direct Assessment scores from three human experts.
- Benchmark baseline models (ChatGPT prompting, TransQuest, CometKiwi) on MTQE.en-he.
- Explore model ensembling and parameter-efficient fine-tuning to improve MTQE for a low-resource language pair.
Proposed Method
- Construct MTQE.en-he from 959 English segments from WMT24++ across four domains.
- Have three native-level experts annotate each segment with a Direct Assessment score, and average the three scores to obtain the ground truth.
- Evaluate baselines: ChatGPT prompting, TransQuest, and CometKiwi; compute Pearson and Spearman correlations.
- Ensemble the models' predictions and test whether the ensemble beats the best single model.
- Fine-tune TransQuest and CometKiwi using three parameter-efficient methods (LoRA, BitFit, FTHead), with full fine-tuning as a comparison.
- Provide seeds and reproducibility notes for five data splits.
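The evaluation step in the list above (average the three annotator scores, then correlate model predictions with that average) can be sketched as follows. This is a minimal, dependency-free illustration, not the authors' evaluation code: the toy scores, the variable names, and the tie handling in `average_ranks` are all assumptions made for the example.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (NOT from MTQE.en-he): three DA scores per segment.
annotator_scores = [
    (90, 85, 95),
    (40, 55, 50),
    (70, 60, 65),
    (20, 30, 25),
]
# Ground truth is the mean of the three annotators, as described above.
gold = [mean(scores) for scores in annotator_scores]
# Hypothetical model predictions for the same four segments.
predictions = [0.80, 0.55, 0.50, 0.10]

print("Pearson:", pearson(predictions, gold))
print("Spearman:", spearman(predictions, gold))
```

Pearson rewards predictions that are linearly related to the human scores, while Spearman only looks at the ordering, which is why the review reports both.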

Experimental Results
Research Questions
- RQ1: Can a publicly released English-Hebrew MTQE dataset yield reliable quality estimates from standard QE models?
- RQ2: How do ChatGPT prompting, TransQuest, and CometKiwi compare with one another on MTQE.en-he?
- RQ3: Does ensembling across models improve MTQE accuracy beyond the best single model?
- RQ4: Do lightweight, parameter-efficient fine-tuning methods improve MTQE performance for English-Hebrew?
Key Findings
| Model | Pearson All | Spearman All | Pearson Test | Spearman Test |
|---|---|---|---|---|
| ChatGPT-freestyle | 0.4266 | 0.5018 | 0.4136 | 0.5020 |
| ChatGPT-guidelines | 0.4256 | 0.5074 | 0.4119 | 0.5087 |
| TransQuest-multilingual | 0.3759 | 0.4303 | 0.3608 | 0.4235 |
| TransQuest-en-any | 0.4327 | 0.4501 | 0.4205 | 0.4537 |
| CometKiwi | 0.4828 | 0.5456 | 0.4495 | 0.5305 |
| Ensemble(GPT-f, TQ) | 0.5028 | 0.5622 | 0.4876 | 0.5608 |
| Ensemble(GPT-f, CK) | 0.5211 | 0.5929 | 0.4992 | 0.5798 |
| Ensemble(TQ, CK) | 0.5081 | 0.5459 | 0.4810 | 0.5390 |
| Ensemble(GPT-f, TQ, CK) | 0.5472 | 0.6014 | 0.5250 | 0.5926 |
| TQ+FullFT | - | - | 0.4287 | 0.4608 |
| TQ+LoRA | - | - | 0.4445 | 0.4828 |
| TQ+BitFit | - | - | 0.4424 | 0.4799 |
| TQ+FTHead | - | - | 0.4358 | 0.4718 |
| CK+FullFT | - | - | 0.4236 | 0.5034 |
| CK+LoRA | - | - | 0.4670 | 0.5554 |
| CK+BitFit | - | - | 0.4647 | 0.5551 |
| CK+FTHead | - | - | 0.4693 | 0.5449 |
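The differences between rows of the table can be checked with simple arithmetic. The sketch below copies a few (Pearson, Spearman) pairs from the table and expresses each difference in percentage points; the helper name `gain_pp` is introduced here purely for illustration.

```python
# (Pearson, Spearman) pairs copied from the table above.
all_scores = {
    "CometKiwi": (0.4828, 0.5456),
    "Ensemble(GPT-f, TQ, CK)": (0.5472, 0.6014),
}
test_scores = {
    "CometKiwi": (0.4495, 0.5305),
    "CK+FullFT": (0.4236, 0.5034),
    "CK+FTHead": (0.4693, 0.5449),
}


def gain_pp(new, base):
    """Difference between two (Pearson, Spearman) pairs in percentage points."""
    return tuple(round((n - b) * 100, 1) for n, b in zip(new, base))


# Three-model ensemble vs. best single model on the full dataset.
ensemble_gain = gain_pp(all_scores["Ensemble(GPT-f, TQ, CK)"], all_scores["CometKiwi"])
# Full fine-tuning vs. the CometKiwi baseline on the test set.
full_ft_gain = gain_pp(test_scores["CK+FullFT"], test_scores["CometKiwi"])
# Head-only fine-tuning vs. the CometKiwi baseline on the test set.
ft_head_gain = gain_pp(test_scores["CK+FTHead"], test_scores["CometKiwi"])

print(ensemble_gain)  # (6.4, 5.6)
print(full_ft_gain)   # (-2.6, -2.7)
print(ft_head_gain)   # (2.0, 1.4)
```

The first result reproduces the 6.4 and 5.6 percentage-point gains quoted in the abstract, which shows those figures refer to the full dataset (All) rather than the test split.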
- Ensemble of ChatGPT-freestyle, TransQuest, and CometKiwi achieves the best performance with Pearson 0.5472 and Spearman 0.6014 on the full dataset (All) and 0.5250 and 0.5926 on the test set.
- The best single model (CometKiwi) achieves Pearson 0.4828 and Spearman 0.5456 (All), and 0.4495 and 0.5305 (Test).
- The two ChatGPT prompt variants perform almost identically, at about 0.43 Pearson and 0.50-0.51 Spearman (All).
- Full fine-tuning is unreliable: on the test set, CK+FullFT (0.4236 / 0.5034) falls below the CometKiwi baseline (0.4495 / 0.5305), and TQ+FullFT (0.4287 / 0.4608) stays within about one point of TransQuest-en-any (0.4205 / 0.4537). The authors attribute this to overfitting and distribution collapse.
- The parameter-efficient methods (LoRA, BitFit, FTHead) train stably and improve test correlations by roughly 2-3 percentage points for both TransQuest and CometKiwi.
- MTQE.en-he baseline results and experimental setups enable further research on English-Hebrew QE and low-resource language pairs.
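To make the ensembling finding concrete, here is a minimal sketch of one common way to combine quality-estimation scores from models that use different scales. This review does not state how the paper combines predictions, so the z-score averaging below is an assumption, and all scores and names are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize scores to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant predictor carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def ensemble(*model_predictions):
    """Average the standardized predictions of several models per segment."""
    normalized = [zscore(p) for p in model_predictions]
    return [mean(segment) for segment in zip(*normalized)]


# Hypothetical predictions on four segments; each model uses its own scale.
gpt_scores = [85, 60, 70, 30]
tq_scores = [0.62, 0.41, 0.55, 0.20]
ck_scores = [0.81, 0.70, 0.64, 0.35]

combined = ensemble(gpt_scores, tq_scores, ck_scores)
print(combined)
```

Standardizing first keeps a model with a wide numeric range (such as a 0-100 prompt output) from dominating the average. Weighted averaging or a learned combiner are alternative designs.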

This review was generated by AI and checked by a human editor.