QUICK REVIEW

[Paper Review] Incorporating BERT into Neural Machine Translation

Jinhua Zhu, Yingce Xia|arXiv (Cornell University)|Feb 17, 2020
Topic Modeling|References: 31|Citations: 173
One-Sentence Summary

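Proposes the BERT-fused model, which injects BERT representations into every encoder and decoder layer of a Transformer-based NMT system via attention, and achieves state-of-the-art results on supervised, semi-supervised, and unsupervised MT tasks across multiple benchmark datasets.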

ABSTRACT

The recently proposed BERT has shown great power on a variety of natural language understanding tasks, such as text classification, reading comprehension, etc. However, how to effectively apply BERT to neural machine translation (NMT) lacks enough exploration. While BERT is more commonly used as fine-tuning instead of contextual embedding for downstream language understanding tasks, in NMT, our preliminary exploration of using BERT as contextual embedding is better than using for fine-tuning. This motivates us to think how to better leverage BERT for NMT along this direction. We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms. We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets. Our code is available at https://github.com/bert-nmt/bert-nmt.

Research Motivation and Objectives

  • Leverage pre-trained BERT for neural machine translation without training BERT from scratch.
  • Develop a BERT-fused model that connects BERT representations to all NMT layers via attention.
  • Improve translation quality across low-resource and high-resource settings, including document-level and semi-supervised scenarios.
  • Evaluate the approach on multiple language pairs and MT paradigms (supervised, semi-supervised, unsupervised).

Proposed Method

  • Obtain BERT representations for input sequences and fuse them with each encoder/decoder layer using dual attention mechanisms (BERT-encoder and BERT-decoder attention).
  • Compute fused layer representations with a 2-way attention scheme combining standard NMT attention and BERT-derived attention.
  • Introduce a drop-net regularization to encourage balanced use of BERT and NMT features during training.
  • Train in stages: first train a standard NMT model, then initialize the BERT-fused model from it and add the BERT-fusion components while keeping BERT frozen.
  • Handle document-level translation by including preceding context sentences in the input from which the BERT representations are computed, to improve translation coherence.
  • Evaluate with BLEU across supervised, semi-supervised (back-translation), and unsupervised MT settings.
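
The fusion and drop-net steps above can be sketched for a single position as follows. This is a minimal illustration, not the authors' implementation: the function names `drop_net_combine` and `fuse_with_residual`, the parameter `p_net`, and the exact drop rule (keep only one branch with probability `p_net / 2` each during training, average otherwise and at inference) are assumptions that this review does not spell out. The attention outputs are taken as plain lists of floats that have already been computed.

```python
import random


def drop_net_combine(self_out, bert_out, p_net=1.0, training=True, rng=random):
    """Combine self-attention and BERT-attention outputs for one position.

    Assumed rule (not stated in this review): during training, with
    probability p_net / 2 keep only the self-attention output, with
    probability p_net / 2 keep only the BERT-attention output, and
    otherwise average the two. At inference time, always average.
    """
    if len(self_out) != len(bert_out):
        raise ValueError("attention outputs must have the same dimension")
    if not 0.0 <= p_net <= 1.0:
        raise ValueError("p_net must lie in [0, 1]")
    if training:
        u = rng.random()  # uniform sample in [0, 1)
        if u < p_net / 2:
            return list(self_out)   # NMT branch only
        if u >= 1 - p_net / 2:
            return list(bert_out)   # BERT branch only
    # Default case and inference: equal-weight average of both branches.
    return [0.5 * (s + b) for s, b in zip(self_out, bert_out)]


def fuse_with_residual(h_prev, self_out, bert_out, **kwargs):
    """Add the combined attention output back onto the layer input (residual)."""
    combined = drop_net_combine(self_out, bert_out, **kwargs)
    return [h + c for h, c in zip(h_prev, combined)]


# Inference-time usage: both branches are averaged, then added to the input.
h = fuse_with_residual([1.0, 1.0], [2.0, 4.0], [4.0, 8.0], training=False)
print(h)  # [4.0, 7.0]
```

Dropping one branch at random during training means the layer cannot rely exclusively on either the BERT features or the NMT features, which is the balancing effect the drop-net bullet describes.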

Experimental Results

Research Questions

  • RQ1: Can pre-trained BERT representations, when fused into all NMT layers via attention, improve translation quality across language pairs?
  • RQ2: Does leveraging BERT as contextual embeddings outperform simply initializing NMT with BERT or using BERT as input embeddings alone?
  • RQ3: How does the BERT-fused approach perform in low-resource vs. high-resource settings, including document-level and semi-supervised scenarios?
  • RQ4: What is the impact of the drop-net regularization on generalization and performance?
  • RQ5: Can the method achieve state-of-the-art results in unsupervised MT tasks?

Key Findings

  • BERT-fused models outperform standard Transformer baselines across all tested IWSLT and WMT tasks, with BLEU gains ranging from about 1.5 to 2.8 on several language pairs.
  • On IWSLT’14 De→En, the method achieves a new record BLEU of 36.11, surpassing prior results.
  • On WMT’14 En→De and En→Fr, BLEU scores reach 30.75 and 43.78 respectively, outperforming baselines and several contemporary models.
  • In semi-supervised Ro→En, the approach achieves 39.10 BLEU, surpassing XLM and prior back-translation baselines.
  • In unsupervised En↔Fr and En↔Ro translation, the method achieves state-of-the-art BLEU scores (38.27/35.62/36.02/33.20 for the four tasks).
  • Document-level translation with BERT-fusion further improves De→En to 36.69 BLEU, showing effectiveness for cross-sentence context.

This review was generated by AI and checked by a human editor.