QUICK REVIEW

[Paper Review] The Sockeye 2 Neural Machine Translation Toolkit at AMTA 2020

Tobias Domhan, Michael Denkowski | arXiv (Cornell University) | Aug 11, 2020

Natural Language Processing Techniques | 38 references | 68 citations

TL;DR

Sockeye 2 is an NMT toolkit built on MXNet's Gluon API that speeds up training and inference through state-of-the-art Transformer models, distributed mixed-precision training, and 8-bit quantized CPU decoding, serving both research and production use.

ABSTRACT

We present Sockeye 2, a modernized and streamlined version of the Sockeye neural machine translation (NMT) toolkit. New features include a simplified code base through the use of MXNet's Gluon API, a focus on state of the art model architectures, distributed mixed precision training, and efficient CPU decoding with 8-bit quantization. These improvements result in faster training and inference, higher automatic metric scores, and a shorter path from research to production.

Motivation & Objective

Introduce Sockeye 2 as a streamlined MXNet Gluon-based NMT toolkit.
Present improvements in model architectures, training speed, and inference efficiency.
Demonstrate 8-bit quantization for CPU decoding and its impact on latency and BLEU.
Showcase training enhancements via Horovod and automatic mixed precision.
Provide evidence from experiments on Transformer variants, source factors, and robustness.

Proposed method

Adopt MXNet's Gluon API to simplify the code base and allow switching between execution modes (eager execution vs. cached computation graphs).
Experiment with state-of-the-art Transformer architectures, including deep encoder/decoder configurations.
Introduce source factors and various embedding combinations to improve robustness to input variation.
Implement 8-bit quantization for CPU inference to reduce latency with minimal BLEU loss.
Integrate Horovod for distributed training and AMP for mixed precision to scale training.
Introduce a plateau-reduce learning schedule to improve training efficiency and final model quality.
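The plateau-reduce idea in the last item can be illustrated with a small sketch. It assumes the generic reduce-on-plateau behavior: keep the learning rate fixed while validation perplexity improves, and multiply it by a factor after a number of checkpoints without improvement. The class name `PlateauReduce`, the helper `run_schedule`, and the default `factor` and `patience` values are hypothetical and chosen for illustration; this is not the toolkit's own implementation, and any additional behavior the toolkit applies on a plateau is not modeled.

```python
class PlateauReduce:
    """Illustrative reduce-on-plateau learning rate schedule (not Sockeye code)."""

    def __init__(self, initial_lr, factor=0.5, patience=2):
        if not 0.0 < factor < 1.0:
            raise ValueError("factor must be in (0, 1)")
        if patience < 1:
            raise ValueError("patience must be at least 1")
        self.lr = initial_lr
        self.factor = factor        # assumed reduction factor
        self.patience = patience    # assumed number of stale checkpoints tolerated
        self.best = float("inf")    # lower validation perplexity is better
        self.stale = 0              # checkpoints since the last improvement

    def step(self, val_perplexity):
        """Record one checkpoint and return the learning rate to use next."""
        if val_perplexity < self.best:
            self.best = val_perplexity
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


def run_schedule(perplexities, initial_lr=0.0002, factor=0.5, patience=2):
    """Return the learning rate after each checkpoint in `perplexities`."""
    schedule = PlateauReduce(initial_lr, factor, patience)
    return [schedule.step(p) for p in perplexities]
```

Calling `run_schedule` with a list of per-checkpoint validation perplexities returns the learning rate in effect after each checkpoint, which makes it easy to see where the reductions happen.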

Experimental results

Research questions

RQ1: How does Sockeye 2 perform with state-of-the-art Transformer architectures compared to prior Sockeye versions?
RQ2: What is the impact of 8-bit CPU quantization on decoding latency and BLEU scores across configurations?
RQ3: Do source factors improve robustness to case and orthographic variations, and which embedding strategies work best?
RQ4: How effective is Horovod-based distributed training and mixed-precision training for large-scale NMT models, and how does plateau-reduce scheduling compare to prior schedules?
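To make the source factor question concrete, the following sketch shows one way a case factor could be attached to each lowercased token and how a word embedding and a factor embedding might then be combined. Summation and concatenation are used here as two common combination strategies; the factor labels, the function names `case_factor`, `annotate`, and `combine`, and the toy vectors are assumptions for illustration and are not taken from the paper or the toolkit.

```python
def case_factor(token):
    """Return a hypothetical case label for a token."""
    if token.islower():
        return "lower"
    if token.istitle():
        return "title"
    if token.isupper():
        return "upper"
    return "other"  # mixed case, digits, punctuation


def annotate(tokens):
    """Pair each lowercased token with its case factor."""
    return [(token.lower(), case_factor(token)) for token in tokens]


def combine(word_vec, factor_vec, how):
    """Combine a word embedding with a factor embedding by sum or concatenation."""
    if how == "sum":
        if len(word_vec) != len(factor_vec):
            raise ValueError("sum requires vectors of equal size")
        return [w + f for w, f in zip(word_vec, factor_vec)]
    if how == "concat":
        return list(word_vec) + list(factor_vec)
    raise ValueError("unknown combination: " + how)
```

With this setup the model sees the same lowercased word for "The" and "the", while the case information travels in the separate factor. Summation keeps the input size unchanged but requires equally sized embeddings; concatenation allows a small factor embedding at the cost of a wider input.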

Key findings

Transformer variants with deeper encoders and shallower decoders can yield competitive BLEU with substantially lower decoding latency.
8-bit quantization significantly reduces non-batched decoding times on CPUs with minimal BLEU degradation.
Source factors for input case information improve robustness to case variation, with certain factor strategies performing best in experiments.
Plateau-reduce training yields strong BLEU scores with shorter training times compared to the Ott et al. (2018) setup in the reported benchmarks.
Horovod-enabled distributed training with AMP can improve training efficiency, allowing larger effective batch sizes and faster convergence.
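The 8-bit quantization finding above can be illustrated with a minimal sketch of symmetric linear quantization, one common scheme: each float is mapped to an integer in [-127, 127] using a single scale derived from the largest absolute value. This is an assumption about the general technique, not a description of the toolkit's kernels; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for empty or all-zero input to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]
```

The round trip loses at most about half a quantization step per value, which gives some intuition for why the BLEU impact can stay small while the integer arithmetic runs faster on CPUs.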

This review was created by AI and reviewed by human editors.