Learning Accurate Integer Transformer Machine-Translation Models
SN Computer Science
(2021) 2:291
https://doi.org/10.1007/s42979-021-00688-4
ORIGINAL RESEARCH
Ephrem Wu1
Received: 13 June 2020 / Accepted: 10 May 2021
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2021
Abstract
We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer
(INT8) hardware matrix multipliers, as opposed to the more costly single-precision floating-point (FP32) hardware. Unlike
previous work, which converted only 85 Transformer matrix multiplications to INT8, leaving 48 out of 133 of them in FP32
because of unacceptable accuracy loss, we convert them all to INT8 without compromising accuracy. Tested on the newstest2014 English-to-German translation task, our INT8 Transformer Base and Transformer Big models yield BLEU scores
that are 99.3–100% relative to those of the corresponding FP32 models. Our approach converts all matrix-multiplication
tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training.
To demonstrate the robustness of this approach, we also include results from INT6 Transformer models.
Keywords Machine learning · Machine translation · Transformer · Low-precision inference · Natural language processing
Introduction
We report a method for training accurate and yet compact
Transformer machine-translation inference models [21].
Specifically, we aim these models at inference hardware with
8-bit integer (INT8) matrix multipliers. Compared to single-precision floating-point (FP32) matrix multiplications,
INT8 matrix multiplications not only cut both storage
and bandwidth requirements fourfold, but also consume 15 times
less energy [9]. We, therefore, have two goals: (1) to convert
all matrix multiplications from FP32 to INT8 for inference,
thereby reducing parameter and non-parameter tensor sizes
fourfold, and (2) to maintain translation accuracy relative to
the FP32 model.
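As a rough illustration of the fourfold size reduction in goal (1), the sketch below estimates parameter storage at 4 bytes per FP32 value and 1 byte per INT8 value, using the Transformer Base and Transformer Big parameter counts quoted later in this section. The function name and the decimal-megabyte convention are illustrative assumptions, and the estimate ignores quantization scale factors and any other per-tensor metadata.

```python
# Bytes needed to store one element of each type.
BYTES_PER_ELEMENT = {"FP32": 4, "INT8": 1}


def tensor_megabytes(num_elements: int, dtype: str) -> float:
    """Storage in decimal megabytes for a tensor of the given element type."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / 1e6


# Parameter counts quoted in this paper for the two Transformer variants.
MODELS = {"Transformer Base": 61_000_000, "Transformer Big": 210_000_000}

for name, params in MODELS.items():
    fp32 = tensor_megabytes(params, "FP32")
    int8 = tensor_megabytes(params, "INT8")
    print(f"{name}: FP32 {fp32:.0f} MB, INT8 {int8:.0f} MB, ratio {fp32 / int8:.0f}x")
```

Under these assumptions, Transformer Base shrinks from about 244 MB to 61 MB of parameters and Transformer Big from about 840 MB to 210 MB.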
The Transformer model has proven to be a powerful
attention-based model for machine translation [1], and has
inspired much follow-up research in this area. For example,
the term “Transformer” appears 56 times in the findings of
the 2018 Conference on Machine Translation (WMT18) [6]
and 105 times in WMT19 [2]. The Transformer model in
[21] comes in two forms, Transformer Base, which has 61
million parameters, and Transformer Big, which has 210
million parameters. These model sizes are large compared to
the convolutional neural network benchmark ResNet-50 [8],
a 25M-parameter model, but still small compared to OpenAI
GPT-2 models [16], a family of Transformer models with
117M, 345M, 774M, and 1.6B parameters. As of this writing
(January 2020), Transformer model sizes show no signs of
reduction. For instance, [10] and [17] report multi-lingual
machine-translation results for even larger Transformer models with up to 11 billion parameters. It is, therefore, useful to
explore techniques for reducing inference parameter representation costs rather than holding out for smaller Transformer architectures.
1 Ephrem Wu, Xilinx, Inc., San Jose, USA
Related Work
We draw inspiration from three papers on converting FP32
machine-translation models into INT8 models for inference.
The first paper, [23], was published before the Transformer
[21]. Its authors quantized parameter tensors, but not
non-parameter tensors, in a range-preserving
fashion. The second paper, [13], reported INT8 Transformer
Big English-to-German translation results but did not report
any results for INT8 Transformer Base. The third paper, [4],
did the opposite: it reported INT8 Transformer Base English-to-German translation results but not for Transformer
Big. Because the last two papers both study INT8 Transformer models, we compare their reported BLEU scores
to ours in Table 1 (newstest2014 English-to-German BLEU scores for FP32 baseline vs. INT8 Transformer models). Note that although our FP32 baseline
models have higher BLEU scores than [4, 13], our INT8
models exhibit less degradation in BLEU scores in absolute
terms, and, therefore, in relative terms as well. The BLEU
score in our FP32 Transformer Base model (27.8) is quite
close to that in [4] (27.7) because we both used the same
tensor2tensor Transformer code [20]. However, the
BLEU score of our FP32 Transformer Big model (29.6) is
much higher than that in [13] (28.1). We suspect that our
higher BLEU score is due to a larger beam size of four during beam search [19] at inference time (the default in the
tensor2tensor code), whereas the BLEU score in [13] is
based on a beam size of one. Because of these modeling
differences, and possibly others that various authors did not
disclose, we suggest focusing on the BLEU score differences
between FP32 baseline models and their INT8 derivatives,
in other words, the last two columns in Table 1.
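The two difference columns of Table 1 can be recomputed from the FP32 and INT8 BLEU scores. The sketch below is a minimal check, assuming the relative difference is the point difference divided by the FP32 score and rounded to two decimals; the function name `bleu_delta` is illustrative and does not come from the paper.

```python
def bleu_delta(fp32_bleu: float, int8_bleu: float) -> tuple:
    """Return (absolute points, relative percent) of INT8 vs. FP32 BLEU."""
    points = int8_bleu - fp32_bleu
    relative = 100.0 * points / fp32_bleu
    return round(points, 1), round(relative, 2)


# (FP32 BLEU, INT8 BLEU) pairs as reported in Table 1.
TABLE_1 = {
    "Transformer Base [4]": (27.7, 27.3),
    "Transformer Base (This Work)": (27.8, 27.8),
    "Transformer Big [13]": (28.1, 27.5),
    "Transformer Big (This Work)": (29.6, 29.5),
}

for model, (fp32, int8) in TABLE_1.items():
    points, relative = bleu_delta(fp32, int8)
    print(f"{model}: {points:+.1f} points, {relative:+.2f}%")
```

With these inputs the recomputed values agree with the table: the largest drop is 0.6 points (2.14%) for Transformer Big in [13], and the smallest is 0.0 points for Transformer Base in this work.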
Before the Transformer paper was published, [23]
described a quantization-aware training method for an
attention-based LSTM neural machine translator. They
treated parameter tensors differently from non-parameter
tensors. In particular, they quantized parameter tensors to
INT8 in a range-preserving fashion. For non-parameters,
they treated logits differently from other non-parameter tensors. In more detail, they clipped logits to [−25.0, 25.0] and
other non-parameter tensors to [−8.0, 8.0], before annealing
them to [−1.0, 1.0] by the end of training. In the LSTM
module, matrix multiplication operands were 8-bit integers,
and accumulators were 16-bit integers. All other operations
in the LSTM module were 16-bit operations. The softmax
function and the attention mechanism remained as floating-point operations. These authors reported an English-to-German translation BLEU score of 24.61 (newstest2014).
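To make the clip-range annealing described for [23] concrete, the sketch below shrinks a symmetric clipping bound from 8.0 to 1.0 over the course of training. The text above does not state the annealing schedule used in [23], so the linear schedule, the step counts, and the function names here are assumptions made purely for illustration.

```python
def annealed_clip_bound(step: int, total_steps: int,
                        start: float = 8.0, end: float = 1.0) -> float:
    """Linearly interpolate the clipping bound from start to end.

    The linear schedule is an assumption; [23] may use a different one.
    """
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def clip(values, bound):
    """Clamp every value to the symmetric range [-bound, bound]."""
    return [min(max(v, -bound), bound) for v in values]


# Hypothetical non-parameter (activation) values.
activations = [-9.5, -0.3, 2.0, 12.0]

for step in (0, 500, 1000):
    bound = annealed_clip_bound(step, total_steps=1000)
    print(step, bound, clip(activations, bound))
```

The manually chosen start and end bounds in this sketch are exactly the kind of hand-selected ranges that a training method with automatically adjusted clipping ranges avoids.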
Similarly, we quantized parameters to 8-bit integers in a
range-preserving manner. We attempted to clip floating-point non-parameter tensors before uniform quantization
but observed that clipping after rounding was simpler and
yielded accurate Transformer models. Furthermore, we did
not manually select clipping ranges for non-parameters.
Our training method automatically adjusts clipping ranges
to make range-precision trade-offs.
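The two quantization styles mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the training procedure itself: it assumes symmetric uniform quantization onto the integers [−127, 127] for INT8, a range-preserving scale derived from the largest parameter magnitude, and a fixed, hypothetical `scale` for the non-parameter tensor, whereas in the method described here the clipping range is adjusted automatically during training. All function names are illustrative.

```python
def quantize_range_preserving(weights, num_bits=8):
    """Symmetric uniform quantization whose scale covers the largest magnitude.

    No value is clipped, so the full parameter range is preserved.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8, 31 for INT6
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def quantize_round_then_clip(values, scale, num_bits=8):
    """Round to the integer grid first, then clip to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [min(max(round(v / scale), -qmax), qmax) for v in values]


# Hypothetical parameter values: the largest magnitude maps to +/-127.
weights = [-0.5, 0.1, 0.2, 0.5]
q_weights, w_scale = quantize_range_preserving(weights)
print(q_weights, w_scale)

# Hypothetical activations with an assumed fixed scale of 0.01:
# values beyond +/-1.27 saturate, trading range for finer precision.
activations = [-3.0, 0.04, 0.26, 2.0]
print(quantize_round_then_clip(activations, scale=0.01))
```

A smaller `scale` gives finer resolution near zero but saturates more large values, and a larger one does the opposite; this is the range-precision trade-off that the clipping range controls.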
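Table 1 English-to-German translation BLEU scores for FP32 (baseline) vs. INT8 Transformer models using newstest2014
Neural translation model        BLEU score                  Difference
                                FP32 model    INT8 model    Points    Relative (%)
Transformer Base [4]            27.7          27.3          −0.4      −1.44
Transformer Base (This Work)    27.8          27.8           0.0       0.00
Transformer Big [13]            28.1          27.5          −0.6      −2.14
Transformer Big (This Work)     29.6          29.5          −0.1      −0.34
Based on a literature overview, we believe that Microsoft’s Marian team [12] was the first to publish newstest2014
English-to-German translation BLEU scores using integer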
Transformer models [13]. The FP32 Transformer Big model
achieved a BLEU score of 28.1, and the (...truncated)