Learning Accurate Integer Transformer Machine-Translation Models


SN Computer Science (2021) 2:291
https://doi.org/10.1007/s42979-021-00688-4

ORIGINAL RESEARCH

Learning Accurate Integer Transformer Machine-Translation Models

Ephrem Wu

Received: 13 June 2020 / Accepted: 10 May 2021
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2021

Abstract

We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer (INT8) hardware matrix multipliers, as opposed to the more costly single-precision floating-point (FP32) hardware. Unlike previous work, which converted only 85 Transformer matrix multiplications to INT8, leaving 48 out of 133 of them in FP32 because of unacceptable accuracy loss, we convert them all to INT8 without compromising accuracy. Tested on the newstest2014 English-to-German translation task, our INT8 Transformer Base and Transformer Big models yield BLEU scores that are 99.3–100% relative to those of the corresponding FP32 models. Our approach converts all matrix-multiplication tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training. To demonstrate the robustness of this approach, we also include results from INT6 Transformer models.

Keywords Machine learning · Machine translation · Transformer · Low-precision inference · Natural language processing

Introduction

We report a method for training accurate yet compact Transformer machine-translation inference models [21]. Specifically, we aim these models at inference hardware with 8-bit integer (INT8) matrix multipliers. Compared to single-precision floating-point (FP32) matrix multiplications, INT8 matrix multiplications not only reduce both storage and bandwidth requirements fourfold but also consume 15 times less energy [9].
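The fourfold storage reduction cited above can be illustrated with a minimal sketch. It assumes a symmetric, range-preserving quantizer in which the largest magnitude in a tensor maps to 127. The function names `quantize_int8` and `dequantize` and the random stand-in weights are hypothetical and are not taken from the paper's code.

```python
from array import array
import random


def quantize_int8(values):
    """Symmetric, range-preserving quantization (an assumed scheme for illustration).

    The largest magnitude maps to 127, so no value is clipped.
    Returns the INT8 codes and the scale needed to recover approximate values.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = array("b", (int(round(v / scale)) for v in values))  # 1 byte per code
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


random.seed(0)
# Stand-in FP32 weights; array("f") stores each value as a C float.
weights_fp32 = array("f", (random.gauss(0.0, 0.05) for _ in range(1024)))
codes, scale = quantize_int8(weights_fp32)

fp32_bytes = len(weights_fp32) * weights_fp32.itemsize
int8_bytes = len(codes) * codes.itemsize
max_error = max(abs(w - r) for w, r in zip(weights_fp32, dequantize(codes, scale)))

print("FP32 bytes:", fp32_bytes)
print("INT8 bytes:", int8_bytes)
print("max round-trip error:", max_error, "half step:", scale / 2)
```

Each INT8 code occupies one byte instead of four, and the round-trip error stays within half a quantization step. At the same ratio, the 61 million parameters of Transformer Base would need roughly 61 MB instead of 244 MB, ignoring the small overhead of storing scale factors.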
We, therefore, have two goals: (1) to convert all matrix multiplications from FP32 to INT8 for inference, thereby reducing parameter and non-parameter tensor sizes fourfold, and (2) to maintain translation accuracy relative to the FP32 model.

The Transformer model has proven to be a powerful attention-based model for machine translation [1], and has inspired much follow-up research in this area. For example, the term “Transformer” appears 56 times in the findings of the 2018 Conference on Machine Translation (WMT18) [6] and 105 times in WMT19 [2]. The Transformer model in [21] comes in two forms: Transformer Base, which has 61 million parameters, and Transformer Big, which has 210 million parameters. These model sizes are large compared to the convolutional neural network benchmark ResNet-50 [8], a 25M-parameter model, but still small compared to OpenAI GPT-2 models [16], a family of Transformer models with 117M, 345M, 774M, and 1.6B parameters. As of this writing (January 2020), Transformer model sizes show no signs of reduction. For instance, [10] and [17] report multi-lingual machine-translation results for even larger Transformer models with up to 11 billion parameters. It is, therefore, useful to explore techniques for reducing inference parameter representation costs rather than holding out for smaller Transformer architectures.

Author affiliation: Ephrem Wu, Xilinx, Inc., San Jose, USA

Related Work

We draw inspiration from three papers on converting FP32 machine-translation models into INT8 models for inference. The first paper, [23], was published before the Transformer [21]. The authors of this paper quantized only parameter tensors, but not non-parameter tensors, in a range-preserving fashion. The second paper, [13], reported INT8 Transformer Big English-to-German translation results but did not report any results for INT8 Transformer Base.
The third paper, [4], did the opposite: it reported INT8 Transformer Base English-to-German translation results but none for Transformer Big. Because the last two papers both study INT8 Transformer models, we compare their reported BLEU scores to ours in Table 1. Note that although our FP32 baseline models have higher BLEU scores than [4, 13], our INT8 models exhibit less degradation in BLEU scores in absolute terms, and, therefore, in relative terms as well. The BLEU score of our FP32 Transformer Base model (27.8) is quite close to that in [4] (27.7) because we both used the same tensor2tensor Transformer code [20]. However, the BLEU score of our FP32 Transformer Big model (29.6) is much higher than that in [13] (28.1). We suspect that our higher BLEU score is due to a larger beam size of four during beam search [19] at inference time (the default in the tensor2tensor code), while the BLEU score in [13] is based on a beam size of one. Because of these modeling differences, and possibly others that various authors did not disclose, we suggest focusing on the BLEU score differences between FP32 baseline models and their INT8 derivatives, in other words, the last two columns in Table 1.

Table 1 English-to-German translation BLEU scores for FP32 (baseline) vs. INT8 Transformer models using newstest2014

Neural translation model        BLEU score                Difference
                                FP32 model   INT8 model   Points   Relative (%)
Transformer Base [4]            27.7         27.3         −0.4     −1.44
Transformer Base (This Work)    27.8         27.8          0.0      0.00
Transformer Big [13]            28.1         27.5         −0.6     −2.14
Transformer Big (This Work)     29.6         29.5         −0.1     −0.34

Before the Transformer paper was published, [23] described a quantization-aware training method for an attention-based LSTM neural machine translator. They treated parameter tensors differently from non-parameter tensors. In particular, they quantized parameter tensors to INT8 in a range-preserving fashion. For non-parameters, they treated logits differently from other non-parameter tensors. In more detail, they clipped logits to [−25.0, 25.0] and other non-parameter tensors to [−8.0, 8.0], before annealing them to [−1.0, 1.0] by the end of training. In the LSTM module, matrix multiplication operands were 8-bit integers, and accumulators were 16-bit integers. All other operations in the LSTM module were 16-bit operations. The softmax function and the attention mechanism remained as floating-point operations. These authors reported an English-to-German translation BLEU score of 24.61 (newstest2014).

Similarly, we quantized parameters to 8-bit integers in a range-preserving manner. We attempted to clip floating-point non-parameter tensors before uniform quantization but observed that clipping after rounding was simpler and yielded accurate Transformer models. Furthermore, we did not manually select clipping ranges for non-parameters. Our training method automatically adjusts clipping ranges to make range-precision trade-offs.

Based on a literature overview, we believe that Microsoft’s Marian team [12] was the first to publish newstest2014 English-to-German translation BLEU scores using integer Transformer models [13]. The FP32 Transformer Big model achieved a BLEU score of 28.1, and the (...truncated)