High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers (pdf)

Article PDF cannot be displayed. You can download it here:

https://link.springer.com/content/pdf/10.1007%2Fs10623-015-0087-1.pdf

High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers

High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers Michael Düll 0 1 3 4 5 6 7 Björn Haase 0 1 3 4 5 6 7 Gesine Hinterwälder 0 1 3 4 5 6 7 Michael Hutter 0 1 3 4 5 6 7 Christof Paar 0 1 3 4 5 6 7 Ana Helena Sánchez 0 1 3 4 5 6 7 Peter Schwabe 0 1 3 4 5 6 7 0 Michael Hutter 1 Michael Düll 2 Co. KG , Dieselstraße 24, 70839 Gerlingen , Germany 3 Digital Security Group, Radboud University , PO Box 9010, 6500 GL Nijmegen , The Netherlands 4 Cryptography Research , 425 Market Street, 11th Floor, San Francisco, CA 94105 , USA 5 Horst Görtz Institute for IT-Security, Ruhr-University Bochum , 44801 Bochum , Germany 6 Ana Helena Sánchez 7 Christof Paar This paper presents new speed records for 128-bit secure elliptic-curve DiffieHellman key-exchange software on three different popular microcontroller architectures. We B Peter Schwabe consider a 255-bit curve proposed by Bernstein known as Curve25519, which has also been adopted by the IETF. We optimize the X25519 key-exchange protocol proposed by Bernstein in 2006 for AVR ATmega 8-bit microcontrollers, MSP430X 16-bit microcontrollers, and for ARM Cortex-M0 32-bit microcontrollers. Our software for the AVR takes only 13,900,397 cycles for the computation of a Diffie–Hellman shared secret, and is the first to perform this computation in less than a second if clocked at 16 MHz for a security level of 128 bits. Our MSP430X software computes a shared secret in 5,301,792 cycles on MSP430X microcontrollers that have a 32-bit hardware multiplier and in 7,933,296 cycles on MSP430X microcontrollers that have a 16-bit multiplier. It thus outperforms previous constant-time ECDH software at the 128-bit security level on the MSP430X by more than a factor of 1.2 and 1.15, respectively. Our implementation on the Cortex-M0 runs in only 3,589,850 cycles and outperforms previous 128-bit secure ECDH software by a factor of 3. 1 Introduction A large and growing share of the world’s CPU market is formed by embedded microcontrollers. A surprisingly large number of embedded systems require security, e.g., electronic passports, smartphones, car-to-car communication and industrial control units. The continuously growing Internet of Things will only add to this development. It is of great interest to provide efficient cryptographic primitives for embedded CPUs, since virtually every security solution is based on crypto algorithms. Whereas symmetric algorithms are comparably efficient and some embedded microcontrollers even offer hardware support for them [3], asymmetric cryptography is notoriously computational intensive. Since the invention of elliptic-curve cryptography (ECC) in 1985, independently by Koblitz [26] and Miller [31], it has become the method of choice for many applications, especially in the embedded domain. Compared to schemes that are based on the hardness of integer factoring, most prominently RSA, and schemes based on the hardness of the discrete logarithm in the multiplicative group Zn∗, like the classical Diffie–Hellman key exchange or DSA, ECC offers significantly shorter public keys, faster computation times for most operations, and an impressive security record. For suitably chosen elliptic curves, the best attacks known today still have the same complexity as the best attacks known in 1985. Over the last one and half decade or so, various elliptic curves have been standardized for use in cryptographic protocols such as TLS. The most widely used standard for ECC are the NIST curves proposed by NSA’s Jerry Solinas and standardized in [33, Appendix D]. Various other curves have been proposed and standardized, for example the FRP256v1 curve by the French ANSSI [1], the Brainpool curves by the German BSI [30], or the SM2 curves proposed by the Chinese government [36]. It is known for quite a while that all of these standardized curves are not optimal from a performance perspective and that special cases in the group law complicate implementations that are at the same time correct, secure, and efficient. These disadvantages together with some concerns about how these curves were constructed—see, for example [10,37]—recently lead to increased interest in reconsidering the choice of elliptic curves for cryptography. As a consequence, in 2015 the IETF adopted two next-generation curves as draft internet standard for usage with TLS [25]. One of the promising next-generation elliptic curves now also adopted by the IETF is Curve25519. Curve25519 is already in use in various applications today and was originally proposed by Bernstein in 2006 [5]. Bernstein uses the Montgomery form of this curve for efficient, secure, and easy-to-implement elliptic-curve Diffie–Hellman key exchange. Originally, the name “Curve25519” referred to this key-exchange protocol, but Bernstein recently suggested to rename the scheme to X25519 and to use the name Curve25519 for the underlying elliptic curve [6]. We will adopt this new notation in this paper. Several works describe the excellent performance of this key-agreement scheme on large desktop and server processors, for example, the Intel Pentium M [5], the Cell Broadband Engine [13], ARM Cortex-A8 with NEON [7], or Intel Nehalem/Westmere [8,9]. 1.1 Contributions of this paper This paper presents implementation techniques of X25519 for three different, widely used embedded microcontrollers. All implementations are optimized for high speed, while executing in constant time, and they set new speed records for constant-time variable-base-point scalar multiplication at the 128-bit security level on the respective architectures. To some extent, the results presented here are based on earlier results by some of the authors. However, this paper does not merely collect those previous results, but significantly improves performance. Specifically, the software for the AVR ATmega family of microcontrollers presented in this paper takes only 13,900,397 cycles and is thus more than a factor of 1.6 faster than the X25519 software described by Hutter and Schwabe [23]. The X25519 implementation for MSP430Xs with 32-bit multiplier presented in this paper takes only 5,301,792 cycles and is thus more than a factor of 1.2 faster, whereas the implementation for MSP430Xs with 16-bit multiplier presented in this paper takes 7,933,296 cycles and is more than a factor of 1.15 faster than the software presented by Hinterwälder et al. [21]. Furthermore, this paper is the first to present a X25519 implementation optimized for the very widely used ARM Cortex-M0 architecture. The implementation requires only 3,589,850 cycles, which is a factor of 3 faster than the scalar multiplication on the NIST P-256 curve described by Wenger et al. [45]. 1.2 A note on side-channel protection All the software presented in this paper avoids secret-data-dependent branches and secretly indexed memory access and is thus inherently protected against timing attacks. Protection against power-analysis (and EM-analysis) attac (...truncated)