High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers
Michael Düll
Björn Haase
Gesine Hinterwälder
Michael Hutter
Christof Paar
Ana Helena Sánchez
Peter Schwabe
Co. KG, Dieselstraße 24, 70839 Gerlingen, Germany
Digital Security Group, Radboud University, PO Box 9010, 6500 GL Nijmegen, The Netherlands
Cryptography Research, 425 Market Street, 11th Floor, San Francisco, CA 94105, USA
Horst Görtz Institute for IT-Security, Ruhr-University Bochum, 44801 Bochum, Germany
This paper presents new speed records for 128-bit secure elliptic-curve Diffie–Hellman key-exchange software on three different popular microcontroller architectures. We
consider a 255-bit curve proposed by Bernstein known as Curve25519, which has also been
adopted by the IETF. We optimize the X25519 key-exchange protocol proposed by Bernstein
in 2006 for AVR ATmega 8-bit microcontrollers, MSP430X 16-bit microcontrollers, and for
ARM Cortex-M0 32-bit microcontrollers. Our software for the AVR takes only 13,900,397
cycles for the computation of a Diffie–Hellman shared secret, and is the first to perform
this computation in less than a second if clocked at 16 MHz for a security level of 128
bits. Our MSP430X software computes a shared secret in 5,301,792 cycles on MSP430X
microcontrollers that have a 32-bit hardware multiplier and in 7,933,296 cycles on MSP430X
microcontrollers that have a 16-bit multiplier. It thus outperforms previous constant-time
ECDH software at the 128-bit security level on the MSP430X by more than a factor of 1.2
and 1.15, respectively. Our implementation on the Cortex-M0 runs in only 3,589,850 cycles
and outperforms previous 128-bit secure ECDH software by a factor of 3.
1 Introduction
A large and growing share of the world’s CPU market is formed by embedded
microcontrollers. A surprisingly large number of embedded systems require security, e.g., electronic
passports, smartphones, car-to-car communication and industrial control units. The
continuously growing Internet of Things will only add to this development. It is of great interest to
provide efficient cryptographic primitives for embedded CPUs, since virtually every
security solution is based on cryptographic algorithms. Whereas symmetric algorithms are comparatively
efficient and some embedded microcontrollers even offer hardware support for them [3],
asymmetric cryptography is notoriously computationally intensive.
Since the invention of elliptic-curve cryptography (ECC) in 1985, independently by
Koblitz [26] and Miller [31], it has become the method of choice for many applications,
especially in the embedded domain. Compared to schemes that are based on the hardness of
integer factoring, most prominently RSA, and schemes based on the hardness of the discrete
logarithm in the multiplicative group $\mathbb{Z}_n^*$, like the classical Diffie–Hellman key exchange
or DSA, ECC offers significantly shorter public keys, faster computation times for most
operations, and an impressive security record. For suitably chosen elliptic curves, the best
attacks known today still have the same complexity as the best attacks known in 1985. Over
the last decade and a half or so, various elliptic curves have been standardized for use in
cryptographic protocols such as TLS. The most widely used standardized curves for ECC are the NIST
curves proposed by NSA’s Jerry Solinas and standardized in [33, Appendix D]. Various other
curves have been proposed and standardized, for example the FRP256v1 curve by the French
ANSSI [1], the Brainpool curves by the German BSI [30], or the SM2 curves proposed by
the Chinese government [36].
It has been known for quite a while that none of these standardized curves is optimal from a
performance perspective and that special cases in the group law complicate implementations
that are at the same time correct, secure, and efficient. These disadvantages, together with
some concerns about how these curves were constructed (see, for example, [10,37]), recently
led to increased interest in reconsidering the choice of elliptic curves for cryptography. As a
consequence, in 2015 the IETF adopted two next-generation curves as draft Internet standards
for usage with TLS [25]. One of the promising next-generation elliptic curves now also
adopted by the IETF is Curve25519. Curve25519 is already in use in various applications
today and was originally proposed by Bernstein in 2006 [5]. Bernstein uses the Montgomery
form of this curve for efficient, secure, and easy-to-implement elliptic-curve Diffie–Hellman
key exchange. Originally, the name “Curve25519” referred to this key-exchange protocol,
but Bernstein recently suggested renaming the scheme to X25519 and using the name
Curve25519 for the underlying elliptic curve [6]. We adopt this new notation in this
paper.
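To make the key-exchange function concrete, the following is a minimal Python sketch of X25519 as a Montgomery ladder on the u-coordinate. It follows the public RFC 7748 description rather than the optimized microcontroller code discussed in this paper; the names `x25519`, `decode_scalar`, `decode_u`, `cswap`, and `BASE` are chosen here for illustration only. Python big-integer arithmetic is not constant time, so the sketch shows the structure of the computation, not a secure implementation.

```python
P = 2**255 - 19   # prime defining the field of Curve25519
A24 = 121665      # (486662 - 2) / 4, the ladder constant used in RFC 7748
BASE = (9).to_bytes(32, "little")  # u-coordinate of the standard base point


def decode_scalar(k: bytes) -> int:
    """Clamp a 32-byte secret scalar as described in RFC 7748."""
    b = bytearray(k)
    b[0] &= 248
    b[31] &= 127
    b[31] |= 64
    return int.from_bytes(b, "little")


def decode_u(u: bytes) -> int:
    """Decode a 32-byte u-coordinate, ignoring the unused top bit."""
    b = bytearray(u)
    b[31] &= 127
    return int.from_bytes(b, "little")


def cswap(swap: int, a: int, b: int):
    """Swap a and b if swap == 1, using a mask instead of a branch."""
    mask = -swap              # 0 if swap == 0, all ones if swap == 1
    t = mask & (a ^ b)
    return a ^ t, b ^ t


def x25519(k: bytes, u: bytes) -> bytes:
    """Scalar multiplication on the u-coordinate (Montgomery ladder)."""
    scalar = decode_scalar(k)
    x1 = decode_u(u)
    x2, z2, x3, z3 = 1, 0, x1, 1
    swap = 0
    for t in range(254, -1, -1):
        k_t = (scalar >> t) & 1
        swap ^= k_t
        x2, x3 = cswap(swap, x2, x3)
        z2, z3 = cswap(swap, z2, z3)
        swap = k_t
        a = (x2 + z2) % P
        aa = a * a % P
        b = (x2 - z2) % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        d = (x3 - z3) % P
        da = d * a % P
        cb = c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    x2, x3 = cswap(swap, x2, x3)
    z2, z3 = cswap(swap, z2, z3)
    # Convert from projective (X : Z) to affine u = X / Z via Fermat inversion.
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")
```

In a key exchange, each party computes `x25519(own_secret, BASE)` as its public key and `x25519(own_secret, peer_public)` as the shared secret; both parties arrive at the same 32-byte value.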
Several works describe the excellent performance of this key-agreement scheme on large
desktop and server processors, for example, the Intel Pentium M [5], the Cell Broadband
Engine [13], ARM Cortex-A8 with NEON [7], or Intel Nehalem/Westmere [8,9].
1.1 Contributions of this paper
This paper presents implementation techniques of X25519 for three different, widely
used embedded microcontrollers. All implementations are optimized for high speed,
while executing in constant time, and they set new speed records for constant-time
variable-base-point scalar multiplication at the 128-bit security level on the respective
architectures.
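Constant-time execution here means avoiding secret-data-dependent branches and secretly indexed memory access. The following Python sketch illustrates the logic of two common building blocks under that constraint: a branch-free selection and a table lookup that touches every entry. The function names are hypothetical and not taken from the software described in this paper, and Python itself gives no timing guarantees; on a microcontroller the same logic would be written in C or assembly on fixed-width words.

```python
from typing import Sequence


def ct_eq(a: int, b: int) -> int:
    """Return 1 if a == b, else 0, for 0 <= a, b < 2**32, without branching."""
    x = (a ^ b) & 0xFFFFFFFF
    # x - 1 is negative only when x == 0; the arithmetic shift then yields -1.
    return ((x - 1) >> 32) & 1


def ct_select(bit: int, a: int, b: int) -> int:
    """Return a if bit == 1, else b, using a mask instead of an if-statement."""
    mask = -bit                      # 0 or all ones
    return b ^ (mask & (a ^ b))


def ct_lookup(table: Sequence[int], secret_index: int) -> int:
    """Read table[secret_index] while accessing every entry of the table.

    The memory access pattern is the same for every index, so the index
    does not leak through the sequence of addresses that are touched.
    """
    result = 0
    for i, entry in enumerate(table):
        mask = -ct_eq(i, secret_index)   # all ones only at the wanted entry
        result |= entry & mask
    return result
```

The lookup costs one pass over the whole table instead of a single load, which is the usual price for hiding the index.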
To some extent, the results presented here are based on earlier results by some of the
authors. However, this paper does not merely collect those previous results, but significantly
improves performance. Specifically, the software for the AVR ATmega family of
microcontrollers presented in this paper takes only 13,900,397 cycles and is thus more than a
factor of 1.6 faster than the X25519 software described by Hutter and Schwabe [23]. The
X25519 implementation for MSP430Xs with 32-bit multiplier presented in this paper takes
only 5,301,792 cycles and is thus more than a factor of 1.2 faster, whereas the
implementation for MSP430Xs with 16-bit multiplier presented in this paper takes 7,933,296 cycles
and is more than a factor of 1.15 faster than the software presented by Hinterwälder et al.
[21].
Furthermore, this paper is the first to present an X25519 implementation optimized for the
very widely used ARM Cortex-M0 architecture. The implementation requires only 3,589,850
cycles, which is a factor of 3 faster than the scalar multiplication on the NIST P-256 curve
described by Wenger et al. [45].
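As a quick sanity check of the AVR figures, the short Python sketch below converts the reported cycle count into wall-clock time at the 16 MHz clock mentioned in the abstract. The bound for the earlier software is not a measured number: it is only the lower limit implied by the statement "more than a factor of 1.6 faster". The variable and function names are chosen here for illustration.

```python
def cycles_to_seconds(cycles: float, clock_hz: float) -> float:
    """Convert a cycle count into seconds at a given clock frequency."""
    return cycles / clock_hz


AVR_CYCLES = 13_900_397      # X25519 on AVR ATmega, as reported in this paper
AVR_CLOCK_HZ = 16_000_000    # 16 MHz, the clock rate used in the abstract

avr_seconds = cycles_to_seconds(AVR_CYCLES, AVR_CLOCK_HZ)

# Implied lower bound only: "more than a factor of 1.6 faster" means the
# earlier software needed more than 1.6 times as many cycles.
prev_min_cycles = 1.6 * AVR_CYCLES
prev_min_seconds = cycles_to_seconds(prev_min_cycles, AVR_CLOCK_HZ)

print(f"this paper: {avr_seconds:.3f} s")
print(f"earlier software: more than {prev_min_seconds:.3f} s")
```

At 16 MHz the budget for one second is 16,000,000 cycles, so the new software stays below it at roughly 0.87 s, while the implied bound for the earlier software lies above it.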
1.2 A note on side-channel protection
All the software presented in this paper avoids secret-data-dependent branches and secretly
indexed memory access and is thus inherently protected against timing attacks. Protection
against power-analysis (and EM-analysis) attac (...truncated)