1 Introduction

Elliptic Curve Cryptography (ECC) has become the de facto standard for instantiating public key cryptography, with security based on the conjectured exponential hardness of solving discrete logarithms over elliptic curve groups (the ECDLP). Scalar multiplication, in particular for the unknown-point scenario, is the most expensive operation in cryptographic protocols with security guarantees based on the ECDLP. Since its introduction in 1985, there has been plenty of research on finding efficient and secure implementation strategies and on choosing optimal parameters to improve performance [3, 6, 22].

An early milestone in this research was the idea due to Gallant-Lambert-Vanstone (GLV) [14] of exploiting efficient endomorphisms to accelerate scalar multiplication. In the large characteristic case with the curve \(E : y^2 = x^3 + b\) defined over \(\mathbb {F}_p\) for prime p, it initially manifested as evaluating \(\psi : (x,y) \rightarrow (\beta x, y)\) for \(\beta \) a non-trivial cube root of unity. The technique was later generalized to Galbraith–Lin–Scott (GLS) curves defined over \(\mathbb {F}_{p^2}\), and to exploit two or more endomorphisms as in the Four\(\mathbb {Q}\) curve [25] and the genus-2 case [5]. In the characteristic 2 case, Koblitz curves are the classical example of curves equipped with endomorphisms [29]; a class later extended to include binary GLS curves [17].

Beyond performance, implementation security is also relevant, especially on embedded targets where side-channel attacks are more feasible. The classical countermeasure against these attacks is to formulate the arithmetic in a regular way, such as constant-time implementation against timing attacks. This translates to removing secret-dependent branches and memory accesses, and employing complete point addition formulas without corner cases [30]. Additional care needs to be taken for a correct and secure implementation when exploiting endomorphisms, such as using the GLV-SAC recoding technique [11] or by explicitly proving correctness of the specific scalar multiplication algorithm [6, 10].

In this paper, we revisit the implementation of scalar multiplication in binary GLS curves at the 128-bit security level by improving efficiency and correctness of implementations of the GLS254 curve. Our contributions are:

  • A variant of the GLS scalar multiplication algorithm that changes the computation/storage trade-off to use a 2D table. The resulting 2DT algorithm spends more precomputation to reduce point additions in the main loop.

  • The first proof of correctness for the GLS scalar multiplication algorithm with the \(\lambda \)-projective group law from [28]. We show that there are no corner cases in the main loop of the algorithm (with the possible exception of the last iteration), which enables the use of faster incomplete formulas.

  • Faster dedicated formulas to reduce the cost of precomputation, and a table compression technique that exploits the endomorphism to reduce storage.

  • A refined scalar decomposition algorithm that can be easily implemented in constant time. The algorithm has robust parity and length guarantees that fill some gaps in previous works [28].

  • An efficient formulation of arithmetic in \(\mathbb {F}_{2^{254}}\) targeting Arm processors. The field arithmetic uses the interleaved representation proposed informally in the CHES’16 Rump Session [27]; we take the opportunity to include this formulation in the formal research literature. Furthermore, this also closes affirmatively a question posed in [21] about the efficiency of binary curves in Armv8 processors, deemed “unclear”.

With these contributions, we obtain speed records in the 64-bit Armv8 and Intel platforms, improving on previous results by 34.5% and 6%, respectively. While the latter speedup may seem small, we remark that it comes after decades of successful research in improving performance of ECC, so diminishing returns are expected. Due to the upcoming move to post-quantum cryptography, these techniques could be of limited practicality, but we believe they are relevant for applications not necessarily needing long-term security and involving the computation of many scalar multiplications, such as private set intersection protocols [31]. The proof and table compression technique may find further application in accelerating GLV/GLS scalar multiplication in pairing-based cryptography.

The rest of the document is organized as follows. Section 2 discusses preliminaries on binary GLS curves and their efficient implementation. Section 3 introduces the 2DT-GLS algorithm and its correctness proof. Section 4 pushes these ideas further by presenting the scalar decomposition algorithm, followed by dedicated formulas in Sect. 5. Section 6 discusses the implementation of field arithmetic in Armv8, with experimental results in Sect. 7. The interested reader will also find a treatment of point compression for binary GLS curves in the Appendix of the full version [1], together with detailed formulas.

2 Preliminaries

An ordinary binary elliptic curve in Weierstrass form is defined as

$$\begin{aligned} E/\mathbb {F}_q : y^2 +xy = x^3 + ax^2 + b \end{aligned}$$
(1)

with \(q=2^m\) and coefficients \(a,b\in \mathbb {F}_q, \ b \ne 0\). For any extension field \(\mathbb {F}_{q^k}\), the points \(P=(x,y) \in \mathbb {F}_{q^k} \times \mathbb {F}_{q^k}\) that satisfy the equation form an abelian group \(E_{a,b}(\mathbb {F}_{q^k})\) together with a point at infinity \(\infty \), which acts as the identity. The group law is denoted with additive notation \(P+Q\), such that the scalar multiplication operation is written as kP.

2.1 Binary GLS Curves

In the interest of defining notation for later use, we briefly summarize the theory of binary GLS curves from [17].

Let E be an ordinary binary curve as defined previously. From Hasse’s theorem, \(\#E(\mathbb {F}_q) = q+1-t\) for some trace t satisfying \(|t|\le 2\sqrt{q}\). Pick some \(a'\in \mathbb {F}_{q^2}\) with Tr\('(a')=1\), where Tr\('\) is the field trace from \(\mathbb {F}_{q^2}\) to \(\mathbb {F}_2\) defined as Tr\('(c)=\sum _{i=0}^{2m-1} c^{2^i}\). It can be shown that \(E'=E_{a',b}\) is the quadratic twist of E over \(\mathbb {F}_{q^2}\) with \(\#E'(\mathbb {F}_{q^2})=(q-1)^2 + t^2\), and that E and \(E'\) are isomorphic over \(\mathbb {F}_{q^4}\) under an involutive twisting isomorphism \(\phi \).

An endomorphism \(\psi \) over \(\mathbb {F}_{q^2}\) can be constructed for \(E'\) by composing \(\phi \) with the q-power Frobenius map \(\pi \) as \(\psi = \phi \pi \phi ^{-1}\). Evaluating \(\psi \) over points \(P \in E'(\mathbb {F}_{q^2})\) only requires field additions [28].

In this scenario, the curve \(E'\) is referred to as a binary GLS curve. Assume \(\#E'(\mathbb {F}_{q^2})=hr\) where h is a small cofactor and r is prime. Let \(\mathcal {G}\) be the unique order-r subgroup \(E'(\mathbb {F}_{q^2})[r]\). Then \(\psi \) restricted to \(\mathcal {G}\) has an eigenvalue \(\mu \in \mathbb {Z}\) so that \(\psi (P)=\mu P\) for all \(P\in \mathcal {G}\). The eigenvalue satisfies \(\mu ^2 \equiv -1 \pmod {r}\).

2.2 \(\lambda \)-Projective Coordinates for GLS Scalar Multiplication

In [28], Oliveira et al. introduced the \(\lambda \)-projective coordinate system. To date, it is empirically the most efficient point representation for binary elliptic curves. Given an affine point \(P=(x,y)\) with \(x\ne 0\), its \(\lambda \)-affine representation is \((x, \lambda )\) with \(\lambda = x + \frac{y}{x}\). In \(\lambda \)-projective coordinates, the affine point can be represented by any triple \((X,\varLambda , Z)\) with \(X=xZ\), \(\varLambda = \lambda Z\) and \(Z \ne 0\). The point at infinity can now be represented as \(\infty = (1, 1, 0)\). The curve equation in (1) is correspondingly transformed to

$$\begin{aligned} (\varLambda ^2 + \varLambda Z + aZ^2)X^2 = X^4 + bZ^4. \end{aligned}$$
(2)

The constant-time binary GLS scalar multiplication algorithm from [28] is included in Algorithm 1. It is a constant-time left-to-right double-and-add algorithm using \(\lambda \)-projective coordinates, combining the Joye-Tunstall regular recoding algorithm [20] with the GLV interleaving technique. For the GLV method, the scalar k is decomposed into two subscalars \(k_1, k_2\) such that \(k\equiv k_1 + k_2\mu \) (mod r) and the subscalars are of roughly half the length of k. The two smaller scalar multiplications can then be computed in an interleaved fashion to save half of the point doublings.

Algorithm 1
figure a

Constant-time scalar multiplication (Oliveira et al. [28])

The Joye-Tunstall regular recoding algorithm ensures the regular form of the algorithm and allows for a width-w windowing strategy. Specifically, the subscalars are recoded into \(\ell = \lceil m/(w-1) \rceil \) odd signed digits of \(w-1\) bits using Algorithm 6 from [10] (which is a constant-length modification of Algorithm 6 from [20]). By initially computing a table \(T[i]=(2i+1)P\) for all positive digits (in a phase known as the precomputation), the main loop can process the scalars one digit at a time, reducing the number of iterations by a factor \(w-1\). To be resistant against (cache-)timing attacks, each lookup requires a linear pass over the entire table, and there can be no branches dependent on \(c_j,k_j\). An additional consideration is that the regular recoding algorithm requires the subscalars to be odd. To ensure this, the subscalars are modified to be odd in line 3, and at the end the result is corrected at the cost of two point additions.
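The recoding step can be sketched in Python. This is a minimal sketch of Joye-Tunstall-style regular recoding under our own naming, not the exact Algorithm 6 of [10]; in particular, the fixed-length and constant-time details of that algorithm are omitted.

```python
def regular_recode(k, w, ell):
    """Recode an odd scalar k into ell odd signed digits of w-1 bits,
    so that k == sum(d[i] * 2**(i*(w-1))).  Sketch only: not the exact
    constant-length Algorithm 6 of [10]."""
    assert k % 2 == 1, "regular recoding requires an odd scalar"
    digits = []
    for _ in range(ell - 1):
        d = (k % 2**w) - 2**(w - 1)   # odd digit in [-(2^(w-1)-1), 2^(w-1)-1]
        digits.append(d)
        k = (k - d) >> (w - 1)        # the remaining quotient stays odd
    digits.append(k)                   # final (most significant) digit
    return digits
```

Every emitted digit is odd, which is why a table of odd multiples \((2i+1)P\) suffices for the main loop.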

However, both [28] and the subsequent [27] suffer from a lack of rigor. First and foremost, no proof has been presented for correctness of the scalar multiplication algorithm. The \(\lambda \)-projective group law formulas are incomplete, so it could potentially fail in corner cases. It also relies on ad-hoc tricks for constant-time scalar decomposition, with no proof of correctness or length guarantees.

2.3 GLS254 and the Choice of Parameters

Previous works have benchmarked their implementation of scalar multiplication over a GLS curve specially crafted for efficiency at the 128-bit security level. For the GLS254 curve, one chooses \(m = 127\), such that the base field can be defined as \(\mathbb {F}_{q} \equiv \mathbb {F}_{2}[z]/(z^{127} + z^{63} + 1)\) and its quadratic extension as \(\mathbb {F}_{q^2} \equiv \mathbb {F}_{q}[u]/(u^2 + u + 1)\). The curve coefficients should be chosen to have minimal Hamming weight such that multiplying by them is as efficient as possible. We performed a parameter search that reproduced the curve chosen in [27]. By fixing \(a' = u\) and searching for the smallest i such that \(b = z^i + 1\) yields a curve of order 2r for prime r, we were able to confirm that \(i = 27\) is the smallest choice, giving a 254-bit r. This means that a multiplication by b can be computed with a single shifted addition.
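To illustrate the shifted addition, base-field arithmetic can be sketched in Python by modelling polynomials over \(\mathbb {F}_2\) as integer bit vectors; the helper names (`clmul`, `reduce127`, `fmul`, `mul_by_b`) are ours, not the paper's.

```python
M = 127
MASK = (1 << M) - 1

def clmul(a, b):
    """Carry-less (GF(2)[z]) multiplication of bit-vector polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce127(c):
    """Reduce modulo f(z) = z^127 + z^63 + 1, using z^127 = z^63 + 1."""
    while c >> M:
        h = c >> M
        c = (c & MASK) ^ h ^ (h << 63)
    return c

def fmul(a, b):
    """Multiplication in F_{2^127}."""
    return reduce127(clmul(a, b))

def mul_by_b(a):
    """Multiplication by the curve coefficient b = z^27 + 1:
    one shifted addition followed by reduction."""
    return reduce127(a ^ (a << 27))
```

The shift count 27 is exactly the exponent i found in the parameter search above.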

To protect against Weil descent and generalized Gaudry-Hess-Smart (gGHS) attacks [15, 18], several precautions must be taken. We pick m to be prime, as is the case for GLS254. In addition, the choice of b must be verified to not allow the attack, which happens with negligibly small probability for random b [17]. We used the MAGMA implementation of [8] available at https://github.com/JJChiDguez/gGHS-check to clear our particular choice. This particular check, together with the curve generation method geared towards efficiency, satisfies rigidity concerns [4]. We stress that the ECDLP in binary curves remains infeasible for the parameter range used in this work [12].

3 Scalar Multiplication in GLS Curves

In this section, we begin by presenting a new scalar multiplication variant for binary GLS curves. It combines the Shamir-Straus trick [9] for multiple scalar multiplication with a new table compression technique using the GLS endomorphism. We refer to it as the 2DT variant because it builds a two-dimensional table \(T[i,j]=iP+j\psi (P)\) instead of \(T[i]=iP\). Then, in Subsect. 3.2, we prove that the GLS scalar multiplication algorithms are exception-free.

3.1 The 2DT Variant

As in some fast variants of the Shamir-Straus trick [9] for multiple scalar multiplication, the idea is to precompute \(T[i,j]=iP + j \psi (P)\) for odd i, j. In the scalar multiplication loop, we then save roughly one point addition per iteration of the main loop by computing \(2Q+T[i,j]\) instead of \(2Q+T[i]+\psi (T[j])\).

This method was previously deemed noncompetitive due to the blowup in the size of the table. Because the subscalar regular recoding uses signed digits, we need to efficiently retrieve \(s_1iP + s_2j\psi (P)\) for any \(i, j \in \{1, ..., 2^{w-1}-1\}\) and sign combination \(s_1, s_2 \in \{\pm 1\}\). The standard approach would be to build a table of \(iP\pm j\psi (P)\) and then use conditional negations to get the two other combinations. The 2D table would then store \(2^{2(w-2)+1}\) points. Even with specialized formulas for the precomputation, the cost in terms of storage and field operations is too high compared to the conventional 1DT algorithm.

The crucial new observation is that the efficiently computable GLS endomorphism \(\psi \) can also be used to compress the 2D table by a factor of 2. As \(\psi ^2(P)=-P\) for any \(P\in \mathcal {G}\), we obtain the identity

$$\begin{aligned} -\psi (T[j,i]) = iP - j\psi (P). \end{aligned}$$

It implies that we can generate all combinations from a table that only stores \(iP+j\psi (P)\) for positive i, j. The rest of the combinations can be efficiently retrieved using conditional negations and conditional applications of \(\psi \).
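The compression trick can be illustrated on a toy model in Python, where the group is \(\mathbb {Z}/r\mathbb {Z}\) and \(\psi \) acts as multiplication by \(\mu \) with \(\mu ^2 \equiv -1 \pmod r\); the parameters (r = 13, \(\mu = 5\)) and names are chosen purely for illustration and are not GLS254 values.

```python
r, mu = 13, 5             # toy group Z/13Z; mu**2 % 13 == 12, i.e. -1 (mod 13)
g = 2                     # toy "base point" P, modelled as an integer mod r

def psi(x):
    """The endomorphism acts on the toy group as multiplication by mu."""
    return (x * mu) % r

w = 3
odd = range(1, 2**(w - 1), 2)                       # odd digits {1, 3}
T = {(i, j): (i * g + j * psi(g)) % r for i in odd for j in odd}

def lookup(s1, i, s2, j):
    """Retrieve s1*i*P + s2*j*psi(P) from the half-size table, using
    -psi(T[j,i]) = i*P - j*psi(P) plus conditional negations."""
    if s1 == s2:
        return (s1 * T[(i, j)]) % r                 # +/-(iP + j psi(P))
    return (-s1 * psi(T[(j, i)])) % r               # +/-(iP - j psi(P))
```

In the real implementation the conditional negation and conditional application of \(\psi \) are performed in constant time over \(\lambda \)-coordinates rather than integers.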

This compression trick not only halves the amount of precomputation needed, but also halves the time needed to do a linear pass through the table in the main loop. With new specialized group law formulas for the precomputation (see Sect. 5), the 2DT algorithm is able to compete for the constant-time scalar multiplication speed record (see Sect. 7). The 2DT variant is presented in Algorithm 2. For \(w=2\), the only difference is that a complete formula must be used for \(2Q+P_1\) at \(i=1\) as well.

Algorithm 2
figure b

Constant-time 2DT scalar multiplication

3.2 Proof of Exception-Free Scalar Multiplication

We will now prove that the scalar multiplication algorithms presented here and in [28] (with a minor modification) are correct on all valid inputs. The core issue is that the underlying \(\lambda \)-projective group law formulas from [28] are not complete, meaning that they output the wrong result in some corner cases. Without these exceptions, correctness would be trivial.

One could explicitly handle these exceptions in constant time using complete formulas, but this would come at a high performance cost. Here, we prove that exceptional cases can only occur in the last iteration(s) of the main loop. By using complete formulas at the very end, correctness is ensured at only a minor performance penalty.

For clarity, the proof will be tailored to the 2DT algorithm. However, it can be easily adapted to the 1DT algorithm. The proof can be seen as a two-dimensional extension of the argument from Proposition 1 in [6]. We will for now assume that the scalar decomposition produces subscalars of bit-length at most \(m+1\), and defer the discussion about how to achieve this to Sect. 4. The proof crucially relies on the structure of the lattice discussed in [13, 14] that emerges in the GLV method for scalar decomposition:

$$\begin{aligned} \mathcal {L} = \{(x,y) \in \mathbb {Z}^2 : x + y\mu \equiv 0 \ (\text {mod } r)\}. \end{aligned}$$

Here r is the large prime order of \(\#E'(\mathbb {F}_{q^2})=hr\) and \(\mu \) the eigenvalue of \(\psi \) restricted to \(\mathcal {G}\). For our purposes, it is very useful to think about \(\mathcal {L}\) as the lattice of decompositions of zero (as done in [10]).

Our proof requires that the norm of the shortest non-zero vector in \(\mathcal {L}\) is at least \((q-1)/\sqrt{2}\). The structure of \(\mathcal {L}\) is determined by the order of \(E'(\mathbb {F}_{q^2})\). The bound required on the shortest norm might not be obtainable in general, so we focus on the subclass of GLS curves most relevant for cryptography. As the (affine) point (0, 0) is always in \(E'(\mathbb {F}_{q^2})\), 2|h. To ensure that the discrete log problem in \(\mathcal {G}\) is as hard as possible, the optimal choice is to pick a curve with \(h=2\), which is easy in practice. Restricting our proof to this subclass of GLS curves, we can give an explicit solution to the SVP in \(\mathcal {L}\).

Lemma 1

Let \(q=2^m\). Let \(E'\) be a binary GLS curve with \(\#E'(\mathbb {F}_{q^2}) = 2r\) for an odd prime r. Let \(E/\mathbb {F}_q\) be the curve such that \(E'\) over \(\mathbb {F}_{q^2}\) is the quadratic twist of \(E(\mathbb {F}_{q^2})\). Define

$$\begin{aligned} v_1 = \Big (\frac{(q-1)+t}{2}, \frac{(q-1)-t}{2} \Big ) \text { and }v_2 = \Big ( \frac{(q-1)-t}{2}, \frac{-(q-1)-t}{2}\Big ), \end{aligned}$$

where t is the trace of the q-th power Frobenius endomorphism on E. Then \(v_1,v_2\) form an orthogonal basis for the lattice \(\mathcal {L}\). \(\Vert v_1 \Vert = \Vert v_2 \Vert = \sqrt{r} = \min _{v\in \mathcal {L}\setminus \{0\}} \Vert v \Vert \).

Proof

Smith gives in Sects. 6 and 4 of [32] the basis \(v'_1 = (q-1, -t)\), \(v_2\) for \(\mathcal {L}\). However, it is not orthogonal and \(\Vert v'_1 \Vert =\sqrt{2r}\). The latter follows from \(\#E'(\mathbb {F}_{q^2})=(q-1)^2 + t^2\) (see Theorem 2 in [13]). We make the small adjustment of replacing \(v'_1\) with \(v_1 = v'_1 - v_2\). It can easily be checked that the basis \(v_1, v_2\) is orthogonal and that \(\Vert v_1 \Vert = \Vert v_2 \Vert = \sqrt{r}\).    \(\square \)
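The lemma can be checked numerically on a small example. The following Python snippet uses a toy parameter set, not GLS254: \(q = 2^5\) and trace \(t = 9\), so that \((q-1)^2 + t^2 = 1042 = 2\cdot 521\) with r = 521 prime; the brute-force SVP helper is ours.

```python
m, t = 5, 9                                    # toy parameters, not GLS254
q = 2**m
r = ((q - 1)**2 + t**2) // 2                   # 521, prime

v1 = ((q - 1 + t) // 2, (q - 1 - t) // 2)      # (20, 11)
v2 = ((q - 1 - t) // 2, (-(q - 1) - t) // 2)   # (11, -20)

# eigenvalue mu with v1[0] + v1[1]*mu = 0 (mod r), so that v1 lies in L
mu = (-v1[0] * pow(v1[1], -1, r)) % r

def shortest_norm_sq(mu, r, box):
    """Brute-force the SVP in L = {(x,y) : x + y*mu = 0 (mod r)}
    over the box |x|, |y| <= box."""
    return min(x * x + y * y
               for x in range(-box, box + 1)
               for y in range(-box, box + 1)
               if (x, y) != (0, 0) and (x + y * mu) % r == 0)
```

Requires Python 3.8+ for the modular inverse `pow(v1[1], -1, r)`.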

For the sake of proving scalar multiplication exception-free, the importance of Lemma 1 is showing that the minimal norm in \(\mathcal {L}\) is large. This is what enables the subsequent proof to succeed. Note that we also require m to be prime, which is needed for security reasons anyways.

Theorem 1

Let the notation be as in Lemma 1. Let \(m>4\) be prime and \(2\le w\le m\). Then Algorithm 2 is exception-free.

Proof

Let us start by identifying the exceptional cases of the \(\lambda \)-projective formulas. The formula for \(P+Q\) breaks down whenever \(P=\pm Q\), \(P=\infty \) or \(Q=\infty \). The point \(\infty \) does not have a \(\lambda \)-affine representation, so these last two cases are only a concern when the points are \(\lambda \)-projective. The 2P formula has no exceptional cases. Finally, the atomic formula for \(2Q+P\) breaks down when \(P = \pm 2Q\) or \(Q = \infty \).

We will argue that all the exceptions that can occur in Algorithm 2 encode an element of \(\mathcal {L}\). By this, we mean that they define some \(z_1, z_2 \in \mathbb {Z}\) such that \(z_1P + z_2\psi (P) = \infty \). Then \((z_1, z_2)\in \mathcal {L}\).

If \((z_1, z_2) \ne (0,0)\), we can show that either \(|z_1|\) or \(|z_2|\) must be at least an m-bit integer. \(v_1,v_2\) form an orthogonal basis of \(\mathcal {L}\), and are both a solution to the SVP in \(\mathcal {L}\) with norm \(\sqrt{r}\). Using the Hasse bound we get that \(\Vert v_1 \Vert , \Vert v_2 \Vert \ge (q-1)/\sqrt{2}\). Now assume for contradiction that \(|z_1|, |z_2| < q/2\). Then

$$\begin{aligned} \Vert (z_1, z_2) \Vert = \sqrt{z_1^2 + z_2^2} \le \sqrt{2\Big (\frac{q}{2}-1\Big )^2}=\frac{q-2}{\sqrt{2}} < \Vert v_1 \Vert , \Vert v_2 \Vert . \end{aligned}$$

This is a contradiction, since \(v_1, v_2\) have minimal norm in \(\mathcal {L} \setminus \{ {\textbf {0}} \}\). Thus, it must be the case that \(|z_1|\ge q/2\) or \(|z_2|\ge q/2\).

We now have all the tools needed to prove that no exceptions occur in Algorithm 2, and we will start with the precomputation stage. Assume that an exception did occur in the computation of \(iP +j\psi (P)\) for some odd \(i,j\in \{1, \ldots , 2^{w-1}-1\}\). Since \(P, \psi (P) \ne \infty \), there can only be an exception if \(iP = s j\psi (P)\) for some \(s\in \{\pm 1\}\). Then \((i, -sj)\in \mathcal {L} \setminus \{ {\textbf {0}} \}\). However, this is a contradiction, as neither \(|i|=i\) nor \(|-sj|=j\) is an m-bit integer.

Next is the main loop. Let \(Q_i=z_{1,i}P+z_{2,i}\psi (P)\) denote the value of Q after iteration i. No exception occurred in the precomputation stage, meaning Q is correctly initialized to \(Q_{\ell -1} = \overline{k}_{1,\ell -1}P + \overline{k}_{2,\ell -1}\psi (P)\). Then at iteration i,

$$\begin{aligned} Q_i=(2^{w-1}z_{1, i+1} + \overline{k}_{1,i})P + (2^{w-1}z_{2, i+1} + \overline{k}_{2,i})\psi (P). \end{aligned}$$

Observe that as long as no exceptions occur, we have the invariant that

$$\begin{aligned} 0<|z_{j,i+1}|\le |z_{j,i}|\le 2^{(\ell -i)(w-1)}-1 \text { for }j=1,2. \end{aligned}$$

Assume the first exception occurs at iteration i. The \(w-2\) doublings are exception-free, so the exception must have been caused by the computation of \(2Q_{i+1}+P_i\). \(Q_{i+1}\) cannot have been \(\infty \). This is because \(2Q\ne \infty \) for any \(Q \ne \infty \) and the incomplete \(2Q+P\) formula does not output \(\infty \) for any P, Q on the curve. Hence, the first exception must have occurred because \(2^{w-1}Q_{i+1}=sP_i\) for some \(s\in \{\pm 1\}\). This is equivalent to

$$\begin{aligned} (2^{w-1}z_{1,i+1}-s\overline{k}_{1,i})P + (2^{w-1}z_{2,i+1}-s\overline{k}_{2,i})\psi (P) = \infty \end{aligned}$$

Define \(z'_{j,i}=2^{w-1}z_{j,i+1}-s\overline{k}_{j,i}\). Notice that \(-s\overline{k}_{j,i}\) is a valid digit of the regular recoding. The invariants for \(|z_{j,i}|\) also hold for \(|z'_{j,i}|\). Hence, \((z'_{1,i}, z'_{2,i}) \in \mathcal {L} \setminus \{ {\textbf {0}} \}\).

Since m is prime, it holds for all \(2 \le w \le m\) that

$$\begin{aligned} 2^{(\lceil \frac{m}{w-1} \rceil -1)(w-1)} - 1 \le 2^{(\frac{m}{w-1} + \frac{w-2}{w-1}-1)(w-1)} -1 = 2^{m-1} -1 \end{aligned}$$

This means that \(|z'_{1,i}|\) and \(|z'_{2,i}|\) are of at most \(m-1\) bits while \(i\ge \ell - \lceil \frac{m}{w-1} \rceil + 1\). The implication is that the first exception could not have occurred in these iterations, so we must have that \(i\le \ell - \lceil \frac{m}{w-1} \rceil \).

For \(w=2\), this means that \(i\le 1\). For all other w, this means that \(i=0\). But these are exactly the iterations that use a complete formula for the computation of \(2Q_{i+1}+P_i\) for the respective values of w. Thus, it is impossible for the first exception to occur in these last iterations. The conclusion is that there can be no exception in the main loop.    \(\square \)

4 Scalar Decomposition with Parity and Length Guarantees

The GLV method for scalar decomposition needs a bit of care when required to run in constant time while preserving length guarantees. The GLV method uses a reduced basis \(\{u_1, u_2\}\) of some sublattice \(\mathcal {L}'\) of \(\mathcal {L}\) (see Sect. 3.2) to solve the closest vector problem (CVP) for (k, 0) in \(\mathcal {L}'\) using Babai rounding [14]. For a given basis, there exist unique constants \(N, \alpha _1, \alpha _2\in \mathbb {Z}\) such that

$$\begin{aligned} (k,0) = \beta _1u_1 + \beta _2u_2, \end{aligned}$$

where \(\beta _i = \frac{\alpha _i}{N}k\). The subscalars \(k_1, k_2\) are then computed as

$$\begin{aligned} (k_1, k_2) = (k,0) - b_1 u_1 - b_2 u_2, \end{aligned}$$

where \(b_i = \lfloor \beta _i \rceil \). The magnitude of the subscalars can then be bounded by some expression depending on the norm of the basis vectors.

The issue for constant-time implementations is the computation of the \(b_i\)’s. Ideally we would compute them using divisions, but unfortunately divisions do not run in constant time on most processors.

The standard solution was first introduced in [5] and further analyzed in [10]. The idea is to approximate the computation of the \(b_i\)’s using integer divisions by powers of 2, which can be implemented in constant time using shifts. Choose some integer d such that \(k < 2^{d}\), and precompute the constants \(c_i = \lfloor \frac{\alpha _i}{N} 2^d \rceil \). Then at runtime compute \(b_i\) as \(b'_i = \lfloor \frac{c_i k}{2^d} \rfloor \).

This approach introduces rounding errors. It was shown in Lemma 1 of [10] that \(b'_i\) will either be \(b_i\) or incorrectly rounded down to \(b_i-1\). This does not affect the correctness of the decomposition. However, it does negatively impact the bounds on \(|k_1|, |k_2|\). If the bounds become too loose, we might need more iterations of the main loop of the scalar multiplication to ensure correctness.
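The shift-based approximation can be sketched end-to-end in Python on the toy parameters from Lemma 1's setting (\(q = 2^5\), t = 9, r = 521, not GLS254). This sketch illustrates only the rounding technique and its one-sided error; it omits the parity correction of Algorithm 3, and a real implementation would use constant-time word arithmetic.

```python
# Toy GLS-style parameters: #E'(F_{q^2}) = (q-1)^2 + t^2 = 2 * 521.
m, t = 5, 9
q = 2**m
r = ((q - 1)**2 + t**2) // 2                   # 521, prime
v1, v2 = (20, 11), (11, -20)                   # orthogonal basis of L
mu = 235                                       # mu**2 % r == r - 1

# (k, 0) = beta1*v1 + beta2*v2 with beta_i = (alpha_i / N) * k, N = 2r
N = 2 * r
alpha1, alpha2 = (q - 1) + t, (q - 1) - t

d = 2 * m                                      # k < r < 2**d = q**2
c1 = (alpha1 * 2**d + N // 2) // N             # offline: round(alpha_i*2^d/N)
c2 = (alpha2 * 2**d + N // 2) // N

def decompose(k):
    """Babai rounding with the division replaced by a shift:
    b'_i = (c_i * k) >> d.  Sketch only: no parity fix."""
    b1, b2 = (c1 * k) >> d, (c2 * k) >> d
    return (k - b1 * v1[0] - b2 * v2[0], -b1 * v1[1] - b2 * v2[1])
```

The test below checks, over all k, that the decomposition is valid modulo r, that the subscalars stay below 2q, and that each \(b'_i\) is \(b_i\) or \(b_i - 1\).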

We present Algorithm 3 for constant-time scalar decomposition using the optimal basis from Lemma 1. In Lemma 2 we prove that it outputs subscalars of at most \(m+1\) bits. Without rounding errors it would be m, but the extra bit does not affect the number of iterations of the main loop (see Corollary 1).

In fact, not handling rounding errors leads to more efficient scalar multiplication. Using the fact that the rounding errors are one-sided, we introduce a new trick to ensure that \(k_1, k_2\) are always odd, without affecting the length guarantees. The two point additions needed to correct for even subscalars (see Subsect. 2.2) are no longer needed, leading to a simpler and faster algorithm. Note that this optimization applies to both 1DT and 2DT scalar multiplication.

Algorithm 3
figure c

Constant-time fixed-parity scalar decomposition for binary GLS curves with \(h=2\)

Lemma 2

Let the notation be as in Lemma 1 and assume that \(h=2\) and \(m>4\). Algorithm 3 on input \(k\in [1,r-1]\) outputs a valid decomposition \(k_1,k_2\). The subscalars are odd and \(|k_1|, |k_2|<2q\).

Proof

Let \(k_1\), \(k_2\) be the output of the GLV method on input k and let \(k'_1, k'_2\) be the output of Algorithm 3. We start with correctness. By definition, \((k,0) + \mathcal {L} \in \mathbb {Z}^2 / \mathcal {L}\) is the set of valid decompositions of k. Algorithm 3 produces \((k'_1, k'_2)\) by adding integer multiples of \(v_1\) and \(v_2\) to (k, 0). Hence, \((k'_1, k'_2) \in (k,0)+\mathcal {L}\).

Next, let’s bound the magnitude of the subscalars. The basis vectors are orthogonal with norm \(\sqrt{r}\). By the same argument as in Lemma 3 of [13], it follows that \(\Vert (k_1, k_2) \Vert \le \sqrt{r/2}\). To make the analysis independent of r, we can bound it from above as \(r \le (q+1)^2 /2\) using the Hasse bound. Then \(\Vert v_1 \Vert , \Vert v_2 \Vert \le (q+1)/\sqrt{2}\) and \(\Vert (k_1, k_2) \Vert \le (q+1)/2\).

It can be easily verified that \(\alpha _1, \alpha _2, N\) are specified such that \((k,0)=\beta _1 v_1 + \beta _2 v_2\). Since \(k<r\le (q+1)^2/2 \le q^2 \le 2^d\), it follows from Lemma 1 of [10] that \(b'_i\) is either \(b_i\) or incorrectly rounded down to \(b_i-1\).

Let \(r_i\) be the bit that is 1 if such a rounding error occurred when computing \(b'_i\). Let \(s_i\) be the bit that is 1 if \(v_i\) was subtracted from \((k_1,k_2)\) at line 8. Then using the triangle inequality, we can derive the bound on the subscalars.

$$\begin{aligned} |k'_1|, |k'_2| &\le \Vert (k'_1, k'_2) \Vert \\ &= \bigg \Vert \sum _{i=1}^2 (\beta _i - (b_i - r_i+s_i))v_i \bigg \Vert \\ &\le \bigg \Vert \sum _{i=1}^2(\beta _i - b_i)v_i\bigg \Vert + \Vert v_1 \Vert + \Vert v_2 \Vert \\ &\le \Big (\frac{q+1}{2}\Big ) + 2 \Big (\frac{q+1}{\sqrt{2}}\Big )\\ &< 2q {} & {} (\text {Assuming } m > 4) \end{aligned}$$

Finally, we will show that \(k'_1,k'_2\) are odd. The proof of Lemma 1 establishes that t is odd. Since \(\frac{(q-1)+t}{2} = \frac{(q-1)-t}{2} + t\), exactly one of the coordinates of \(v_1\) is odd. By symmetry, only the other coordinate of \(v_2\) is odd. Because \(\alpha _1 = 2\Big (\frac{(q-1)+t}{2}\Big )\), \(\alpha _1 \equiv 0\) (mod 4) exactly when the first coordinate of \(v_2\) is odd. Then \(u_i\) is the basis vector with the odd i-th coordinate. Subtracting \(u_i\) from \((k'_1,k'_2)\) flips the parity of \(k'_i\) but leaves the parity of the other subscalar unchanged. \(p_i = 1\) exactly when \(k_i\) is even, meaning that the subscalars output are odd.    \(\square \)

Corollary 1

Let \(m > 4\) be a prime number. For any window size \(2<w\le m\), the number of digits needed to recode the subscalars output by Algorithm 3 is the same as one would need for the subscalars output by the GLV method with no rounding errors. For \(w=2\), one more digit is required.

5 New Formulas for Faster Precomputation

The 2DT scalar multiplication variant represents a different strategy for utilizing precomputation. Its table is quadratically larger than its 1DT counterpart, so reducing the cost of the precomputation is crucial. For both the 1DT and 2DT variants, we present more efficient strategies for the precomputation stage. The 2DT precomputation (Algorithm 5) uses the 1DT precomputation (Algorithm 4) as a subroutine, so our optimization efforts benefit both variants and keep the comparison between them fair.

The precomputation algorithms depend on several new atomic group law formulas. Compared to doing the operations using the existing formulas from [28], they provide a significant saving in the number of field multiplications and squarings needed. Because these formulas are derived by combining the original group law formulas, they do not introduce additional exceptions. The new formulas that are nontrivial to derive are included in the Appendix of the full version [1]. An overview of the cost of the specialized \(\lambda \)-projective group law formulas is given in Table 1.

It follows from the same argument as in Theorem 1 that the new precomputation algorithms are exception-free. At any step we compute \(iP+j\psi (P)\) for small coefficients i, j, where at least one of them is nonzero. Then \((i,\pm j) \notin \mathcal {L}\), meaning that there cannot be any exceptions.

Algorithm 4
figure d

Precomputation-1D

Algorithm 5
figure e

Precomputation-2D

Table 1. Cost of the \(\lambda \)-projective group law formulas with respect to the number of multiplications, multiplications by curve coefficients a and b and squarings \((\tilde{m}, \tilde{m_a}, \tilde{m_b}, \tilde{s})\) over the extension field \(\mathbb {F}_{q^2}\). For the mixed point representations, Q is \(\lambda \)-projective while P, \(P_1\) and \(P_2\) are \(\lambda \)-affine. The formulas that have not been derived or that provided insignificant speedups are marked with ‘-’.

6 Binary Field Arithmetic for Arm

This section details our Arm implementation of the GLS254 curve. The focus will be on the field arithmetic. It was implemented specifically for the platform, relying heavily on 128-bit Arm Neon vector instructions to achieve high performance. The rest of the curve implementation is almost exclusively written in C, and therefore does not differ much from the Intel implementation from [28].

Specifically, our implementation targets Armv8 AArch64, which introduces some new useful instructions for cryptographic implementations. In particular, we take advantage of the new PMULL vector instruction for 64-bit binary polynomial multiplication, a direct analogue of PCLMULQDQ for Intel. For convenience of implementation, we use C intrinsics for the Arm Neon vector instructions.

This section first details the implementation of the base field \(\mathbb {F}_q\) with \(q=2^m\) and \(m=127\), then how we implement the quadratic extension field \(\mathbb {F}_{q^2}\) on top.

6.1 Arithmetic in the Base Field \(\mathbb {F}_q\)

Representation of Elements. One benefit of the choice of field is that the bit vector representation of \(a \in \mathbb {F}_q\) is 127 bits long, meaning that we can fit it in a single 128-bit Neon vector register. We denote by a[0], a[1] respectively the least significant and most significant words of the 128-bit register that stores \(a \in \mathbb {F}_{2^{127}}\): a[0] stores the bit vector for the terms \(z^0\) to \(z^{63}\), and a[1] the terms \(z^{64}\) to \(z^{126}\). We will sometimes use the notation \(a=\{a[0], a[1]\}\) to show the contents of the register.

An efficiency issue for Arm AArch64, compared to AArch32, is that one cannot reference the upper word a[1] of a 128-bit register as a separate 64-bit register [16]. Instead, one needs to use the Arm Neon instruction EXT. It takes two registers a and b and outputs \(\{a[1], b[0]\}\). The lower half of this output can then be referenced for further computation. Table 2 gives an overview of all the Neon instructions that we used for our implementation.

Table 2. Arm Neon 128-bit vector instructions used. The first 128-bit operand is denoted a, the second b. The output is also stored in a 128-bit register.

Polynomial Multiplication. The polynomial multiplication algorithm takes as input two binary polynomials \(a,b\in \mathbb {F}_q\) and outputs their polynomial product c. The degree of c can be up to twice the degree of the operands. Hence it must be stored in two 128-bit vector registers \(c_0, c_1\), where \(c_0\) stores the lower half.

For polynomial multiplication we use the Arm Neon implementation from [16]. It efficiently performs 128-bit polynomial multiplication using the new PMULL instructions. While the authors managed to implement it using 3 multiplications with the Karatsuba algorithm on AArch32, the high number of EXTs this would require on AArch64 meant that they instead opted for an algorithm with an extra PMULL.
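As an illustration, the following portable C sketch models what the vectorized code computes: a 64×64-bit carry-less multiply (the effect of a single PMULL/PMULL2) and a 128×128-bit polynomial product built from four such multiplies. The function names are ours, and the bit-loop stands in for the hardware instruction; the actual implementation in [16] operates directly on Neon registers.

```c
#include <stdint.h>

/* Portable model of a 64x64 -> 128-bit carry-less multiply, i.e. what a
 * single PMULL/PMULL2 instruction computes (illustrative only). */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
    uint64_t l = 0, h = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i) h ^= a >> (64 - i);
        }
    }
    *lo = l; *hi = h;
}

/* 128x128-bit polynomial multiplication from four 64-bit carry-less
 * products, mirroring the 4-multiplication strategy chosen on AArch64.
 * Inputs a = a[1]*z^64 + a[0] and b likewise; output c[0..3], least
 * significant word first. */
static void polymul128(const uint64_t a[2], const uint64_t b[2], uint64_t c[4]) {
    uint64_t mlo, mhi;
    clmul64(a[0], b[0], &c[0], &c[1]);   /* a0*b0 */
    clmul64(a[1], b[1], &c[2], &c[3]);   /* a1*b1 */
    clmul64(a[0], b[1], &mlo, &mhi);     /* cross terms land in the middle */
    c[1] ^= mlo; c[2] ^= mhi;
    clmul64(a[1], b[0], &mlo, &mhi);
    c[1] ^= mlo; c[2] ^= mhi;
}
```

For example, \((z+1)^2 = z^2 + 1\) over \(\mathbb {F}_2\), so multiplying \(\{3,0\}\) by itself yields \(\{5,0,0,0\}\).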

Polynomial Squaring and Field Multi-squaring. Polynomial squaring of an \(a\in \mathbb {F}_q\) can be trivially implemented as \(c_0 \leftarrow \text {pmull\_bot}(a,a)\), \(c_1 \leftarrow \text {pmull\_top}(a,a)\).
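Squaring a binary polynomial simply inserts a zero between consecutive coefficient bits, which is what the two PMULL calls above produce in one step. The same result can be modeled portably with a classic bit-spreading network; the sketch below uses our own naming and is not the Neon code.

```c
#include <stdint.h>

/* Spread the low 32 bits of x so that bit i moves to bit 2*i
 * (classic bit-interleaving network). */
static uint64_t spread32(uint64_t x) {
    x &= 0xffffffffULL;
    x = (x | (x << 16)) & 0x0000ffff0000ffffULL;
    x = (x | (x << 8))  & 0x00ff00ff00ff00ffULL;
    x = (x | (x << 4))  & 0x0f0f0f0f0f0f0f0fULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

/* Portable model of binary-polynomial squaring: each 32-bit chunk of
 * the input expands to a 64-bit word with zeros interleaved. */
static void polysqr128(const uint64_t a[2], uint64_t c[4]) {
    c[0] = spread32(a[0]);
    c[1] = spread32(a[0] >> 32);
    c[2] = spread32(a[1]);
    c[3] = spread32(a[1] >> 32);
}
```

For example, \((1+z+z^2)^2 = 1+z^2+z^4\): squaring \(\{7,0\}\) yields \(\{21,0,0,0\}\).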

For multi-squaring in settings where it does not need to be computed in constant time, we implemented the technique from [2, 7]. It uses lookup tables that are precomputed offline to compute the reduced result of \(a^{2^k}\). However, for smaller Arm processors like the Cortex-A55, this method only outperforms the naive loop implementation for \(k>12\), a much higher threshold than the \(k>5\) reported for Intel [28]. This illustrates the higher cost of memory accesses on smaller devices, which lowers the payoff of precomputation strategies.

Modular Reduction. To compute a field multiplication or squaring, the result of the polynomial algorithm must be reduced modulo f(z). We here present novel algorithms for efficient modular reduction, using exclusively Arm Neon vector instructions. Like the polynomial multiplication algorithm from [16], they attempt to minimize the number of accesses to the top half of the 128-bit registers, each of which incurs the cost of an EXT.

For reducing the result of a polynomial multiplication, we use Algorithm 6. It implements the lazy reduction technique from [27]. Instead of reducing by f(z), we reduce by the redundant trinomial \(z \cdot f(z) = z^{128}+z^{64}+z\). Reductions by zf(z) are roughly 40% faster than proper reductions by f(z). The result can have degree up to 127 instead of 126, but as it still fits in a 128-bit register, this makes no difference. As \((c \bmod zf(z)) \bmod f(z) = c \bmod f(z)\), one can easily recover the properly reduced result from the output of the lazy reduction. This is done by conditionally adding \(f(z) = z^{127}+z^{63}+1\) when bit 127 is set.

Algorithm 6
figure f

Lazy reduction by \(z\cdot f(z)=z^{128}+z^{64}+z\)
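A scalar sketch of the folding behind this reduction, under the assumption that the four 64-bit words of the product are available as plain integers (the real algorithm performs the corresponding shifts and XORs on Neon registers, minimizing EXTs):

```c
#include <stdint.h>

/* Scalar sketch of lazy reduction modulo z*f(z) = z^128 + z^64 + z.
 * Input c[0..3] (least significant word first) holds a polynomial of
 * degree <= 254; output r[0..1] is congruent to it with degree <= 127.
 * Folding rule: z^128 == z^64 + z, applied to the high half h0 + h1*z^64,
 * so h0*z^128 == h0*z^64 + h0*z and h1*z^192 == h1*z^64 + h1*z^65 + h1*z. */
static void redlazy(const uint64_t c[4], uint64_t r[2]) {
    uint64_t h0 = c[2], h1 = c[3];
    r[0] = c[0] ^ (h0 << 1) ^ (h1 << 1);             /* h0*z and h1*z */
    r[1] = c[1] ^ h0 ^ (h0 >> 63) ^ h1 ^ (h1 << 1);  /* z^64, z^65 terms and carry */
}

/* Recover the properly reduced result mod f(z) = z^127 + z^63 + 1:
 * when bit 127 is set, add f(z), i.e. clear bit 127 and flip bits 63 and 0. */
static void redfinal(uint64_t r[2]) {
    uint64_t t = r[1] >> 63;
    r[1] ^= t << 63;
    r[0] ^= (t << 63) ^ t;
}
```

For example, \(z^{128}\) reduces to \(z^{64}+z\): redlazy on \(\{0,0,1,0\}\) yields \(\{2,1\}\).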

It is possible to reduce a squaring slightly faster, exploiting the fact that the result only has bits set at even positions. Thus, we can remove the logic from Algorithm 6 that handles the carry of bit 191 from the left shift by 1. Concretely, this means removing lines 2 and 5, and replacing \(t_2\) by \(t_1\).

Field Inversion. Field inversion is done in the same way as described in [28], using the Itoh-Tsujii algorithm [19]. We generated our addition chain for \(m-1=126\) using McLoughlin’s addchain library [26].

$$\begin{aligned} 1 \rightarrow 2 \rightarrow 3 \rightarrow 6 \rightarrow 12 \rightarrow 24 \rightarrow 30 \rightarrow 48 \rightarrow 96 \rightarrow 126 \end{aligned}$$

The cost of a field inversion is therefore \(m-1\) squarings and 9 multiplications. The steps after 30 involve multi-squarings with \(k>12\). When the inversion does not have to be in constant time, these steps can then be sped up using the table-based multi-squaring approach.
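The chain can be checked at the exponent level. Itoh-Tsujii maintains \(\beta _k = a^{2^k-1}\) and combines terms via \(\beta _{i+j} = (\beta _i)^{2^j} \cdot \beta _j\), which on exponents reads \(e_{i+j} = e_i \cdot 2^j + e_j\) with \(e_1 = 1\). The sketch below (our own naming) replays the chain on integers; each chain_step is one field multiplication, and each shift a (multi-)squaring.

```c
/* Exponent-level check of the Itoh-Tsujii addition chain
 * 1 -> 2 -> 3 -> 6 -> 12 -> 24 -> 30 -> 48 -> 96 -> 126.
 * Invariant: e_k = 2^k - 1, since (2^i - 1)*2^j + (2^j - 1) = 2^{i+j} - 1. */
typedef unsigned __int128 u128;

static u128 chain_step(u128 ei, u128 ej, int j) {
    return (ei << j) + ej;               /* e_i * 2^j + e_j */
}

static u128 itoh_tsujii_exponent(void) {
    u128 e1  = 1;                        /* a^(2^1 - 1) = a */
    u128 e2  = chain_step(e1, e1, 1);    /* 9 chain steps = 9 multiplications */
    u128 e3  = chain_step(e2, e1, 1);
    u128 e6  = chain_step(e3, e3, 3);
    u128 e12 = chain_step(e6, e6, 6);
    u128 e24 = chain_step(e12, e12, 12);
    u128 e30 = chain_step(e24, e6, 6);
    u128 e48 = chain_step(e24, e24, 24);
    u128 e96 = chain_step(e48, e48, 48);
    return chain_step(e96, e30, 30);     /* e_126 = 2^126 - 1 */
}
```

The inverse is then \(a^{-1} = (a^{2^{126}-1})^2\), since \(a^{2^{127}-1} = 1\) for nonzero \(a \in \mathbb {F}_{2^{127}}\).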

6.2 Arithmetic in the Extension Field \(\mathbb {F}_{q^2}\)

The elements of \(\mathbb {F}_{q^2}\) can be represented as polynomials \(a_1u + a_0\), with coefficients \(a_0, a_1 \in \mathbb {F}_q\). Therefore, we need two 128-bit registers to represent them. The extension field arithmetic can be implemented from the base field arithmetic, using the identities presented in [28].
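To make the identities concrete, the following hedged sketch multiplies in a quadratic extension defined by \(u^2 = u + 1\), the customary choice for this construction, using a toy \(GF(2^8)\) base field in place of \(\mathbb {F}_{2^{127}}\) so the example stays self-contained; whether [28] arranges the Karatsuba terms exactly this way is our assumption.

```c
#include <stdint.h>

/* Toy base field GF(2^8) with the AES polynomial, standing in for F_q
 * purely for illustration (Russian-peasant multiplication). */
static uint8_t gf8_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
        b >>= 1;
    }
    return r;
}

/* Karatsuba multiplication in F_q[u]/(u^2 + u + 1):
 * (a1 u + a0)(b1 u + b0) = (m0 + m1) + (m2 + m0) u, with
 * m0 = a0 b0, m1 = a1 b1, m2 = (a0 + a1)(b0 + b1), using u^2 = u + 1.
 * Three base-field multiplications instead of four. */
static void fq2_mul(const uint8_t a[2], const uint8_t b[2], uint8_t c[2]) {
    uint8_t m0 = gf8_mul(a[0], b[0]);
    uint8_t m1 = gf8_mul(a[1], b[1]);
    uint8_t m2 = gf8_mul(a[0] ^ a[1], b[0] ^ b[1]);
    c[0] = m0 ^ m1;
    c[1] = m2 ^ m0;
}
```

As a sanity check, \(u \cdot u = u + 1\) and \((u+1) \cdot u = u^2 + u = 1\).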

In [27], Oliveira et al. present an algorithm for simultaneously reducing both coefficients of an element in \(\mathbb {F}_{q^2}\) at the cost of only a single base field reduction. We have included the Arm Neon implementation in Algorithm 7.

Algorithm 7
figure g

Lazy simultaneous reduction by \(z f(z) = z^{128} + z^{64} + z\) for coordinate-wise reduction in \(\mathbb {F}_{q^2}\) (Oliveira et al. [27])

However, the reduction algorithm requires the field elements to be kept in an interleaved representation. For an \(a\in \mathbb {F}_{q^2}\), let \(a_0, a_1\) be the 128-bit registers storing each of its coefficients. Then the interleaved representation of a is

$$\begin{aligned} a'_0 =\{a_0[0], a_1[0]\}, \ a'_1 = \{a_0[1], a_1[1]\}. \end{aligned}$$
(3)

Note that \(a'_0\), \(a'_1\) store \(a_0\) in the lower half and \(a_1\) in the upper half. The larger input to the reduction algorithm must also be in an interleaved representation. Let \(c_0, c_1, c_2, c_3\) be the non-interleaved 128-bit registers storing the result of a polynomial multiplication or squaring. Because this result is computed using the identities from [28], it is already reduced as a polynomial in u, but has coefficients that need further reduction in \(\mathbb {F}_q\). Then \(c_0\), \(c_1\) store the unreduced constant coefficient and \(c_2\), \(c_3\) the unreduced coefficient of u. The interleaved representation \(c'_0, c'_1, c'_2, c'_3\) of these unreduced coefficients continues the pattern from (3): \(c'_0\), \(c'_1\) are \(c_0\), \(c_2\) interleaved, and \(c'_2\), \(c'_3\) are \(c_1\), \(c_3\) interleaved.
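The transform in (3) can be sketched with plain word moves (on Neon, a ZIP1/ZIP2 pair over 64-bit lanes could achieve the same effect; whether the implementation uses exactly those instructions is our assumption):

```c
#include <stdint.h>

/* Interleaved representation (3): given the two coefficient registers
 * a0 = {a0[0], a0[1]} and a1 = {a1[0], a1[1]} of an element of F_{q^2},
 * produce a'0 = {a0[0], a1[0]} and a'1 = {a0[1], a1[1]}. */
static void interleave(const uint64_t a0[2], const uint64_t a1[2],
                       uint64_t ap0[2], uint64_t ap1[2]) {
    ap0[0] = a0[0]; ap0[1] = a1[0];   /* low words of both coefficients */
    ap1[0] = a0[1]; ap1[1] = a1[1];   /* high words of both coefficients */
}
```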

In order to reap the benefits of the reduction algorithm, we had to implement the \(\mathbb {F}_{q^2}\) arithmetic directly in the interleaved representation. To do this, we manually merged and interleaved the base field algorithms to compute the identities from [28]. The only exception is inversion, where a standard base field inversion is used as a subroutine, which again uses all the arithmetic operations discussed in the previous section. While the abstraction between the base field and the extension field is somewhat broken for the sake of performance, the base field implementation remains the crucial foundation.

7 Results and Discussion

Our implementations, together with SAGE scripts for verification and operation counts, can be found at https://github.com/dfaranha/gls254.

7.1 Operation Counts for Binary GLS Scalar Multiplication

Table 3. The cost of the scalar multiplications with respect to the number of inversions, multiplications and squarings (\(\tilde{i}\), \(\tilde{m}\), \(\tilde{s}\)) over \(\mathbb {F}_{q^2}\). The total cost in field multiplications is estimated using \(\tilde{i}_{\text {non-ct}}=18\tilde{m}\), \(\tilde{i}_{\text {ct}} = 27\tilde{m}\) and \(\tilde{s} = 0.4\tilde{m}\) based on our benchmarks, and rounded to the nearest integer. “prev” denotes the cost for previous work.

Throughout our work, we have used field operation counts as a measure of the complexity of the scalar multiplication variants. With an understanding of the relative cost of the operations, the count gives a platform-independent estimate of the relative performance of the algorithms. In particular, it guided our choice of window size. However, it crucially does not capture the space-time trade-offs of a particular architecture. For the variants discussed in this paper, which are precisely different strategies for how to use space, this trade-off has a significant impact. This will be apparent in the next subsection.

Table 3 gives an overview of the operation counts for the variants. Additions and multiplications by the curve coefficients are ignored due to their insignificant impact on performance. We include the costs of the 1DT algorithm without the new formulas and scalar decomposition to highlight the impact of our contributions. For the sake of fairness, it has been modified to be exception-free in the same way as the others.

As expected, the 2DT algorithm spends more time on precomputation and less in the main loop. We see that the model predicts \(w=5\) to be the sweet spot for 1DT and \(w=4\) for 2DT. Notably, 2DT \(w=3\) is predicted to be faster than 1DT \(w=5\) using only half the amount of space.

For a simpler comparison, we estimate the total cost in terms of field multiplications. The relative cost of each operation will of course differ from processor to processor, so we aimed for a middle ground based on our benchmarks. With this simplification, the model predicts that 1DT with \(w=5\) should be \(2.7\%\) faster as a result of our contributions. The 2DT approach with \(w=4\) is predicted to be \(7.2\%\) faster than 1DT in previous work with \(w=5\), and \(4.6\%\) faster than 1DT with \(w=5\) from this work. Non-constant-time field inversion is used to convert points from projective to affine in the precomputation table only, since it does not depend on the (secret) scalar.

7.2 Implementation Timings

We start by describing our benchmarks for the Armv8 AArch64 implementation, which was written from scratch. We used the ODROID C4 single-board computer, as we wanted a smaller device that could be representative of the majority of Arm devices. It comes with a Quad-Core Cortex-A55, which is considered a mid-range processor. We employ clang from LLVM 13 with optimization level -O3.

Table 4. Benchmarks (in clock cycles) of the field arithmetic on an Arm Cortex-A55 2.0 GHz. The cost of reduction is included in the cost of the multiplication and squaring. Base field reduction is mod zf(z). Op/\(m_b\) and Op/\(\widetilde{m}\) denote the cost relative to base and extension field multiplication, respectively.

The benchmarks for our field implementation are presented in Table 4. Notice that non-constant-time inversions that use lookup tables are roughly \(33\%\) faster.

Table 5. Constant-time variable-base scalar multiplication benchmarks, mostly performed on an Arm Quad-Core Cortex-A55 2.0 GHz. Memory is measured in terms of the number of elliptic curve points stored in the online precomputed table.
Table 6. Protected variable base scalar multiplication benchmarks for 64-bit Intel Core i7 4770 Haswell at 3.4 GHz, and Core i7 7700 Kaby Lake at 3.6 GHz, both with TurboBoost disabled. Memory is measured in terms of the number of elliptic curve points stored in the online precomputed table.

Table 5 presents our scalar multiplication timings in GLS254 and comparisons to related work. Compared to Intel, there are few efficient implementations specialized for Arm at the 128-bit security level. Four\(\mathbb {Q}\) [10] is the closest competitor on Intel, and its authors also provide specialized implementations for Arm [25]. We benchmarked their Armv8 AArch64 implementation on our machine and included their Armv7 timings from [25] for the sake of fairness. A notable outlier is Lenngren’s implementation of Curve25519, which is a much closer competitor on Arm than any Curve25519 implementation on Intel.

As the first GLS254 implementation for Armv8, we are able to claim the constant-time scalar multiplication speed record by \(34.5\%\) in comparison to the previous state of the art. Contrary to the operation counts in Table 3, it is the 2DT \(w=3\) algorithm that is the superior variant. We do not compare against [24] due to the radically different choices of target platform.

For our Intel implementation, we extended the AVX-accelerated code from [31] with the new formulas and 2DT variant. Due to space limitations, we report timings for field arithmetic in the Appendix of the full version [1]. The scalar multiplication benchmarks are presented in Table 6. We benchmarked our code on an older Core i7 4770 Haswell processor, and a Core i7 7700 Kaby Lake as the closest to the Skylake in [27]; both using clang from LLVM 13 and optimization level -O3. For 1DT \(w=5\), we achieve small speedups of 4.8% on Haswell and 4.1% on Kaby Lake over the previous state of the art. The 2DT \(w=3\) variant achieves a further speedup of 2%. Surprisingly, 2DT \(w=4\) performs relatively poorly due to expensive conditional moves within the longer linear pass. The cumulative speedup over previous work is around 6% on both machines. In comparison to Four\(\mathbb {Q}\), our timings are 24% faster and set a new speed record for constant-time scalar multiplication on Intel processors.

Removing the linear pass for the sake of experimentation, 2DT \(w=4\) outperforms the other variants on all the platforms benchmarked, as predicted by the operation counts. The relative cost of the linear pass seems to be the determining factor for how much of a speedup the new approach yields in practice. In general, the linear pass incurs a relatively high cost, favoring solutions that minimize the size of the lookup table. Hence, we see that the 2DT \(w=3\) variant claims the speed record across the board, using only half as much space as the previous record holder 1DT \(w=5\).