Abstract
We revisit and improve the performance of arithmetic on the binary GLS254 curve by introducing the 2DT-GLS scalar multiplication algorithm. The algorithm includes theoretical and practice-oriented contributions of potential independent interest: (i) for the first time, a proof that the GLS scalar multiplication algorithm does not incur exceptions, such that faster incomplete formulas can be used; (ii) faster dedicated atomic formulas that alleviate the cost of precomputation; (iii) a table compression technique that reduces the storage needed for precomputed points; (iv) a refined constant-time scalar decomposition algorithm that is more robust to rounding. We also present the first GLS254 implementation for Armv8. With our contributions, we set new speed records for constant-time scalar multiplication, improving the state of the art by \(34.5\%\) and \(6\%\) on 64-bit Arm and Intel platforms, respectively.
1 Introduction
Elliptic Curve Cryptography (ECC) has become the de facto standard for instantiating public key cryptography, with security based on the conjectured-as-exponential hardness of solving discrete logarithms over elliptic curve groups (the ECDLP problem). Scalar multiplication, in particular in the unknown-point scenario, is the most expensive operation in cryptographic protocols with security guarantees based on the ECDLP. Since its introduction in 1985, there has been plenty of research on finding efficient and secure implementation strategies and on choosing optimal parameters to improve performance [3, 6, 22].
An early milestone in this research was the idea due to Gallant-Lambert-Vanstone (GLV) [14] of exploiting efficient endomorphisms to accelerate scalar multiplication. In the large characteristic case with the curve \(E : y^2 = x^3 + b\) defined over \(\mathbb {F}_p\) for prime p, it initially manifested as evaluating \(\psi : (x,y) \rightarrow (\beta x, y)\) for \(\beta \) a non-trivial cube root of unity. The technique was later generalized to Galbraith–Lin–Scott (GLS) curves defined over \(\mathbb {F}_{p^2}\), and to exploit two or more endomorphisms as in the Four\(\mathbb {Q}\) curve [25] and the genus-2 case [5]. In the characteristic 2 case, Koblitz curves are the classical example of curves equipped with endomorphisms [29]; a class later extended to include binary GLS curves [17].
Beyond performance, implementation security is also relevant, especially on embedded targets where side-channel attacks are more feasible. The classical countermeasure against these attacks is to formulate the arithmetic in a regular way, such as constant-time implementation against timing attacks. This translates to removing secret-dependent branches and memory accesses, and employing complete point addition formulas without corner cases [30]. Additional care needs to be taken for a correct and secure implementation when exploiting endomorphisms, such as using the GLV-SAC recoding technique [11] or by explicitly proving correctness of the specific scalar multiplication algorithm [6, 10].
In this paper, we revisit the implementation of scalar multiplication in binary GLS curves at the 128-bit security level by improving efficiency and correctness of implementations of the GLS254 curve. Our contributions are:
- A variant of the GLS scalar multiplication algorithm that changes the computation/storage trade-off to use a 2D table. The resulting 2DT algorithm spends more precomputation to reduce point additions in the main loop.
- The first proof of correctness for the GLS scalar multiplication algorithm with the \(\lambda \)-projective group law from [28]. We show that there are no corner cases in the main loop of the algorithm (with the exception of possibly the last iteration), which enables the use of faster incomplete formulas.
- Faster dedicated formulas that reduce the cost of precomputation, and a table compression technique that exploits the endomorphism to reduce storage.
- A refined scalar decomposition algorithm that can be easily implemented in constant time. The algorithm has robust parity and length guarantees that fill some gaps in previous works [28].
- An efficient formulation of arithmetic in \(\mathbb {F}_{2^{254}}\) targeting Arm processors. The field arithmetic uses the interleaved representation proposed in the CHES’16 Rump Session [27]; since that presentation was informal, we take the opportunity to include the formulation in the formal research literature. Furthermore, this affirmatively closes a question posed in [21] about the efficiency of binary curves on Armv8 processors, previously deemed “unclear”.
With these contributions, we obtain speed records in the 64-bit Armv8 and Intel platforms, improving on previous results by 34.5% and 6%, respectively. While the latter speedup may seem small, we remark that it comes after decades of successful research in improving performance of ECC, so diminishing returns are expected. Due to the upcoming move to post-quantum cryptography, these techniques could be of limited practicality, but we believe they are relevant for applications not necessarily needing long-term security and involving the computation of many scalar multiplications, such as private set intersection protocols [31]. The proof and table compression technique may find further application in accelerating GLV/GLS scalar multiplication in pairing-based cryptography.
The rest of the document is organized as follows. Section 2 discusses preliminaries on binary GLS curves and their efficient implementation. Section 3 introduces the 2DT-GLS algorithm and its correctness proof. Section 4 pushes these ideas further by presenting the scalar decomposition algorithm, followed by dedicated formulas in Sect. 5. Section 6 discusses the implementation of field arithmetic in Armv8, with experimental results in Sect. 7. The interested reader will also find a treatment of point compression for binary GLS curves in the Appendix of the full version [1], together with detailed formulas.
2 Preliminaries
An ordinary binary elliptic curve in Weierstrass form is defined as
\[ E_{a,b} : y^2 + xy = x^3 + ax^2 + b, \qquad\qquad (1) \]
with \(q=2^m\) and coefficients \(a,b\in \mathbb {F}_q, \ b \ne 0\). For any extension field \(\mathbb {F}_{q^k}\), the points \(P=(x,y) \in \mathbb {F}_{q^k} \times \mathbb {F}_{q^k}\) that satisfy the equation form an abelian group \(E_{a,b}(\mathbb {F}_{q^k})\) together with a point at infinity \(\infty \), which acts as the identity. The group law is denoted with additive notation \(P+Q\), such that the scalar multiplication operation is written as kP.
2.1 Binary GLS Curves
In the interest of defining notation for later use, we briefly summarize the theory of binary GLS curves from [17].
Let E be an ordinary binary curve as defined previously. From Hasse’s theorem, \(\#E(\mathbb {F}_q) = q+1-t\) for some trace t satisfying \(|t|\le 2\sqrt{q}\). Pick some \(a'\in \mathbb {F}_{q^2}\) with \(\mathrm{Tr}'(a')=1\), where \(\mathrm{Tr}'\) is the field trace from \(\mathbb {F}_{q^2}\) to \(\mathbb {F}_2\) defined as \(\mathrm{Tr}'(c)=\sum _{i=0}^{2m-1} c^{2^i}\). It can be shown that \(E'=E_{a',b}\) is the quadratic twist of E over \(\mathbb {F}_{q^2}\) with \(\#E'(\mathbb {F}_{q^2})=(q-1)^2 + t^2\), and that E and \(E'\) are isomorphic over \(\mathbb {F}_{q^4}\) under an involutive twisting isomorphism \(\phi \).
An endomorphism \(\psi \) over \(\mathbb {F}_{q^2}\) can be constructed for \(E'\) by composing \(\phi \) with the q-power Frobenius map \(\pi \) as \(\psi = \phi \pi \phi ^{-1}\). Evaluating \(\psi \) over points \(P \in E'(\mathbb {F}_{q^2})\) only requires field additions [28].
In this scenario, the curve \(E'\) is referred to as a binary GLS curve. Assume \(\#E'(\mathbb {F}_{q^2})=hr\), where h is a small cofactor and r is prime. Let \(\mathcal {G}\) be the unique order-r subgroup \(E'(\mathbb {F}_{q^2})[r]\). Then \(\psi \) restricted to \(\mathcal {G}\) has an eigenvalue \(\mu \in \mathbb {Z}\), so that \(\psi (P)=\mu P\) for all \(P\in \mathcal {G}\). The eigenvalue satisfies \(\mu ^2 \equiv -1 \pmod{r}\).
2.2 \(\lambda \)-Projective Coordinates for GLS Scalar Multiplication
In [28], Oliveira et al. introduced the \(\lambda \)-projective coordinate system. To date, it is empirically the most efficient point representation for binary elliptic curves. Given an affine point \(P=(x,y)\) with \(x\ne 0\), its \(\lambda \)-affine representation is \((x, \lambda )\) with \(\lambda = x + \frac{y}{x}\). In \(\lambda \)-projective coordinates, the affine point can be represented by any triple \((X,\varLambda , Z)\) with \(X=xZ\), \(\varLambda = \lambda Z\) and \(Z \ne 0\). The point at infinity can now be represented as \(\infty = (1, 1, 0)\). The curve equation in (1) is correspondingly transformed to
\[ (\varLambda^2 + \varLambda Z + aZ^2)X^2 = X^4 + bZ^4. \]
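As a sanity check, the \(\lambda \)-affine relation can be verified exhaustively over a toy binary field. The Python sketch below is our own illustration; the field \(\mathrm{GF}(2^3)\) and the brute-force search are stand-ins for the actual GLS254 parameters. It confirms that every affine point with \(x \ne 0\) on every curve \(y^2 + xy = x^3 + ax^2 + b\) satisfies \((\lambda^2 + \lambda + a)x^2 = x^4 + b\):

```python
# Brute-force check of the lambda-affine curve equation over the toy
# field GF(2^3); illustrative stand-in, not the GLS254 field.

M = 3
POLY = 0b1011  # z^3 + z + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^M), reducing modulo POLY."""
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def gf_inv(a):
    """Invert nonzero a via a^(2^M - 2)."""
    r, e = 1, 2 ** M - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

# For every curve y^2 + xy = x^3 + a*x^2 + b and every point with x != 0,
# the representation (x, lam = x + y/x) satisfies the lambda equation.
checked = 0
for a in range(2 ** M):
    for b in range(1, 2 ** M):
        for x in range(1, 2 ** M):
            for y in range(2 ** M):
                x2 = gf_mul(x, x)
                on_curve = (gf_mul(y, y) ^ gf_mul(x, y) ^ gf_mul(x2, x)
                            ^ gf_mul(a, x2) ^ b) == 0
                if not on_curve:
                    continue
                lam = x ^ gf_mul(y, gf_inv(x))
                assert gf_mul(gf_mul(lam, lam) ^ lam ^ a, x2) == gf_mul(x2, x2) ^ b
                checked += 1
assert checked > 0
```

The same algebraic identity is what makes the \(\lambda \)-representation well defined over \(\mathbb {F}_{2^{254}}\); only the field size changes.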
The constant-time binary GLS scalar multiplication algorithm from [28] is included in Algorithm 1. It is a constant-time left-to-right double-and-add algorithm using \(\lambda \)-projective coordinates, combining the Joye-Tunstall regular recoding algorithm [20] with the GLV interleaving technique. For the GLV method, the scalar k is decomposed into two subscalars \(k_1, k_2\) such that \(k\equiv k_1 + k_2\mu \) (mod r) and the subscalars are of roughly half the length of k. The two smaller scalar multiplications can then be computed in an interleaved fashion to save half of the point doublings.
The Joye-Tunstall regular recoding algorithm ensures the regular form of the algorithm and allows for a width-w windowing strategy. Specifically, the subscalars are recoded into \(\ell = \lceil m/(w-1) \rceil \) odd signed digits of \(w-1\) bits using Algorithm 6 from [10] (which is a constant-length modification of Algorithm 6 from [20]). By initially computing a table \(T[i]=(2i+1)P\) for all positive digits (in a phase known as the precomputation), the main loop can process the scalars one digit at a time, reducing the number of iterations by a factor \(w-1\). To be resistant against (cache-)timing attacks, each lookup requires a linear pass over the entire table, and there can be no branches dependent on \(c_j,k_j\). An additional consideration is that the regular recoding algorithm requires the subscalars to be odd. To ensure this, the subscalars are modified to be odd in line 3, and at the end the result is corrected at the cost of two point additions.
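To illustrate, a regular odd signed-digit recoding in the style of Joye-Tunstall can be sketched as follows. This is a simplified model with our own variable names and digit-count formula, not the exact Algorithm 6 of [10]:

```python
# Sketch of regular odd signed-digit recoding (Joye-Tunstall style):
# an odd k is split into ell odd digits of w-1 bits each.

def regular_recode(k, w, ell):
    """Recode odd k into ell odd digits d[i] with |d[i]| < 2^(w-1),
    such that k == sum(d[i] * 2^((w-1)*i))."""
    assert k & 1 == 1, "the recoding requires an odd scalar"
    digits = []
    for _ in range(ell - 1):
        d = (k & (2 ** w - 1)) - 2 ** (w - 1)  # odd digit in [-(2^(w-1)-1), 2^(w-1)-1]
        digits.append(d)
        k = (k - d) >> (w - 1)                 # quotient is again odd
    digits.append(k)                           # final (odd, positive) digit
    return digits

import random
for _ in range(200):
    w = random.randrange(2, 8)
    k = random.getrandbits(127) | 1
    ell = -(-128 // (w - 1)) + 1               # enough digits for a 127-bit odd k
    d = regular_recode(k, w, ell)
    assert all(x & 1 for x in map(abs, d))     # every digit is odd
    assert all(abs(x) < 2 ** (w - 1) for x in d)
    assert sum(x << ((w - 1) * i) for i, x in enumerate(d)) == k
```

The constant number of iterations `ell` is what gives the recoding its regular, branch-free structure.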
However, both [28] and the subsequent [27] suffer from a lack of rigor. First and foremost, no proof has been presented for correctness of the scalar multiplication algorithm. The \(\lambda \)-projective group law formulas are incomplete, so it could potentially fail in corner cases. It also relies on ad-hoc tricks for constant-time scalar decomposition, with no proof of correctness or length guarantees.
2.3 GLS254 and the Choice of Parameters
Previous works have benchmarked their implementation of scalar multiplication over a GLS curve specially crafted for efficiency at the 128-bit security level. For the GLS254 curve, one chooses \(m = 127\), such that the base field can be defined as \(\mathbb {F}_{q} \equiv \mathbb {F}_{2}[z]/(z^{127} + z^{63} + 1)\) and its quadratic extension as \(\mathbb {F}_{q^2} \equiv \mathbb {F}_{q}[u]/(u^2 + u + 1)\). The curve coefficients should be chosen with minimal Hamming weight such that multiplying by them is as efficient as possible. We performed a parameter search that reproduced the curve chosen in [27]. By fixing \(a' = u\) and searching for the smallest i such that \(b = z^i + 1\) yields a curve of order 2r for prime r, we were able to confirm that \(i = 27\) is the smallest choice, giving a 254-bit r. This means that a multiplication by b can be computed with a single shifted addition.
To protect against Weil descent and generalized Gaudry-Hess-Smart (gGHS) attacks [15, 18], several precautions must be taken. We pick m to be prime, as is the case for GLS254. In addition, the choice of b must be verified to not allow the attack, which happens with negligibly small probability for random b [17]. We used the MAGMA implementation of [8] available at https://github.com/JJChiDguez/gGHS-check to clear our particular choice. This particular check, together with the curve generation method geared towards efficiency, satisfies rigidity concerns [4]. We stress that the ECDLP in binary curves remains infeasible for the parameter range used in this work [12].
3 Scalar Multiplication in GLS Curves
In this section, we begin by presenting a new scalar multiplication variant for binary GLS curves. It combines the Shamir-Straus’ trick [9] for multiple scalar multiplication with a new table compression technique using the GLS endomorphism. We refer to it as the 2DT variant because it builds a two-dimensional table \(T[i,j]=iP+j\psi (P)\) instead of \(T[i]=iP\). Then, in Subsect. 3.2, we prove that the GLS scalar multiplication algorithms are exception-free.
3.1 The 2DT Variant
As in some fast variants of the Shamir-Straus’ trick [9] for multiple scalar multiplication, the idea is to precompute \(T[i,j]=iP + j \psi (P)\) for odd i, j. In the scalar multiplication loop, we then save roughly one point addition per iteration of the main loop by computing \(2Q+T[i,j]\) instead of \(2Q+T[i]+\psi (T[j])\).
This method was previously deemed noncompetitive due to the blowup in the size of the table. Because the subscalar regular recoding uses signed digits, we need to efficiently retrieve \(s_1iP + s_2j\psi (P)\) for any \(i, j \in \{1, ..., 2^{w-1}-1\}\) and sign combination \(s_1, s_2 \in \{\pm 1\}\). The standard approach would be to build a table of \(iP\pm j\psi (P)\) and then use conditional negations to get the two other combinations. The 2D table would then store \(2^{2(w-2)+1}\) points. Even with specialized formulas for the precomputation, the cost in terms of storage and field operations is too high compared to the conventional 1DT algorithm.
The crucial new observation is that the efficiently computable GLS endomorphism \(\psi \) can also be used to compress the 2D table by a factor of 2. As \(\psi ^2(P)=-P\) for any \(P\in \mathcal {G}\), we obtain the identity
\[ \psi(jP + i\psi(P)) = j\psi(P) - iP. \]
This implies that we can generate all combinations from a table that only stores \(iP+j\psi (P)\) for positive i, j. The rest of the combinations can be efficiently retrieved using conditional negations and conditional applications of \(\psi \).
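The effect of \(\psi^2 = -1\) on the table can be illustrated in a toy model where "points" are integers modulo a small prime r and \(\psi \) is multiplication by an eigenvalue \(\mu \) with \(\mu^2 \equiv -1 \pmod r\). The values r = 13, \(\mu = 5\), and P = 2 below are illustrative stand-ins, not GLS254 parameters:

```python
# Toy model of the table-compression identity: points are integers mod r,
# psi is multiplication by mu with mu^2 = -1 (mod r).

r, mu = 13, 5
assert (mu * mu) % r == r - 1          # mu^2 = -1 (mod r)

def psi(p):
    return (mu * p) % r

def neg(p):
    return (-p) % r

odd_digits = [1, 3]                    # positive odd digits for w = 3
P = 2
T = {(i, j): (i * P + j * psi(P)) % r  # compressed table: only (+,+) entries
     for i in odd_digits for j in odd_digits}

for i in odd_digits:
    for j in odd_digits:
        for s1 in (1, -1):
            for s2 in (1, -1):
                want = (s1 * i * P + s2 * j * psi(P)) % r
                if (s1, s2) == (1, 1):
                    got = T[i, j]
                elif (s1, s2) == (-1, -1):
                    got = neg(T[i, j])
                elif (s1, s2) == (1, -1):
                    got = neg(psi(T[j, i]))  # i*P - j*psi(P) = -psi(j*P + i*psi(P))
                else:
                    got = psi(T[j, i])       # -i*P + j*psi(P) = psi(j*P + i*psi(P))
                assert got == want
```

All four sign combinations are recovered from the stored (+,+) entries using only negation and one application of \(\psi \), which is exactly what the constant-time table lookup does on the curve.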
This compression trick not only halves the amount of precomputation needed, but also halves the time needed to do a linear pass through the table in the main loop. With new specialized group law formulas for the precomputation (see Sect. 5), the 2DT algorithm is able to compete for the constant-time scalar multiplication speed record (see Sect. 7). The 2DT variant is presented in Algorithm 2. For \(w=2\), the only difference is that a complete formula must be used for \(2Q+P_1\) at \(i=1\) as well.
3.2 Proof of Exception-Free Scalar Multiplication
We will now prove that the scalar multiplication algorithms presented here and in [28] (with a minor modification) are correct on all valid inputs. The core issue is that the underlying \(\lambda \)-projective group law formulas from [28] are not complete, meaning that they output the wrong result in some corner cases. Without these exceptions, correctness would be trivial.
One could explicitly handle these exceptions in constant time using complete formulas, but this would come at a high performance cost. Here, we prove that exceptional cases can only occur in the last iteration(s) of the main loop. By using complete formulas at the very end, correctness is ensured at only a minor performance penalty.
For clarity, the proof will be tailored to the 2DT algorithm. However, it can be easily adapted to the 1DT algorithm. The proof can be seen as a two-dimensional extension of the argument from Proposition 1 in [6]. We will for now assume that the scalar decomposition produces subscalars of bit-length at most \(m+1\), and defer the discussion about how to achieve this to Sect. 4. The proof crucially relies on the structure of the lattice discussed in [13, 14] that emerges in the GLV method for scalar decomposition:
\[ \mathcal{L} = \{ (z_1, z_2) \in \mathbb{Z}^2 : z_1 + z_2\mu \equiv 0 \pmod{r} \}. \]
Here r is the large prime order in \(\#E'(\mathbb {F}_{q^2})=hr\) and \(\mu \) the eigenvalue of \(\psi \) restricted to \(\mathcal {G}\). For our purposes, it is very useful to think of \(\mathcal {L}\) as the lattice of decompositions of zero (as done in [10]).
Our proof requires that the norm of the shortest non-zero vector in \(\mathcal {L}\) is at least \((q-1)/\sqrt{2}\). The structure of \(\mathcal {L}\) is determined by the order of \(E'(\mathbb {F}_{q^2})\). The bound required on the shortest norm might not be obtainable in general, so we focus on the subclass of GLS curves most relevant for cryptography. As the (affine) point \((0, \sqrt{b})\) is always in \(E'(\mathbb {F}_{q^2})\), \(2 \mid h\). To ensure that the discrete log problem in \(\mathcal {G}\) is as hard as possible, the optimal choice is to pick a curve with \(h=2\), which is easy in practice. Restricting our proof to this subclass of GLS curves, we can give an explicit solution to the SVP in \(\mathcal {L}\).
Lemma 1
Let \(q=2^m\). Let \(E'\) be a binary GLS curve with \(\#E'(\mathbb {F}_{q^2}) = 2r\) for an odd prime r. Let \(E/\mathbb {F}_q\) be the curve such that \(E'\) over \(\mathbb {F}_{q^2}\) is the quadratic twist of \(E(\mathbb {F}_{q^2})\). Define
\[ v_1 = \left( \frac{(q-1)-t}{2},\; -\frac{(q-1)+t}{2} \right), \qquad v_2 = \left( \frac{(q-1)+t}{2},\; \frac{(q-1)-t}{2} \right), \]
where t is the trace of the q-th power Frobenius endomorphism on E. Then \(v_1,v_2\) form an orthogonal basis for the lattice \(\mathcal {L}\), with \(\Vert v_1 \Vert = \Vert v_2 \Vert = \sqrt{r} = \min _{v\in \mathcal {L}\setminus \{0\}} \Vert v \Vert \).
Proof
Smith gives in Sects. 6 and 4 of [32] the basis \(v'_1 = (q-1, -t)\), \(v_2\) for \(\mathcal {L}\). However, this basis is not orthogonal and \(\Vert v'_1 \Vert =\sqrt{2r}\). The latter follows from \(\#E'(\mathbb {F}_{q^2})=(q-1)^2 + t^2\) (see Theorem 2 in [13]). We make the small adjustment of replacing \(v'_1\) with \(v_1 = v'_1 - v_2\). It can easily be checked that the basis \(v_1, v_2\) is orthogonal and that \(\Vert v_1 \Vert = \Vert v_2 \Vert = \sqrt{r}\). \(\square \)
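The geometric claims of the lemma are easy to check numerically. The sketch below is our own; it takes \(v'_1 = (q-1, -t)\), one choice of \(v_2 = ((q-1+t)/2, (q-1-t)/2)\) consistent with the proof, and \(v_1 = v'_1 - v_2\), and verifies orthogonality and norms only (not lattice membership, which would require \(\mu \)):

```python
# Numeric check of orthogonality and norms for random (q, t) with t odd:
# v2 = ((q-1+t)/2, (q-1-t)/2) is one choice consistent with the proof.

import random

def basis(q, t):
    v2 = ((q - 1 + t) // 2, (q - 1 - t) // 2)
    v1 = (q - 1 - v2[0], -t - v2[1])   # v1 = v1' - v2 with v1' = (q-1, -t)
    return v1, v2

for _ in range(100):
    m = random.randrange(5, 128)
    q = 2 ** m
    t = random.randrange(-2 ** (m // 2 + 1), 2 ** (m // 2 + 1)) | 1  # odd stand-in trace
    two_r = (q - 1) ** 2 + t ** 2      # = #E'(F_{q^2}) = 2r
    v1, v2 = basis(q, t)
    assert v1[0] * v2[0] + v1[1] * v2[1] == 0        # orthogonal
    assert v1[0] ** 2 + v1[1] ** 2 == two_r // 2     # ||v1||^2 = r
    assert v2[0] ** 2 + v2[1] ** 2 == two_r // 2     # ||v2||^2 = r
```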
For the sake of proving scalar multiplication exception-free, the importance of Lemma 1 lies in showing that the minimal norm in \(\mathcal {L}\) is large. This is what enables the subsequent proof to succeed. Note that we also require m to be prime, which is needed for security reasons anyway.
Theorem 1
Let the notation be as in Lemma 1. Let \(m>4\) be prime and \(2\le w\le m\). Then Algorithm 2 is exception-free.
Proof
Let us start by identifying the exceptional cases of the \(\lambda \)-projective formulas. The formula for \(P+Q\) breaks down whenever \(P=\pm Q\), \(P=\infty \) or \(Q=\infty \). The point \(\infty \) does not have a \(\lambda \)-affine representation, so these last two cases are only a concern when the points are \(\lambda \)-projective. The 2P formula has no exceptional cases. Finally, the atomic formula for \(2Q+P\) breaks down when \(P = \pm 2Q\) or \(Q = \infty \).
We will argue that all the exceptions that can occur in Algorithm 2 encode an element of \(\mathcal {L}\). By this, we mean that they define some \(z_1, z_2 \in \mathbb {Z}\) such that \(z_1P + z_2\psi (P) = \infty \). Then \((z_1, z_2)\in \mathcal {L}\).
If \((z_1, z_2) \ne (0,0)\), we can show that either \(|z_1|\) or \(|z_2|\) must be at least an m-bit integer. \(v_1,v_2\) form an orthogonal basis of \(\mathcal {L}\), and are both a solution to the SVP in \(\mathcal {L}\) with norm \(\sqrt{r}\). Using the Hasse bound we get that \(\Vert v_1 \Vert , \Vert v_2 \Vert \ge (q-1)/\sqrt{2}\). Now assume for contradiction that \(|z_1|, |z_2| < q/2\). Then
\[ \Vert (z_1, z_2) \Vert \le \sqrt{2}\,\max(|z_1|, |z_2|) \le \frac{q-2}{\sqrt{2}} < \frac{q-1}{\sqrt{2}} \le \sqrt{r}. \]
This is a contradiction, since \(v_1, v_2\) have minimal norm in \(\mathcal {L} \setminus \{ {\textbf {0}} \}\). Thus, it must be the case that \(|z_1|\ge q/2\) or \(|z_2|\ge q/2\).
We now have all the tools needed to prove that no exceptions occur in Algorithm 2, and we will start with the precomputation stage. Assume that an exception did occur in the computation of \(iP +j\psi (P)\) for some odd \(i,j\in \{1, \ldots , 2^{w-1}-1\}\). \(P, \psi (P) \ne \infty \), so there can only be an exception if \(iP = s j\psi (P)\) for some \(s\in \{\pm 1\}\). Then \((i, -sj)\in \mathcal {L} \setminus \{ {\textbf {0}} \}\). However, this is a contradiction, as neither \(|i|=i\) nor \(|-sj|=j\) is an m-bit integer.
Next is the main loop. Let \(Q_i=z_{1,i}P+z_{2,i}\psi (P)\) denote the value of Q after iteration i. No exception occurred in the precomputation stage, meaning Q is correctly initialized to \(Q_{\ell -1} = \overline{k}_{1,\ell -1}P + \overline{k}_{2,\ell -1}\psi (P)\). Then at iteration i,
\[ Q_i = 2^{w-1}Q_{i+1} + \overline{k}_{1,i}P + \overline{k}_{2,i}\psi(P), \quad \text{so} \quad z_{j,i} = 2^{w-1}z_{j,i+1} + \overline{k}_{j,i} \ \text{ for } j \in \{1,2\}. \]
Observe that as long as no exceptions occur, we have the invariant that
\[ z_{j,i} = \sum_{l=i}^{\ell-1} \overline{k}_{j,l}\, 2^{(w-1)(l-i)}, \quad \text{and hence} \quad |z_{j,i}| < 2^{(w-1)(\ell-i)}. \]
Assume the first exception occurs at iteration i. The \(w-2\) doublings are exception-free, so the exception must have been caused by the computation of \(2Q_{i+1}+P_i\). \(Q_{i+1}\) cannot have been \(\infty \). This is because \(2Q\ne \infty \) for any \(Q \ne \infty \) and the incomplete \(2Q+P\) formula does not output \(\infty \) for any P, Q on the curve. Hence, the first exception must have occurred because \(2^{w-1}Q_{i+1}=sP_i\) for an \(s\in \{\pm 1\}\). This is equivalent to
\[ (2^{w-1}z_{1,i+1} - s\overline{k}_{1,i})P + (2^{w-1}z_{2,i+1} - s\overline{k}_{2,i})\psi(P) = \infty. \]
Define \(z'_{j,i}=2^{w-1}z_{j,i+1}-s\overline{k}_{j,i}\). Notice that \(-s\overline{k}_{j,i}\) is a valid digit of the regular recoding. The invariants for \(|z_{j,i}|\) also hold for \(|z'_{j,i}|\). Hence, \((z'_{1,i}, z'_{2,i}) \in \mathcal {L} \setminus \{ {\textbf {0}} \}\).
Since m is prime, it holds for all \(2 \le w \le m\) that
This means that \(|z'_{1,i}|\) and \(|z'_{2,i}|\) are of at most \(m-1\) bits while \(i\ge \ell - \lceil \frac{m}{w-1} \rceil + 1\). The implication is that the first exception could not have occurred in these iterations, so we must have that \(i\le \ell - \lceil \frac{m}{w-1} \rceil \).
For \(w=2\), this means that \(i\le 1\). For all other w, this means that \(i=0\). But these are exactly the iterations that use a complete formula for the computation of \(2Q_{i+1}+P_i\) for the respective values of w. Thus, it is impossible for the first exception to occur in these last iterations. The conclusion is that there can be no exception in the main loop. \(\square \)
4 Scalar Decomposition with Parity and Length Guarantees
The GLV method for scalar decomposition needs a bit of care when required to run in constant time while preserving length guarantees. The GLV method uses a reduced basis \(\{u_1, u_2\}\) of some sublattice \(\mathcal {L}'\) of \(\mathcal {L}\) (see Sect. 3.2) to solve the CVP for (k, 0) in \(\mathcal {L}'\) using Babai rounding [14]. For a given basis, there exist unique constants \(N, \alpha _1, \alpha _2\in \mathbb {Z}\) such that
\[ (k, 0) = \beta_1 u_1 + \beta_2 u_2, \]
where \(\beta _i = \frac{\alpha _i}{N}k\). The subscalars \(k_1, k_2\) are then computed as
\[ (k_1, k_2) = (k, 0) - b_1 u_1 - b_2 u_2, \]
where \(b_i = \lceil \beta _i \rfloor \). The magnitude of the subscalars can then be bounded by some expression depending on the norm of the basis vectors.
The issue for constant-time implementations is the computation of the \(b_i\)’s. Ideally we would compute them using divisions, but unfortunately divisions do not run in constant time on most processors.
The standard solution was first introduced in [5] and further analyzed in [10]. The idea is to approximate the computation of the \(b_i\)’s using integer divisions by powers of 2, which can be implemented in constant time using shifts. Choose some integer d such that \(k < 2^{d}\), and precompute the constants \(c_i = \lceil \frac{\alpha _i}{N} 2^d \rfloor \). Then at runtime compute \(b_i\) as \(b'_i = \lfloor \frac{c_i k}{2^d} \rfloor \).
This approach introduces rounding errors. It was shown in Lemma 1 of [10] that \(b'_i\) will either be \(b_i\) or incorrectly rounded down to \(b_i-1\). This does not affect the correctness of the decomposition. However, it does negatively impact the bounds on \(|k_1|, |k_2|\). If the bounds become too loose, we might need more iterations of the main loop of the scalar multiplication to ensure correctness.
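The one-sided rounding behavior is easy to reproduce. The following sketch uses random stand-in values for \(\alpha_i\), N, and k (not the actual GLS254 decomposition constants) and checks that the shift-based approximation is off by at most one, and only downwards:

```python
# b_i = round(alpha*k/N) approximated by b'_i = (c_i * k) >> d with a
# precomputed c_i = round(alpha * 2^d / N); random stand-in constants.

import random

def round_div(a, b):
    """Round-to-nearest of a/b for positive integers (ties round up)."""
    return (2 * a + b) // (2 * b)

d = 256
for _ in range(1000):
    N = random.randrange(3, 2 ** 254) | 1
    alpha = random.randrange(1, N)
    k = random.randrange(1, N)             # k < N <= 2^d
    c = round_div(alpha << d, N)           # precomputed offline
    b_exact = round_div(alpha * k, N)      # what we want (needs a division)
    b_shift = (c * k) >> d                 # shifts only: constant-time friendly
    assert b_shift in (b_exact, b_exact - 1)
```

The error is one-sided because the final shift always rounds down; this is exactly the property the parity trick below Lemma 2 relies on.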
We present Algorithm 3 for constant-time scalar decomposition using the optimal basis from Lemma 1. In Lemma 2 we prove that it outputs subscalars of at most \(m+1\) bits. Without rounding errors it would be m, but the extra bit does not affect the number of iterations of the main loop (see Corollary 1).
In fact, not handling rounding errors leads to more efficient scalar multiplication. Using the fact that the rounding errors are one-sided, we introduce a new trick to ensure that \(k_1, k_2\) are always odd, without affecting the length guarantees. The two point additions needed to correct for even subscalars (see Subsect. 2.2) are no longer needed, leading to a simpler and faster algorithm. Note that this optimization applies to both 1DT and 2DT scalar multiplication.
Lemma 2
Let the notation be as in Lemma 1 and assume that \(h=2\) and \(m>4\). Algorithm 3 on input \(k\in [1,r-1]\) outputs a valid decomposition \(k_1,k_2\). The subscalars are odd and \(|k_1|, |k_2|<2q\).
Proof
Let \(k_1\), \(k_2\) be the output of the GLV method on input k and let \(k'_1, k'_2\) be the output of Algorithm 3. We start with correctness. By definition, \((k,0) + \mathcal {L} \in \mathbb {Z}^2 / \mathcal {L}\) is the set of valid decompositions of k. Algorithm 3 produces \((k'_1, k'_2)\) by adding integer multiples of \(v_1\) and \(v_2\) to (k, 0). Hence, \((k'_1, k'_2) \in (k,0)+\mathcal {L}\).
Next, let’s bound the magnitude of the subscalars. The basis vectors are orthogonal with norm \(\sqrt{r}\). By the same argument as in Lemma 3 of [13] it follows that \(\Vert (k_1, k_2) \Vert \le \sqrt{r/2}\). To make the analysis independent of r, we can upper-bound it as \(r \le (q+1)^2 /2\) using the Hasse bound. Then \(\Vert v_1 \Vert , \Vert v_2 \Vert \le (q+1)/\sqrt{2}\) and \(\Vert (k_1, k_2) \Vert \le (q+1)/2\).
It can be easily verified that \(\alpha _1, \alpha _2, N\) are specified such that \((k,0)=\beta _1 v_1 + \beta _2 v_2\). Since \(k<r\le (q+1)^2/2 \le q^2 \le 2^d\), it follows from Lemma 1 of [10] that \(b'_i\) is either \(b_i\) or incorrectly rounded down to \(b_i-1\).
Let \(r_i\) be the bit that is 1 if such a rounding error occurred when computing \(b'_i\). Let \(s_i\) be the bit that is 1 if \(v_i\) was subtracted from \((k_1,k_2)\) at line 8. Then using the triangle inequality, we can derive the bound on the subscalars.
Finally, we will show that \(k'_1,k'_2\) are odd. The proof of Lemma 1 establishes that t is odd. \(\frac{(q-1)+t}{2} = \frac{(q-1)-t}{2} + t\), meaning exactly one of the coordinates of \(v_1\) is odd. By symmetry, only the other coordinate of \(v_2\) is odd. Because \(\alpha _1 = 2\Big (\frac{(q-1)+t}{2}\Big )\), \(\alpha _1 \equiv 0\) (mod 4) exactly when the first coordinate of \(v_2\) is odd. Then \(u_i\) is the basis vector with the odd i-th coordinate. Subtracting \(u_i\) from \((k'_1,k'_2)\) flips the parity of \(k'_i\) but leaves the parity of the other subscalar unchanged. \(p_i = 1\) exactly when \(k_i\) is even, meaning that the subscalars output are odd. \(\square \)
Corollary 1
Let \(m > 4\) be a prime number. For any window size \(2<w\le m\), the number of digits needed to recode the subscalars output by Algorithm 3 is the same as one would need for the subscalars output by the GLV method with no rounding errors. For \(w=2\), one more digit is required.
5 New Formulas for Faster Precomputation
The 2DT scalar multiplication variant represents a different strategy for utilizing precomputation. The table is quadratically larger than its 1DT counterpart, so reducing the cost of its precomputation is crucial. For both the 1DT and 2DT variants, we present more efficient strategies for the precomputation stage. The 2DT precomputation (Algorithm 5) uses the 1DT precomputation (Algorithm 4) as a subroutine, so our optimization efforts benefit both variants and the comparison between them remains fair.
The precomputation algorithms depend on several new atomic group law formulas. Compared to doing the operations using the existing formulas from [28], they provide a significant saving in the number of field multiplications and squarings needed. Because these formulas are derived by combining the original group law formulas, they do not introduce additional exceptions. The new formulas that are nontrivial to derive are included in the Appendix of the full version [1]. An overview of the cost of the specialized \(\lambda \)-projective group law formulas is given in Table 1.
It follows from the same argument as in Theorem 1 that the new precomputation algorithms are exception-free. At any step we compute \(iP+j\psi (P)\) for small coefficients i, j, where at least one of them is nonzero. Then \((i,\pm j) \notin \mathcal {L}\), meaning that there cannot be any exceptions.
6 Binary Field Arithmetic for Arm
This section details our Arm implementation of the GLS254 curve. The focus will be on the field arithmetic. It was implemented specifically for the platform, relying heavily on 128-bit Arm Neon vector instructions to achieve high performance. The rest of the curve implementation is almost exclusively written in C, and therefore does not differ much from the Intel implementation from [28].
Specifically, our implementation targets Armv8 AArch64, which introduces some new useful instructions for cryptographic implementations. In particular, we take advantage of the new PMULL vector instruction for 64-bit binary polynomial multiplication, a direct analogue of PCLMULQDQ for Intel. For convenience of implementation, we use C intrinsics for the Arm Neon vector instructions.
This section first details the implementation of the base field \(\mathbb {F}_q\) with \(q=2^m\) and \(m=127\), then how we implement the quadratic extension field \(\mathbb {F}_{q^2}\) on top.
6.1 Arithmetic in the Base Field \(\mathbb {F}_q\)
Representation of Elements. One benefit of the choice of field is that the bit-vector representation of \(a \in \mathbb {F}_q\) is 127 bits long, meaning that it fits in a single 128-bit Neon vector register. We denote by a[0] and a[1] the least and most significant 64-bit words, respectively, of the 128-bit register that stores \(a \in \mathbb {F}_{2^{127}}\): a[0] stores the bit vector for the terms \(z^0\) to \(z^{63}\), and a[1] the terms \(z^{64}\) to \(z^{126}\). We will sometimes use the notation \(a=\{a[0], a[1]\}\) to show the contents of the register.
An efficiency issue for Arm AArch64, compared to AArch32, is that it cannot reference the upper word a[1] of a 128-bit register as a separate 64-bit register [16]. Instead, one needs to use the Arm Neon instruction EXT. It takes two registers a, b and outputs \(\{a[1], b[0]\}\). The lower half of this output can then be referenced for further computation. Table 2 gives an overview of all the Neon instructions that we used for our implementation.
Polynomial Multiplication. The polynomial multiplication algorithm takes as input two binary polynomials \(a,b\in \mathbb {F}_q\) and outputs their polynomial product c. The degree of c can be up to twice the degree of the operands. Hence it must be stored in two 128-bit vector registers \(c_0, c_1\), where \(c_0\) stores the lower half.
For polynomial multiplication we use the Arm Neon implementation from [16]. It efficiently performs 128-bit polynomial multiplication using the new PMULL instructions. While they managed to implement it using 3 multiplications with the Karatsuba algorithm on AArch32, the high number of EXTs this would require on AArch64 meant that they instead opted for an algorithm with an extra PMULL.
Polynomial Squaring and Field Multi-squaring. Polynomial squaring of an \(a\in \mathbb {F}_q\) can be trivially implemented as \(c_0 \leftarrow \text {pmull\_bot}(a,a)\), \(c_1 \leftarrow \text {pmull\_top}(a,a)\).
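The reason a single carry-less `a*a` multiply suffices is that squaring in \(\mathbb{F}_2[z]\) is linear: \((\sum a_i z^i)^2 = \sum a_i z^{2i}\), i.e., squaring just spreads the bits of the operand out with zero gaps. A bit-level Python model of this fact (our own illustration, not the Neon code):

```python
# Squaring in GF(2)[z] equals bit-spreading: clmul(a, a) == spread(a).

def clmul(a, b):
    """Carry-less (GF(2)[z]) multiplication of two bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def spread(a):
    """Insert a zero bit between consecutive bits of a."""
    r, i = 0, 0
    while a:
        r |= (a & 1) << (2 * i)
        a >>= 1
        i += 1
    return r

import random
for _ in range(100):
    a = random.getrandbits(127)
    assert clmul(a, a) == spread(a)
```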
For multi-squaring in settings where it does not need to be computed in constant time, we implemented the technique from [2, 7]. It uses lookup tables that are precomputed offline to compute the reduced result of \(a^{2^k}\). However, for smaller Arm processors like the Cortex-A55, this method only outperforms the naive loop implementation for \(k>12\). This is a lot higher than the threshold of \(k>5\) for Intel [28]. It is an example of the higher cost of memory access on smaller devices, which results in a lower yield for precomputation strategies.
Modular Reduction. To compute a field multiplication or squaring, the result of the polynomial algorithm must be reduced modulo f(z). We here present novel algorithms for efficient modular reduction, using exclusively Arm Neon vector instructions. Like the polynomial multiplication algorithm from [16], they attempt to minimize the number of accesses to the top half of the 128-bit registers, each of which incurs the cost of an EXT.
For reducing the result of a polynomial multiplication, we use Algorithm 6, which implements the lazy reduction technique from [27]. Instead of reducing by f(z), we reduce by the redundant trinomial \(z \cdot f(z) = z^{128}+z^{64}+z\). Reductions by zf(z) are roughly 40% faster than proper reductions by f(z). The result can have degree up to 127 instead of 126, but since it still fits in a 128-bit register, this makes no difference. As \((c \text { mod } zf(z))\) mod \(f(z) = c\) mod f(z), one can easily recover the properly reduced result from the output of the lazy reduction: conditionally add \(z^{63}+1\) when bit 127 is set.
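The effect of this lazy reduction can be sketched in Python as follows (a model of what Algorithm 6 computes, not of its Neon register scheduling): bits at position 128 and above are folded using \(z^{128} \equiv z^{64}+z \pmod{z f(z)}\), and the proper result is recovered with one conditional XOR of f(z), which adds \(z^{63}+1\) and clears bit 127:

```python
F = (1 << 127) | (1 << 63) | 1           # f(z) = z^127 + z^63 + 1

def clmul(a, b):
    # Bit-serial carry-less multiplication (reference model of PMULL).
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return c

def lazy_red(c):
    """Reduce mod z*f(z) = z^128 + z^64 + z; result has degree up to 127."""
    while c >> 128:
        hi = c >> 128
        c = (c & ((1 << 128) - 1)) ^ (hi << 64) ^ (hi << 1)
    return c

def full_red(c):
    """Proper reduction mod f(z): fix up the lazy result when bit 127 is set."""
    c = lazy_red(c)
    if c >> 127:
        c ^= F                           # z^127 = z^63 + 1 (mod f)
    return c

assert full_red(F) == 0                  # f(z) itself reduces to 0 mod f(z)
```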
It is possible to reduce a squaring slightly faster, exploiting the fact that the result only has bits set at even positions. Thus, we can remove the logic from Algorithm 6 that handles the carry of bit 191 from the left shift by 1. Concretely, this means removing lines 2 and 5, and replacing \(t_2\) by \(t_1\).
Field Inversion. Field inversion is done in the same way as described in [28], using the Itoh-Tsujii algorithm [19]. We generated our addition chain for \(m-1=126\) using McLoughlin’s addchain library [26].
The cost of a field inversion is therefore \(m-1\) squarings and 9 multiplications. The chain steps after 30 involve multi-squarings with \(k>12\); when the inversion does not have to be in constant time, these steps can thus be sped up using the table-based multi-squaring approach.
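A Python sketch of the inversion (exposition only): writing \(T_k = a^{2^k-1}\), each chain step computes \(T_{i+j} = (T_i)^{2^j} \cdot T_j\), and \(a^{-1} = (T_{126})^2\) since \(a^{-1} = a^{2^{127}-2}\). The particular chain below is one valid addition chain for 126 with 9 multiplications and 126 total squarings; the chain actually generated with addchain [26] may differ.

```python
M, F = 127, (1 << 127) | (1 << 63) | 1   # f(z) = z^127 + z^63 + 1

def clmul(a, b):
    # Bit-serial carry-less multiplication (reference model of PMULL).
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return c

def red(c):
    # Naive reduction mod f(z).
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c

def mul(a, b):
    return red(clmul(a, b))

def sqr_k(a, k):
    # k repeated squarings (a multi-squaring).
    for _ in range(k):
        a = red(clmul(a, a))
    return a

def inverse(a):
    """Itoh-Tsujii inversion: 126 squarings and 9 multiplications."""
    t = {1: a}
    for s, i, j in [(2, 1, 1), (3, 2, 1), (6, 3, 3), (12, 6, 6), (15, 12, 3),
                    (30, 15, 15), (60, 30, 30), (120, 60, 60), (126, 120, 6)]:
        t[s] = mul(sqr_k(t[i], j), t[j])  # T_{i+j} = T_i^(2^j) * T_j
    return sqr_k(t[126], 1)              # a^-1 = (T_126)^2

a = (1 << 50) | (1 << 3) | 1
assert mul(a, inverse(a)) == 1
```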
6.2 Arithmetic in the Extension Field \(\mathbb {F}_{q^2}\)
The elements of \(\mathbb {F}_{q^2}\) can be represented as polynomials \(a_1u + a_0\), with coefficients \(a_0, a_1 \in \mathbb {F}_q\). Therefore, we need two 128-bit registers to represent them. The extension field arithmetic can be implemented from the base field arithmetic, using the identities presented in [28].
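As an illustration of these identities, here is a Python sketch of \(\mathbb {F}_{q^2}\) multiplication (exposition only), assuming the quadratic extension is built as \(\mathbb {F}_q[u]/(u^2+u+1)\) as in [28]; Karatsuba brings the cost down to 3 base field multiplications:

```python
M, F = 127, (1 << 127) | (1 << 63) | 1   # base field F_{2^127}

def clmul(a, b):
    # Bit-serial carry-less multiplication (reference model of PMULL).
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return c

def red(c):
    # Naive reduction mod f(z) = z^127 + z^63 + 1.
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c

def mul2(a, b):
    """(a0 + a1*u)(b0 + b1*u) with u^2 = u + 1, via Karatsuba:
    c0 = a0*b0 + a1*b1, c1 = (a0 + a1)(b0 + b1) + a0*b0."""
    a0, a1 = a
    b0, b1 = b
    t0 = clmul(a0, b0)
    t1 = clmul(a1, b1)
    t2 = clmul(a0 ^ a1, b0 ^ b1)
    return red(t0 ^ t1), red(t2 ^ t0)

assert mul2((0, 1), (0, 1)) == (1, 1)    # u * u = u + 1
```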
In [27], Oliveira et al. present an algorithm for simultaneously reducing both coefficients of an element in \(\mathbb {F}_{q^2}\) at the cost of only a single base field reduction. We have included the Arm Neon implementation in Algorithm 7.
However, the reduction algorithm requires the field elements to be kept in an interleaved representation. For an \(a\in \mathbb {F}_{q^2}\), let \(a_0, a_1\) be the 128-bit registers storing each of its coefficients. Then the interleaved representation of a is
\(a'_0 = \{a_0[0], a_1[0]\}, \quad a'_1 = \{a_0[1], a_1[1]\}. \qquad (3)\)
Note that \(a'_0\), \(a'_1\) store \(a_0\) in the lower half and \(a_1\) in the upper half. The larger input to the reduction algorithm must also be in an interleaved representation. Let \(c_0, c_1, c_2, c_3\) be the non-interleaved 128-bit registers storing the result of a polynomial multiplication or squaring. Because this result is computed using the identities from [28], it is already reduced as a polynomial in u, but has coefficients that need further reduction in \(\mathbb {F}_q\): \(c_0\), \(c_1\) store the unreduced constant coefficient and \(c_2\), \(c_3\) the coefficient of u. The interleaved representation \(c'_0, c'_1, c'_2, c'_3\) of these unreduced coefficients continues the pattern from (3): \(c'_0\), \(c'_1\) are \(c_0\), \(c_2\) interleaved, and \(c'_2\), \(c'_3\) are \(c_1, c_3\) interleaved.
In order to reap the benefits of the reduction algorithm, we had to implement the \(\mathbb {F}_{q^2}\) arithmetic directly in the interleaved representation. To do this, we manually merged and interleaved the base field algorithms that compute the identities from [28]. The only exception is inversion, where a standard base field inversion is used as a subroutine, which again uses all the arithmetic operations discussed in the previous section. While the abstraction between base field and extension field is somewhat broken for the sake of performance, the base field implementation remains the crucial foundation.
7 Results and Discussion
Our implementations, together with SAGE scripts for verification and operation counts, can be found at https://github.com/dfaranha/gls254.
7.1 Operation Counts for Binary GLS Scalar Multiplication
Throughout our work, we have used field operation counts as a measure of the complexity of the scalar multiplication variants. With an understanding of the relative cost of the operations, the count gives a platform-independent estimate of the relative performance of the algorithms. In particular, it guided our choice of window size. However, it crucially does not capture the space-time trade-offs of a particular architecture. For the variants discussed in this paper, which are precisely different strategies for how to use space, this trade-off has a significant impact. This will be apparent in the next subsection.
Table 3 gives an overview of the operation counts for the variants. Additions and multiplications by the curve coefficients are ignored due to their insignificant impact on performance. We include the costs of the 1DT algorithm without the new formulas and scalar decomposition to highlight the impact of our contributions. For the sake of fairness, it has been modified to be exception-free in the same way as the others.
As expected, the 2DT algorithm spends more time on precomputation and less in the main loop. We see that the model predicts \(w=5\) to be the sweet spot for 1DT and \(w=4\) for 2DT. Notably, 2DT \(w=3\) is predicted to be faster than 1DT \(w=5\) using only half the amount of space.
For a simpler comparison, we estimate the total cost in terms of field multiplications. The relative cost of each operation will of course differ from processor to processor, so we aimed for a middle ground based on our benchmarks. With this simplification, the model predicts that 1DT with \(w=5\) should be \(2.7\%\) faster as a result of our contributions. The 2DT approach with \(w=4\) is predicted to be \(7.2\%\) faster than 1DT in previous work with \(w=5\), and \(4.6\%\) faster than 1DT with \(w=5\) from this work. Non-constant-time field inversion is used to convert points from projective to affine in the precomputation table only, since it does not depend on the (secret) scalar.
7.2 Implementation Timings
We start by describing our benchmarks for the Armv8 AArch64 implementation, written from scratch. We used the ODROID C4 single board computer, as we wanted a smaller device representative of the majority of Arm devices. It comes with a Quad-Core Cortex-A55, which is considered a mid-range processor. We employ clang from LLVM 13 with optimization level -O3.
The benchmarks for our field implementation are presented in Table 4. Notice that non-constant-time inversions that use lookup tables are roughly \(33\%\) faster.
Table 5 presents our scalar multiplication timings in GLS254 and comparisons to related work. Compared to Intel, there are few efficient implementations specialized for Arm at the 128-bit security level. Four\(\mathbb {Q}\) [10] is the closest competitor on Intel, and they also provide specialized implementations for Arm [25]. We benchmarked their Armv8 AArch64 implementation on our machine and included their Armv7 timings from [25] for the sake of fairness. A notable outlier is Lenngren's implementation of Curve25519, which is a much closer competitor on Arm than any Curve25519 implementation on Intel.
As the first GLS254 implementation for Armv8, we are able to claim the constant-time scalar multiplication speed record by \(34.5\%\) in comparison to the previous state of the art. Contrary to the operation counts in Table 3, it is the 2DT \(w=3\) algorithm that is the superior variant. We do not compare against [24] due to the radically different choices of target platform.
For our Intel implementation, we extended the AVX-accelerated code from [31] with the new formulas and 2DT variant. Due to space limitations, we report timings for field arithmetic in the Appendix of the full version [1]. The scalar multiplication benchmarks are presented in Table 6. We benchmarked our code on an older Core i7 4770 Haswell processor, and a Core i7 7700 Kaby Lake as the closest to the Skylake in [27]; both using clang from LLVM 13 and optimization level -O3. For 1DT \(w=5\), we achieve small speedups of 4.8% on Haswell and 4.1% on Skylake over the previous state of the art. The 2DT \(w=3\) variant achieves a further speedup of 2%. Surprisingly, 2DT \(w=4\) performs relatively poorly due to expensive conditional moves within the longer linear pass. The cumulative speedup over previous work is around 6% on both machines. In comparison to Four\(\mathbb {Q}\), our timings are 24% faster and set a new speed record for constant-time scalar multiplication on Intel processors.
Removing the linear pass for the sake of experimentation, 2DT \(w=4\) outperforms the other variants on all the platforms benchmarked, as predicted by the operation counts. The relative cost of the linear pass seems to be the determining factor for how much of a speedup the new approach yields in practice. In general, the linear pass incurs a relatively high cost, favoring solutions that minimize the size of the lookup table. Hence, we see that the 2DT \(w=3\) variant claims the speed record across the board, using only half as much space as the previous record holder 1DT \(w=5\).
References
Aardal, M.A., Aranha, D.F.: 2DT-GLS: faster and exception-free scalar multiplication in the GLS254 binary curve. Cryptology ePrint Archive, Paper 2022/748 (2022). https://eprint.iacr.org/2022/748
Ahmadi, O., Hankerson, D., Rodríguez-Henríquez, F.: Parallel formulations of scalar multiplication on Koblitz curves. J. UCS 14(3), 481–504 (2008)
Bernstein, D.J.: Curve25519: new Diffie-Hellman speed records. In: Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 207–228. Springer, Heidelberg (2006). https://doi.org/10.1007/11745853_14
Bernstein, D.J., Lange, T.: SafeCurves: choosing safe curves for elliptic-curve cryptography. https://safecurves.cr.yp.to/
Bos, J.W., Costello, C., Hisil, H., Lauter, K.: High-performance scalar multiplication using 8-dimensional GLV/GLS decomposition. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 331–348. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40349-1_19
Bos, J.W., Costello, C., Longa, P., Naehrig, M.: Selecting elliptic curves for cryptography: an efficiency and security analysis. J. Cryptogr. Eng. 6(4), 259–286 (2016)
Bos, J.W., Kleinjung, T., Niederhagen, R., Schwabe, P.: ECC2K-130 on cell CPUs. In: Bernstein, D.J., Lange, T. (eds.) AFRICACRYPT 2010. LNCS, vol. 6055, pp. 225–242. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12678-9_14
Chi, J.-J., Oliveira, T.: Attacking a binary GLS elliptic curve with Magma. In: Lauter, K., Rodríguez-Henríquez, F. (eds.) LATINCRYPT 2015. LNCS, vol. 9230, pp. 308–326. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22174-8_17
Ciet, M., Lange, T., Sica, F., Quisquater, J.-J.: Improved algorithms for efficient arithmetic on elliptic curves using fast endomorphisms. In: Biham, E. (ed.) EUROCRYPT 2003. LNCS, vol. 2656, pp. 388–400. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-39200-9_24
Costello, C., Longa, P.: Four\(\mathbb{Q}\): four-dimensional decompositions on a \(\mathbb{Q}\)-curve over the Mersenne prime. IACR Cryptol. ePrint Arch. 565 (2015)
Faz-Hernández, A., Longa, P., Sánchez, A.H.: Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves (extended version). J. Cryptogr. Eng. 5(1), 31–52 (2015)
Galbraith, S.D., Gaudry, P.: Recent progress on the elliptic curve discrete logarithm problem. Des. Codes Cryptogr. 78(1), 51–72 (2016)
Galbraith, S.D., Lin, X., Scott, M.: Endomorphisms for faster elliptic curve cryptography on a large class of curves. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS, vol. 5479, pp. 518–535. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01001-9_30
Gallant, R.P., Lambert, R.J., Vanstone, S.A.: Faster point multiplication on elliptic curves with efficient endomorphisms. In: Kilian, J. (ed.) CRYPTO 2001. LNCS, vol. 2139, pp. 190–200. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44647-8_11
Gaudry, P., Hess, F., Smart, N.P.: Constructive and destructive facets of Weil descent on elliptic curves. J. Cryptol. 15(1), 19–46 (2002)
Gouvêa, C.P.L., López, J.: Implementing GCM on ARMv8. In: Nyberg, K. (ed.) CT-RSA 2015. LNCS, vol. 9048, pp. 167–180. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16715-2_9
Hankerson, D., Karabina, K., Menezes, A.: Analyzing the Galbraith-Lin-Scott point multiplication method for elliptic curves over binary fields. IEEE Trans. Comput. 58(10), 1411–1420 (2009)
Hess, F.: Generalising the GHS attack on the elliptic curve discrete logarithm problem. LMS J. Comput. Math. 7, 167–192 (2004)
Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses in GF(\(2^m\)) using normal bases. Inf. Comput. 78(3), 171–177 (1988)
Joye, M., Tunstall, M.: Exponent recoding and regular exponentiation algorithms. In: Preneel, B. (ed.) AFRICACRYPT 2009. LNCS, vol. 5580, pp. 334–349. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02384-2_21
Kales, D., Rechberger, C., Schneider, T., Senker, M., Weinert, C.: Mobile private contact discovery at scale. In: USENIX Security Symposium, pp. 1447–1464. USENIX Association (2019)
Koblitz, A.H., Koblitz, N., Menezes, A.: Elliptic curve cryptography: the serpentine course of a paradigm shift. J. Number Theory 131(5), 781–814 (2011)
Lenngren, E.: AArch64 optimized implementation for X25519. https://github.com/Emill/X25519-AArch64
Liu, Z., Longa, P., Pereira, G.C.C.F., Reparaz, O., Seo, H.: Four\(\mathbb{Q}\) on embedded devices with strong countermeasures against side-channel attacks. In: Fischer, W., Homma, N. (eds.) CHES 2017. LNCS, vol. 10529, pp. 665–686. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66787-4_32
Longa, P.: Four\(\mathbb{Q}\)NEON: faster elliptic curve scalar multiplications on ARM processors. In: Avanzi, R., Heys, H. (eds.) SAC 2016. LNCS, vol. 10532, pp. 501–519. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69453-5_27
McLoughlin, M.B.: addchain: Cryptographic Addition Chain Generation in Go, October 2021. Repository https://github.com/mmcloughlin/addchain
Oliveira, T., López-Hernández, J.C., Aranha, D.F., Rodríguez-Henríquez, F.: Improving the performance of the GLS254. Presentation at CHES 2016 Rump Session (2016)
Oliveira, T., López-Hernández, J.C., Aranha, D.F., Rodríguez-Henríquez, F.: Two is the fastest prime: lambda coordinates for binary elliptic curves. J. Cryptogr. Eng. 4(1), 3–17 (2014)
Oliveira, T., López-Hernández, J.C., Cervantes-Vázquez, D., Rodríguez-Henríquez, F.: Koblitz curves over quadratic fields. J. Cryptol. 32(3), 867–894 (2019)
Renes, J., Costello, C., Batina, L.: Complete addition formulas for prime order elliptic curves. In: Fischlin, M., Coron, J.-S. (eds.) EUROCRYPT 2016. LNCS, vol. 9665, pp. 403–428. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49890-3_16
Resende, A.C.D., Aranha, D.F.: Faster Unbalanced Private Set Intersection. In: Financial Cryptography. LNCS, vol. 10957, pp. 203–221. Springer (2018)
Smith, B.: Easy scalar decompositions for efficient scalar multiplication on elliptic curves and genus 2 Jacobians. CoRR abs/1310.5250 (2013)
Acknowledgements
We thank Aurore Guillevic for discussions about preliminary results of this work, and Jonas Tambjerg Morthorst for help on the early Arm implementation. This work was partially supported by the Danish Independent Research Council under the project 1026-00350B (RENAIS).
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Aardal, M.A., Aranha, D.F. (2024). 2DT-GLS: Faster and Exception-Free Scalar Multiplication in the GLS254 Binary Curve. In: Smith, B., Wu, H. (eds) Selected Areas in Cryptography. SAC 2022. Lecture Notes in Computer Science, vol 13742. Springer, Cham. https://doi.org/10.1007/978-3-031-58411-4_3