WIP: "safegcd" field and scalar inversion #767
Conversation
- see "Fast constant-time gcd computation and modular inversion" by Daniel J. Bernstein and Bo-Yin Yang: https://gcd.cr.yp.to
I don't have time to look at this right now but this is very neat! Related to #730.
I should add that for variable time inversion we can just make a simple rewrite of _divsteps_62 and it's already faster than the existing one: field_inverse: min 2.10us <-- (actually a variable-time implementation). _divsteps_62 is also an ideal target for asm, with a bunch of conditional swaps and negates (although I'm not sure how smart gcc is already being about them).
I see a few TODOs in the code but nothing substantial. Can you comment on how much this is "WIP" (beyond the fact that 32-bit is missing)?
An interesting data point is #726 (comment), which shows that current versions of gcc -O3 outperform our ASM (but we switched to a default of -O2 a few months ago). So ASM is in general getting less and less interesting for us.
It's conceptually complete at the top level. So besides general review, the main TODO is a thorough bounds analysis (the bounds are quite tight for performance), and in a couple of places I expect that could require an extra carry step or two. The code also uses arithmetic right shift as things stand (although it's not used in the main inner loop), which I guess will provoke discussion. Of course there's the 32-bit version, and some organisational work to share common code in a sensible way. Apart from the code itself, we would probably want to get some outside opinions from mathematicians on the paper's proofs (meaning no offence to the authors), as it is less obvious than Fermat inversion. Indeed, in Appendix F.2 of the paper the authors show that an earlier analysis of a similar algorithm ("A binary recursive gcd algorithm", Stehlé and Zimmermann) was incorrect.
I wonder if you measured against Niels Möller's constant-time inversion in Nettle/Botan/GMP? The Yang-Bernstein GCD seems to require significantly more operations compared to algorithm 5 of "Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine". I've outlined my performance concerns in Botan issue randombit/botan#1479 (comment).
@mratsim I have not, but frankly I don't think it could improve on these results, because it is performing full-length operations on the operands in each loop iteration. This has been the fundamental performance barrier for all of the binary-GCD-style algorithms that several contributors have attempted here and elsewhere over the years. So I am focused now on the 2-adic division, "bottom bits" algorithms. This PR performs 62 iterations at a time on the "bottom bits" only (see the ..._divsteps_62 method(s)), and only updates the full-sized operands 12 times. It's true safegcd has more total iterations (744 vs 512), but they are much, much cheaper. safegcd also defers the full u/v calculation to the post-processing phase (about 25% of total time in this PR), where the matrices can be combined very efficiently. At 256 bits, I suppose it's possible that the NEON vector instructions dramatically change the accounting, and perhaps something similar is possible with x86_64 SIMD extensions; I would be very interested to see the results of either. Of course there are opportunities for SIMD to improve this PR also.
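To make the "divstep" idea concrete, here is a minimal, hedged sketch of a single branchless divstep on word-sized values, written in the same mask-based style as the PR's _divsteps_62 (the function names and word-sized simplification are mine, not the PR's; the real code batches 62 of these at a time and accumulates a transition matrix rather than touching the full operands). It assumes arithmetic right shift of signed values, which this thread discusses later.

```c
#include <assert.h>
#include <stdint.h>

/* One branchless divstep on word-sized f (odd) and g, with eta = -delta as in
 * the PR. The mask m selects the "swap and negate" case (eta < 0, g odd). */
static void divstep(int64_t *eta, int64_t *f, int64_t *g) {
    int64_t m, t, odd;
    m = (*eta >> 63) & -(*g & 1);   /* all ones iff eta < 0 and g is odd */
    t = (*f ^ *g) & m;              /* conditional swap (f, g) <- (g, f) */
    *f ^= t; *g ^= t;
    *g = (*g ^ m) - m;              /* conditional negate: g <- -g if m */
    *eta = ((*eta ^ m) - m) - 1;    /* eta <- (-eta if m), then decrement */
    odd = -(*g & 1);
    *g = (*g + (*f & odd)) >> 1;    /* g <- (g + (g odd ? f : 0)) / 2 */
}

/* Demo helper: run a fixed number of divsteps; g is driven to 0 and |f| is
 * then gcd(f0, g0). 128 iterations is ample for small demo inputs. */
static int64_t gcd_via_divsteps(int64_t f, int64_t g) {
    int64_t eta = -1;               /* delta starts at 1 in the paper */
    int i;
    for (i = 0; i < 128; ++i) divstep(&eta, &f, &g);
    return f < 0 ? -f : f;
}
```

The point of the full algorithm is that these cheap steps only look at the bottom bits, so 62 of them can run on single words before the multi-limb operands are updated once.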
VERY COOL. One comment on correctness is that inversions are easy and fairly cheap to test at runtime, so at least flaws can be turned into clean crashes instead of incorrect results-- for whatever consolation that is. :) |
One could even implement code that multiplies to check the inverse, and if it's wrong redoes the operation using the ladder. This would be a fringe timing channel if it were triggerable (but it's easy to test to be confident that it's rare, and to randomize it if there are untrusted inputs to a constant time version), so I think that wouldn't be a problem. This would essentially guarantee algorithmic correctness at little cost, assuming the software just didn't have an ordinary programming flaw. So I think fear about correctness and a shortage of computational number theorists need not stand in the way of this work.
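The check-and-fallback idea can be sketched with small integers (a toy mod a word-sized prime, with function names of my own invention; the real thing would use the library's field/scalar types and a constant-time Fermat ladder):

```c
#include <assert.h>
#include <stdint.h>

/* Toy modular arithmetic for a small prime p (operands assumed < 2^32). */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) { return a * b % p; }

/* Square-and-multiply ladder: a^e mod p, used as the fallback inversion
 * (Fermat: a^(p-2) mod p for prime p). */
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    while (e) {
        if (e & 1) r = mulmod(r, a, p);
        a = mulmod(a, a, p);
        e >>= 1;
    }
    return r;
}

/* Accept a candidate inverse only if a * candidate == 1 (mod p); otherwise
 * redo the inversion with the ladder. The check is one multiplication. */
static uint64_t inv_checked(uint64_t a, uint64_t candidate, uint64_t p) {
    if (mulmod(a, candidate, p) == 1) return candidate;
    return powmod(a, p - 2, p);
}
```

The fallback path only triggers on a wrong candidate, which is the fringe timing channel discussed above.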
- definitely needs bounds analysis
There are now 32-bit versions for field and scalar inversion. They need bounds analysis, carry chains checked, etc., but these are the current timing comparisons with 32-bit scalar and field configured (bearing in mind the test machine is x86_64): master: safegcd_inv:
src/field_5x52_impl.h (Outdated)
    r[4] = a4 >> 40;
}

static int secp256k1_fe_divsteps_62(uint16_t eta, uint64_t f0, uint64_t g0, int64_t *t) {
Just one thing I noticed when skimming through it: the types of eta are pretty inconsistent. You use uint16_t inside the function, return an int, and then store it in an int16_t.
Sure, it should be cleaned up. It's logically a signed value (~ in [-744,744]) in the sense that the algorithm as-written switches on the sign of it, but the actual constant-time code for _divsteps is in "bit-twiddling" style, and I wanted to make it clear that there's no reliance (in this method) on arithmetic right-shift. Conceptually maybe it's (u)int_fast16_t, but I was a bit lazy about dealing with the variable sizes before things worked. Possibly uint64_t (resp. uint32_t for 32bit) makes more sense.
src/field_5x52_impl.h (Outdated)
    z = (v ^ r) & c1;
    v ^= z; r ^= z; r ^= c1; r -= c1;

    eta = (eta ^ (uint16_t)c1) - (uint16_t)c1 - 1;
On a related note, seeing integer types smaller than int makes me somewhat nervous. Most people looking at this line for example won't notice that all the arithmetic here is performed on (signed) int due to integer promotion rules. (I believe that's not an issue in this line, it's just a general remark.)
Am I just missing a U (i.e. 1U), or else could you explain?
The way integer operations (and this includes bitwise as well as unary operators, shifts and comparisons) are done in C can be pretty unexpected for types smaller than int: before any other rules apply, "Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int" [1]. On our targets, int is the same as int32_t, so a uint16_t for sure fits in an int. The possibility that an unsigned integer is converted to a signed integer is fucked up because signed overflow is undefined behavior. And compilers have been reported to rely on this [2, 3].
For this line here, this means eta ^ (uint16_t)c1 is evaluated by first converting eta and (uint16_t)c1 to int and then performing the ^. Then for the -, the right operand (uint16_t)c1 is converted to int. In those cases, it's easy to see that it doesn't make a difference, at least for our targets.
In general, if you ignore theoretical cases like int being 17 bits wide, it seems (without a proof) that the only interesting operations are multiplication, left shifts, right shifts, and comparisons: the former two because of possible overflows, comparisons because of issues with comparing signed and unsigned values, and right shifts because it's relevant whether the left operand is signed or unsigned.
The real fuckup here is that the stdint.h types were supposed to be used for portable arithmetic, but portable arithmetic is not possible in C because all integer arithmetic, no matter what type, implicitly depends on the size of int by the above rule. Our code would probably blow up on a target where int is 64 bits. (Yes, this exists.) The typical trick to work around this is to start the arithmetic with an unsigned int value, e.g., (1U * eta) ^ (uint16_t)c1. No need to point out that this is ugly.
[1] https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules
[2] https://stackoverflow.com/questions/24795651/whats-the-best-c-way-to-multiply-unsigned-integers-modularly-safely the link is for C++ but the same rules apply in this case
[3] https://www.cryptologie.net/article/418/integer-promotion-in-c/
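A small self-contained illustration of the promotion hazard and the 1U workaround (toy values of my own choosing, not the PR's code):

```c
#include <assert.h>
#include <stdint.h>

/* On targets where int is 32 bits, uint16_t operands are promoted to
 * (signed) int before arithmetic. A bare (a * a) with a == 0xFFFF would be
 * int * int and overflow int32, which is undefined behavior. Leading the
 * expression with an unsigned int value forces unsigned arithmetic, which
 * wraps modulo 2^32 and is well-defined. */
static uint32_t square_u16(uint16_t a) {
    return (1U * a) * a;   /* unsigned throughout; no signed overflow */
}
```

The 0xFFFF case is exactly the kind of input where the undecorated expression would be UB while this version returns the expected 0xFFFE0001.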
iMX.6 (arm7) Before: After: I suppose I shouldn't let the fact that it only matched GMP for the variable time here distract me from the fact that it's 10x faster for the constant time scalar inverse. This will be a big speedup for GMP-less ECDSA validation and general signing on ARM. Edit: ah, I guess it's not actually using a variable time version of this algo for the variable time ones? Even using the constant time would still be a big speedup for the non-gmp builds. It would sure be nice to get rid of optional linking against gmp.
Gmp-less comparison. iMX.6 (arm7) no-asm iMX.6 (arm7) with-asm Obviously this speedup would be greater w/ a variable time version of the new code in use.
@gmaxwell Agreed that random blinding can be useful in the ways you describe. Still, it starts to lean towards your point elsewhere that one could just use the variable-time alg. with blinding anyway. Yes, there's no variable-time in the PR yet. I gave an example timing above from minor changes to _divsteps_62. I did experiment a bit further with the 64-bit version and got var-time field inverse down to around ~1.75us, vs. ~3.25us for const-time and ~2.2us for the gmp one. That's already enough to dispense with gmp for 64-bit, but I'm sure it can be improved further. For the arm7 numbers (thanks - I should really try and get a hardware setup for this), I am actually a bit surprised at how slow the "Before" field_inverse is (especially relative to gmp for inv_var). I guess this has to be down to fe_mul, fe_sqr being slow (slow mul instruction?), which partially affects the timings for this PR. As things stand, beating 32-bit gmp on inv_var would appear to require about a 2.5x speedup over this PR. Could you please satisfy my curiosity and give the full output of bench_internal on that arm7? I also have some possible variations of _fe_mul that I could PR, which it would be very helpful to get measured on real devices.
Hello, https://github.com/JeanLucPons/VanitySearch/blob/3ba22e9072db540a445d37471e7969bff92fa504/IntMod.cpp#L380 |
iMX.6 with ASM: iMX.6 with no-asm:
Yes it's quite similar, but the "divstep" used here should ultimately be faster than the approach there. Section 8 of the safegcd paper discusses some other variants and why they're slower (at least for the constant-time case, but probably in general). I'd suggest you check particularly 8.4, which shows that you have to be very careful about the "divstep" you use, since merely clearing 0s and combining to get more 0s is not sufficient to guarantee termination. The comments suggest this is binary xGCD, but I would usually take that to include a size comparison between u, v on each iteration, which that code isn't doing. So I guess it's actually a variant of Stehlé and Zimmermann's "Binary Recursive Gcd"?
The "divstep62" used in VanitySearch is a slightly optimized signed bxcd loop to get zeros a bit faster. |
Hello, |
@JeanLucPons aside, you can benchmark this code easily by applying the patch and running bench_internal. :) This doesn't totally answer your question since he hasn't posted those variable time optimizations, but you could scale his figures based on comparing the constant time numbers.
@JeanLucPons It's an i7 (Haswell) @2.6GHz. The constant-time algorithm does 744 iterations because that's (more than) enough to guarantee completion, but the average number of iterations to send g to 0 is actually ~532. In your code I would first try making the inner loop branchless. If you are able to use __builtin_ctzll or equivalent, then the while(even) loop can be rewritten very simply. However I would suggest you turn your attention first to whether the algorithm is correct, i.e. is it guaranteed to terminate for all inputs? A typical Euclidean algorithm always reduces the larger of two variables, and so will terminate. This is the case e.g. for Binary GCD also. A weirder way of saying that is that each iteration produces 0s in the high bits of whichever variable currently has fewer. The 2-adic ("bottom bits") algorithms - including this PR, "binary recursive GCD", Algorithm PM ("Plus-or-Minus") - instead produce 0s in the low bits of whichever variable currently has fewer. This is what the "eta" variable is tracking in this PR (in the paper it's "delta"; eta is just -delta). Even with this property, one has to be careful, as section 8.4 of the safegcd paper explains. AFAICT your algorithm doesn't have a property like this; it alternates between u, v without regard for which of the two variables has made more progress. Are you aware of a proof that it does always terminate? The same question applies to the "BXCD" algorithm there.
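To illustrate the __builtin_ctzll suggestion, here is a hedged, word-sized sketch of the eta-tracking divstep loop in variable-time form (my own simplification, not the PR's _divsteps_62; it relies on arithmetic right shift of signed values, as discussed elsewhere in this thread, and __builtin_ctzll is a GCC/Clang builtin):

```c
#include <assert.h>
#include <stdint.h>

/* Variable-time divstep iteration on word-sized f (odd) and g, eta = -delta.
 * Runs of even g are consumed in a single step via a trailing-zero count;
 * the while(g != 0) guard keeps __builtin_ctzll's argument nonzero. */
static int64_t gcd_divsteps_var(int64_t f, int64_t g) {
    int64_t eta = -1;                          /* delta starts at 1 */
    while (g != 0) {
        int zeros = __builtin_ctzll((uint64_t)g);
        g >>= zeros;                           /* g is now odd */
        eta -= zeros;
        if (eta < 0) {                         /* delta > 0: swap and negate */
            int64_t t = f;
            eta = -eta; f = g; g = -t;
        }
        g = (g + f) >> 1;                      /* f, g both odd: exact halving */
        eta -= 1;
    }
    return f < 0 ? -f : f;                     /* |f| is the gcd */
}
```

Because eta only moves toward the swap case while zeros are cleared, termination follows the same argument as the constant-time divstep; that is exactly the property the comment above says the alternating u/v scheme lacks.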
This is really nice! I've skimmed the paper and the code, and this is my understanding (which may help other reviewers):
Some overall comments:
Follow-up questions for understanding:
@peterdettman Anyway, in 99.9999999...% of cases my implementation ends in less than 11 divstep62 calls, and I know there exists a small class of numbers where this limit is overpassed and even exceeds 12. I don't know if it is the same in your implementation; I must admit that theorem 11.2 is unclear to me.
Technically that's implementation-defined, but I can't find a reference to a platform where a right shift of a negative value is not implemented as an arithmetic shift. We already rely on it here since a recent PR: secp256k1/src/ecmult_const_impl.h, line 19 in d567b77
(Okay, there it could easily be avoided by casting to unsigned first.) So I don't think that's a big issue. If this can be avoided for free, yes, we should do it. But otherwise I'd simply accept this. My understanding of the algorithm is not good enough but I suspect it cannot be avoided at zero cost? We have multiple other undocumented assumptions in the code base, e.g., about sizes of some types. I guess we should just document those using a CHECK somewhere that asserts this, e.g., like Bitcoin Core does: |
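If the implementation-defined shift ever had to be avoided, a branchless substitute is cheap outside the hot loop (a sketch of my own; whether it is truly free in the inner loop is exactly the open question above):

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift of a signed 64-bit value without ever shifting a
 * negative signed operand (which is implementation-defined in C). For x < 0,
 * uses the identity  x >> n  ==  -((~x) >> n) - 1  on unsigned bit patterns. */
static int64_t asr64(int64_t x, unsigned n) {
    if (x >= 0) return (int64_t)((uint64_t)x >> n);
    return -(int64_t)(~(uint64_t)x >> n) - 1;
}
```

A CHECK at startup, as suggested above, is the cheaper alternative: assert once that (-1 >> 1) == -1 and keep the plain shifts.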
We discussed random blinding here. I just found this talk which demonstrates a sidechannel attack on the subsequent gcd algorithm if the blinding multiplication is done over the integers (without mod): https://youtu.be/ORGCHkSInjs?list=PLeeS-3Ml-rpqyNMiXWuheOmKAgCkUaass&t=1450 It's not directly related to the discussion here but I found it interesting. |
See rebased/squashed version in #831. |
A few comments to test my understanding.
    fn = f[len - 1];
    gn = g[len - 1];

    cond = ((int32_t)len - 2) >> 31;
Am I right in thinking that (cond == 0) corresponds to (len >= 2 && (fn == 0 || fn == -1) && (gn == 0 || gn == -1))?
Yes. i.e. if the length is more than 1, and the top words of both f and g contain only a sign bit, then we can shorten the length and keep the sign(s) in the top bits of the new top word(s).
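My reading of that condition, as a self-contained sketch (helper name mine; assumes arithmetic right shift, as elsewhere in the PR):

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero iff the arrays cannot be shortened: i.e. len < 2, or the top limb
 * of f or g holds anything other than a pure sign word (0 or -1). */
static int32_t shorten_cond(int32_t len, int32_t fn, int32_t gn) {
    int32_t cond = (len - 2) >> 31;   /* all ones iff len < 2 */
    cond |= fn ^ (fn >> 31);          /* zero iff fn == 0 or fn == -1 */
    cond |= gn ^ (gn >> 31);          /* zero iff gn == 0 or gn == -1 */
    return cond;
}
```

The x ^ (x >> 31) trick works because x >> 31 sign-extends to 0 or -1, and x equals its own sign-extension exactly when x is 0 or -1.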
I should clarify that whilst d, e, f, g are arrays of signed values, only the top word is ever negative.
Is there an easy argument why that's always the case, even with a modulus that's encoded using negative limbs?
I overstated it; f can be initialised with negative limbs from the modulus. The update methods only leave a negative value in the top limb. _update_fg is always called at least once.
Got it.
    cond |= fn ^ (fn >> 31);
    cond |= gn ^ (gn >> 31);

    if (cond == 0) {
This clause is there so that when the top nonzero element is 0 or -1, we stuff it into bit 29 of the element below?
How is it guaranteed that at the end this 29th bit is gone (in scalar_encode)?
The sign bit is stuffed into bits 30, 31 - I'm not sure where 29 comes from. Are you asking about secp256k1_fe_decode_30 (since we are in field code here)? f, g are never decoded.
Sorry for the confusion, let me restate:
- All the inner arithmetic supports 31-bit values (+ sign bit).
- Generally both d/e and f/g only store 30 significant bits per limb.
- Except when the top limbs of f/g are both 0 or -1, in which case the sign is stuffed into the previous limb, possibly pushing that one to 31 bits.
- d/e strictly only have 30 bits per limb, and only the top limb can be negative (otherwise the decode functions would fail).

Analogously for the 64-bit version, with s/30/62/.
My bounds analysis for the update methods was based on s31 values in the matrices produced by _divsteps_30, and s30 values in d, e, f, g. The update methods currently only leave negative values in the top limb of d, e, f, g, but the modulus (and therefore initial f) might have negatives elsewhere.
Your questions have made me realize that the dynamic shortening of f, g can result in a top limb of f or g (or both) that is s31. Probably not an issue, but it might be best to redo bounds analysis for the update methods.
It would be nice to add code comments for all these clarifications.
@real-or-random Definitely. |
     * divsteps are needed. */
    eta = -(uint64_t)1;

    for (i = 0; i < 12; ++i) {
Is there no point in taking advantage of the fact that 741 iterations is enough? (this loop looks like it supports 12*62 = 744 iterations).
@gmaxwell and I have been trying to find example inputs that need a large number of iterations (using a beam search that tries to find low bits that increase the average iteration count over many high bits that follow), but the best so far is 0x4212888e98e92e1c5384b37d0bf0655ec3bbad64381e713d6d6b0490d4858c83 (modulo the field order p), which needs 637 iterations.
I've also worked on some code that uses partial evaluation and interval reasoning to split up the range of inputs that have a large bound (using the log2(sqrt(f^2 + 4*g^2)) formula from the paper), until better bounds for the entire range are found. If my code is correct, both the field and scalar logic only need 737 iterations at most - but proving anything better will take a long time.
Some scratch code is here: https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea
For a limit of 741 I doubt there's much to be gained by saving 3 divsteps, and I prefer the simplicity. Maybe for the 32-bit version saving 9 divsteps could be more viable. Two example approaches: 1) more code (alternate divsteps method plus extra calling code) -> careful about benchmark results neglecting e.g. the instruction cache impact, or branch prediction slot. 2) pass the iteration count to divsteps -> careful not to break compiler's ability to unroll divsteps loop (or handcoded asm).
I do hope that mathematical developments could reduce the max iterations. Checking the limit for the specific prime could gain a few bits as you say. It's unclear to me whether the worst-case input(s) can be generated, but it would be wonderful if "bad" inputs could be cheaply characterised (maybe based on something like the Hamming weight in NAF), then cheaply "fixed" in constant-time (something simple like multiplication by a small constant), so that a new limit would apply.
If we call b(x,y) = ⌊(49/17)·log2(√(x² + 4y²))⌋, with appropriate corrections for small inputs, then I believe these bounds are correct:
- δ = 1: iter ≤ b(|f|, |g|) [from the paper]
- δ < 1: iter ≤ 1 − δ + b(|f|, ((2^(1−δ) − 1)|f| + |g|) / 2^(1−δ)) [δ will only go up the next 1−δ steps, and in each, the g argument is either halved or averaged with f; after that many steps, this formula is an upper bound (corresponding to always averaging)]
- δ > 1: iter ≤ 1 + δ + b(|g|, |g| + (|f| + |g|)/2^(1+δ)) [by assuming the next step will have an odd g and using the previous rule, but correcting for the fact that more δ-incrementing even-g steps can happen in between]
With those rules, it seems like I've proven that no input (for either the field or scalar order modulus) ever needs more than 736 iterations, using the code above (which finds a partition of the input range based on fixing low bits, and uses partial evaluation + the bounds above to find a bound ≤ 736 for each).
Update: we've proven that no inputs for either the field or scalar order need more than 735 iterations. We could do 734, but I don't think it's worth the effort, as by extrapolation it would take around 16 years of CPU time (which is doable, but needs some effort to distribute the work across network nodes etc.) - which means that even if proven, the computation would be hard to repeat. The ≤736 bound takes a couple of minutes to prove with the (somewhat involved) C++ code, and a few hours in Python.
Bound | Proof steps for field | Proof steps for scalar |
---|---|---|
≤741 | 1 | 1 |
≤740 | 21 | 21 |
≤739 | 85 | 85 |
≤738 | 17835 | 17835 |
≤737 | 6753276 | 6753276 |
≤736 | 2879322241 | 2879322241 |
≤735 | 1216906236685 | 1216976994782 |
The ≤735 proof takes 80 minutes on a 128 core host.
I wouldn't mind doing the ≤734 -- even without networking code -- if it would actually be even slightly useful, but it'll take me about a month realtime that way (for each of field and scalar; I could do it in 4 or 5 days with a networked version). ≤733 -- by extrapolation -- would be about 7000 CPU-years and reasonably beyond my ability, but it could be done (e.g. with the help of an actual supercomputer).
I don't think ≤732 is currently provable using all the king's cycles and all the king's mems, at least not using these techniques.
I don't think ≤732 is currently provable using all the king's cycles and all the king's mems, at least not using these techniques.
Surely all the king's men can muster 1M c6g.16xlarge instances for a week (~350M USD).
@sipa curious to see the code used to prove this
@elichai https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea
- fgcd.cpp is a heuristic search for large inputs
- fgcd_recurse.{py,cpp} are provers for a maximum bound
src/scalar_4x64_impl.h (Outdated)
    secp256k1_scalar_reduce(r, secp256k1_scalar_check_overflow(r));

    secp256k1_scalar_add(&u, r, &SECP256K1_SCALAR_NEG_TWO_POW_256);
I'm wondering if there is a good reason to do this in the scalar/field representation rather than the signed30/62 representation? There could be a single addition chain that negates (based on the sign of f[0]) and (conditionally) adds the modulus if the computed inverse is negative.
No good reason - I do it as part of the inverse in the BouncyCastle implementation. As written that requires D to be in [-P, P), which I believe is correct, but so far in secp256k1 I only assumed [-2^256,2^256).
I had a hard time verifying those bounds, so I wrote a (hopefully faithful) reimplementation of the algorithm here in Python, only replacing the limb decomposition with native integers (which are arbitrary length in Python): https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea#file-fgcd-py
Based on the choice of the limb size and moduli, better bounds may hold, but if I make no assumptions (allowing any modulus, and any limb size from 1 bit up to however many bits the modulus is), I can only verify that the "d" value at the end of the divsteps iterations is in [-2P, P). Care to share your reasoning for why it'd be tighter than that?
Hmm, when I restrict it to even limb sizes of at least 6 bits, [-P,P) seems to hold.
Scratch that, inverse of 19934 modulo 21163, with 6 bit limbs, gives d=-23436.
It was a while ago now... I recall doing some exhaustive testing for small values and finding an asymptote for the worst-case at around +/- 5*P/6 (actually just a bit larger), but I'm foggy on the details. We could certainly use more analysis here and it's probably best to go with [-2P, P) for the moment if you can prove that.
I'll think about that... but is that easy/cheap to do? You may need to scan for the first non-zero limb to find the sign; doing that (especially in constant time) may not be that much cheaper than an addition chain to add/subtract a modulus.
It's cheap, the sign is always correct in the top limb for D, E. EDIT: They are in 2's complement basically, on a 2^30 radix to avoid overflow issues in the updates and so that the implicit shift is free.
BTW, I've kind of accepted [-2P, 2P) for the moment. Just looking for a sign-based fixup for update_de; the ones I've tried so far actually give larger worst-case values in random trials, but perhaps they allow a locally-reasoned proof.
It seems this convex hull approximation technique also works for bounding the number of divsteps needed! https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea#file-hull-bound-py
If the code is correct (and it looks like it is for sufficiently small inputs), it proves that for any 256-bit modulus M and input in [0..M), no more than 723 divsteps are needed. Still not enough to reduce our number of big step iterations in either 32 or 64 bit, but perhaps there are other gains?
EDIT: if the input is restricted to 0..MODULUS/2, 721 is enough. So close to only needing 24 iterations in 32 bit mode.
I've pushed some changes that should address this issue of the size constraints on D, E. See comments in _update_de method(s). I've only done 64bit, but the pattern should be clear. This latest commit also does the final normalization within the 62-bit format.
@sipa Just so we don't duplicate efforts: I am working on porting this into your new PR.
@peterdettman Cool, thanks! I was planning on doing so myself soon, but hadn't started. |
Closing in favor of #831. |
24ad04f Make scalar_inverse{,_var} benchmark scale with SECP256K1_BENCH_ITERS (Pieter Wuille)
ebc1af7 Optimization: track f,g limb count and pass to new variable-time update_fg_var (Peter Dettman)
b306935 Optimization: use formulas instead of lookup tables for cancelling g bits (Peter Dettman)
9164a1b Optimization: special-case zero modulus limbs in modinv64 (Pieter Wuille)
1f233b3 Remove num/gmp support (Pieter Wuille)
20448b8 Remove unused Jacobi symbol support (Pieter Wuille)
5437e7b Remove unused scalar_sqr (Pieter Wuille)
aa9cc52 Improve field/scalar inverse tests (Pieter Wuille)
1e0e885 Make field/scalar code use the new modinv modules for inverses (Pieter Wuille)
436281a Move secp256k1_fe_inverse{_var} to per-impl files (Pieter Wuille)
aa404d5 Move secp256k1_scalar_{inverse{_var},is_even} to per-impl files (Pieter Wuille)
08d5496 Improve bounds checks in modinv modules (Pieter Wuille)
151aac0 Add tests for modinv modules (Pieter Wuille)
d8a92fc Add extensive comments on the safegcd algorithm and implementation (Pieter Wuille)
8e415ac Add safegcd based modular inverse modules (Peter Dettman)
de0a643 Add secp256k1_ctz{32,64}_var functions (Pieter Wuille)

Pull request description:

This is a rebased and squashed version of #767, adding safegcd-based implementations of constant-time and variable-time modular inverses for scalars and field elements, by Peter Dettman. The PR is organized as follows:

* **Add secp256k1_ctz{32,64}_var functions** Introduction of ctz functions to util.h (which use `__builtin_ctz` on recent GCC and Clang, but fall back to a software emulation using de Bruijn sequences on other platforms). This isn't used anywhere in this commit, but does include tests.
* **Add safegcd based modular inverse modules** Add Peter Dettman's safegcd code from #767 (without some of his optimizations, which are moved to later commits), turned into separate modules by me.
* **Add extensive comments on the safegcd algorithm and implementation** Add a long description of the algorithm and optimizations to `doc/safegcd_implementation.md`, as well as additional comments to the code itself. It is probably best to review this together with the previous commit (they're separated to keep authorship).
* **Add tests for modinv modules** Adds tests on the modinv interface directly, for arbitrary moduli.
* **Improve bounds checks in modinv modules** Adds a lot of sanity checking to the modinv modules.
* **Move secp256k1_scalar_{inverse{_var},is_even} to per-impl files** A pure refactor to prepare for switching the field and scalar code to modinv.
* **Make field/scalar code use the new modinv modules for inverses** Actually switch over.
* **Add extra modular inverse tests** This adds modular inverse tests through the field/scalar interface, now that those use modinv.
* **Remove unused Jacobi symbol support** No longer needed.
* **Remove num/gmp support** Bye-bye.
* 3 commits with further optimizations.

ACKs for top commit:
gmaxwell: ACK 24ad04f
sanket1729: ACK 24ad04f
real-or-random: ACK 24ad04f careful code review, some testing

Tree-SHA512: 732fe29315965e43ec9a10ee8c71eceeb983c43fe443da9dc5380a5a11b5e40b06e98d6abf67b773b1de74571fd2014973c6376f3a0caeac85e0cf163ba2144b
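The `secp256k1_ctz{32,64}_var` fallback mentioned in the commit list uses a de Bruijn sequence. As a rough illustration of that technique (this is a sketch using a standard 32-bit de Bruijn constant, not necessarily the exact constants in util.h):

```python
# Count trailing zeros of a nonzero 32-bit integer via a de Bruijn sequence.
# 0x077CB531 is a standard de Bruijn multiplier; the constants actually used
# in util.h may differ.
DEBRUIJN32 = 0x077CB531

# Precompute the 32-entry lookup table: shifting the de Bruijn constant left
# by i places a unique 5-bit pattern in the top bits.
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN32 << i) & 0xFFFFFFFF) >> 27] = i

def ctz32_var(x):
    assert x != 0
    # x & -x isolates the lowest set bit (a power of two), so the multiply
    # is effectively a left shift of the de Bruijn constant.
    return TABLE[(((x & -x) * DEBRUIJN32) & 0xFFFFFFFF) >> 27]
```

The same construction works for 64 bits with a 64-bit de Bruijn constant and a 64-entry table.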
…ster

5c7ee1b libsecp256k1 no longer has --with-bignum= configure option (Pieter Wuille)
bdca9bc Squashed 'src/secp256k1/' changes from 3967d96bf1..efad3506a8 (Pieter Wuille)
cabb566 Disable certain false positive warnings for libsecp256k1 msvc build (Pieter Wuille)

Pull request description:

This updates our src/secp256k1 subtree to the latest upstream master. The changes include:

* The introduction of safegcd-based modular inverses, reducing ECDSA signing time by 25%-30% and ECDSA verification time by 15%-17%.
  * [Original paper](https://gcd.cr.yp.to/papers.html) by Daniel J. Bernstein and Bo-Yin Yang
  * [Implementation](bitcoin-core/secp256k1#767) by Peter Dettman; [final](bitcoin-core/secp256k1#831) version
  * [Explanation](https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md) of the algorithm using Python snippets
  * [Analysis](https://github.com/sipa/safegcd-bounds) of the maximum number of iterations the algorithm needs
  * [Formal proof in Coq](https://medium.com/blockstream/a-formal-proof-of-safegcd-bounds-695e1735a348) by Russell O'Connor, for a high-level equivalent algorithm
* Removal of libgmp as an (optional) dependency (which wasn't used in the Bitcoin Core build)
* CI changes (Travis -> Cirrus)
* Build system improvements

ACKs for top commit:
laanwj: Tested ACK 5c7ee1b

Tree-SHA512: ad8ac3746264d279556a4aa7efdde3733e114fdba8856dd53218588521f04d83950366f5c1ea8fd56329b4c7fe08eedf8e206f8f26dbe3f0f81852e138655431
Implements constant-time field and scalar inversion using a Euclidean-style algorithm that operates on the least-significant bits. It follows the curve25519 case study outlined in section 12 of the paper above, with some small tweaks.
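The core iteration ("divstep") can be sketched in Python, in the style later used by doc/safegcd_implementation.md. This shows only the mathematical idea, with none of the constant-time or limb-level machinery, and assumes an odd modulus M with gcd(M, x) = 1:

```python
def div2(M, x):
    """Halve x modulo the odd modulus M: if x is odd, adding odd M makes it
    even, so the division by 2 is exact."""
    if x & 1:
        x += M
    return x // 2

def modinv(M, x):
    """Modular inverse of x mod M via divsteps (sketch; assumes M odd and
    gcd(M, x) == 1)."""
    delta, f, g, d, e = 1, M, x % M, 0, 1
    # Invariants maintained throughout: d*x == f (mod M), e*x == g (mod M).
    while g != 0:
        if delta > 0 and g & 1:
            delta, f, g, d, e = 1 - delta, g, (g - f) // 2, e, div2(M, e - d)
        elif g & 1:
            delta, f, g, d, e = 1 + delta, f, (g + f) // 2, d, div2(M, e + d)
        else:
            delta, f, g, d, e = 1 + delta, f, g // 2, d, div2(M, e)
    # On termination f == +/-gcd(M, x) == +/-1, and d*x == f (mod M),
    # so d*f is the inverse of x.
    return (d * f) % M
```

Each step inspects only the low bit of g and the sign of delta, which is what makes the grouped, constant-time limb implementation possible.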
I do not have the actual curve25519 code to compare to, but this PR appears to be already 15-20% faster than their reported 10050 Haswell cycles for field inversion. (Edit: this initial measurement was in error, but we did reach 5-10% faster in the end, without any asm.)

Performance comparison (gcc 10.1.0, -O3, haswell, endo=yes, asm=no, field=64bit scalar=64bit):
master:
scalar_inverse: min 12.4us
field_inverse: min 5.04us
ecdh: min 59.7us
ecdsa_sign: min 47.9us
safegcd_inv:
scalar_inverse: min 3.45us
field_inverse: min 3.23us
ecdh: min 56.2us
ecdsa_sign: min 37.7us
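For context, the constant-time inversion being replaced on master is Fermat-based, i.e. it computes x^(p-2) mod p. A quick Python sanity check of that baseline for the secp256k1 field prime (no secp256k1 code involved, just the arithmetic):

```python
# secp256k1 field prime.
P = 2**256 - 2**32 - 977

def inv_fermat(x, p=P):
    # Fermat's little theorem: for prime p and x not divisible by p,
    # x**(p-2) mod p is the modular inverse of x.
    return pow(x, p - 2, p)
```

The roughly 255 squarings plus multiplications this exponentiation costs are what safegcd's ~10 batched-divstep iterations undercut.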
I've not done a 32-bit version yet, although it is mostly analogous and should show an even greater relative advantage over Fermat inversion. The field and scalar implementations share substantial common code, which is duplicated here for simplicity.
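The `_divsteps_62` routine discussed above batches the work: it runs 62 divsteps while accumulating a 2x2 transition matrix (scaled by 2^62), which is then applied once to the full-width variables. A Python sketch of the idea (not the limb-level code; `divsteps_62` and `modinv` here are illustrative names):

```python
def divsteps_62(delta, f, g):
    """Run 62 divsteps, returning the new delta and the transition matrix
    [[u, v], [q, r]] scaled by 2**62, so that u*f + v*g == f_new << 62 and
    q*f + r*g == g_new << 62. The branch decisions depend only on delta and
    the low bits of f and g, which is why 64-bit limbs suffice in practice."""
    u, v, q, r = 1, 0, 0, 1  # start from the (scaled) identity matrix
    for _ in range(62):
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = 1 - delta, g, (g - f) // 2, 2*q, 2*r, q - u, r - v
        elif g & 1:
            delta, f, g, u, v, q, r = 1 + delta, f, (g + f) // 2, 2*u, 2*v, q + u, r + v
        else:
            delta, f, g, u, v, q, r = 1 + delta, f, g // 2, 2*u, 2*v, q, r
    return delta, (u, v, q, r)

def modinv(M, x):
    """Modular inverse via batched divsteps; assumes M odd, gcd(M, x) == 1."""
    i62 = pow(2, -62, M)  # inverse of 2**62 mod M (Python 3.8+ pow)
    delta, f, g, d, e = 1, M, x % M, 0, 1
    while g != 0:
        delta, (u, v, q, r) = divsteps_62(delta, f, g)
        # Apply the matrix to the full-width f, g (both divisions are exact).
        f, g = (u*f + v*g) >> 62, (q*f + r*g) >> 62
        # Apply the same matrix to d, e modulo M, cancelling the 2**62 scale.
        d, e = (u*d + v*e) * i62 % M, (q*d + r*e) * i62 % M
    return (d * f) % M
```

The production code additionally keeps d, e in signed-limb form and avoids the generic modular reductions, but the matrix-batching structure is the same.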