Faster const-time normalization #1028
base: master
Conversation
This is a really nice idea! I just had one inline question.

The updated versions of fe_normalize and fe_normalizes_to_zero indeed look faster, and a local benchmark run seems to agree:
Benchmarks before

Benchmark        , Min(us) , Avg(us) , Max(us)
field_normalize  , 0.00968 , 0.00973 , 0.00983
group_add_affine , 0.189   , 0.196   , 0.205
ecdsa_sign       , 22.3    , 22.7    , 23.9

Benchmarks after

Benchmark        , Min(us) , Avg(us) , Max(us)
field_normalize  , 0.00752 , 0.00756 , 0.00763
group_add_affine , 0.189   , 0.189   , 0.191
ecdsa_sign       , 22.3    , 22.4    , 22.5
I also wrote down some notes as I was trying to understand the change, and I'll post them here in case they're useful for future reference.
Why it's not possible for the result to be in [P, 2^256)
If there's no final carry bit, then the second pass will subtract 2^256 - P from a value that's at most 2^256 - 1, resulting in a value that's at most P - 1.
If there was a carry, the largest possible value (not including the carry bit) is 65536 * 0x1000003D1 - 1 for 5x52 and 1024 * 0x1000003D1 - 1 for 10x26, both of which are well below P.
(If there was a carry bit, clearing it is equivalent (mod P) to decrementing by 2^256 - P, which cancels out the increment before the first pass.)
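To put those bounds in perspective, here is a tiny standalone check I wrote (my own sketch, not part of the PR): both post-carry maxima fit in fewer than 50 bits, so they are nowhere near P ~ 2^256.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Largest value the limbs can hold after a final carry, per the notes above. */
    uint64_t max_5x52  = 65536ULL * 0x1000003D1ULL - 1; /* = 0x1000003D0FFFF, a 49-bit number */
    uint64_t max_10x26 = 1024ULL  * 0x1000003D1ULL - 1; /* = 0x400000F43FF, a 43-bit number */
    printf("5x52 post-carry max:  0x%" PRIx64 "\n", max_5x52);
    printf("10x26 post-carry max: 0x%" PRIx64 "\n", max_10x26);
    return 0;
}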
Where the 63-bit shift (in the 5x52 version) came from
Bit 63 (the top bit of the 64-bit word, which is what t >> 63 extracts) is set if and only if the subtraction underflowed: 2^256 - P is only a 33-bit value, so a subtraction can never underflow so far that bit 63 ends up clear afterwards.
Thus subtracting t >> 63 is the same logic as "if subtraction on the previous limb caused an underflow then subtract 1, otherwise subtract 0".
If we store a 52-bit value in a 64-bit word, underflowing it and then keeping the bottom 52 bits is equivalent to underflowing the same value stored in a 52-bit word.
(An analogous explanation works for where the 31-bit shift came from in the 10x26 case.)
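To illustrate the borrow logic in isolation, here is a small toy program (my own sketch, not the PR's code) that runs the >> 63 trick on two 52-bit limbs stored in 64-bit words:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t t0 = 0x0000000000005ULL;   /* low limb (52 bits), deliberately tiny */
    uint64_t t1 = 0x8000000000000ULL;   /* next limb (52 bits) */
    uint64_t borrow;

    /* Subtract more than t0 holds, so the 64-bit word wraps around. */
    t0 -= 0x1000003D1ULL;
    borrow = t0 >> 63;                  /* 1 exactly when the subtraction underflowed */
    t0 &= 0xFFFFFFFFFFFFFULL;           /* keep the bottom 52 bits */

    /* Pass the borrow on to the next limb, which is what the second pass does. */
    t1 -= borrow;
    t1 &= 0xFFFFFFFFFFFFFULL;

    printf("borrow = %llu, t1 = 0x%013llx, t0 = 0x%013llx\n",
           (unsigned long long)borrow, (unsigned long long)t1, (unsigned long long)t0);
    return 0;
}

The mask back to 52 bits is exactly why a wrap in the 64-bit word behaves like a wrap in a 52-bit word, which is the point of the last note above.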
/* Reduce t4 at the start so there will be at most a single carry from the first pass.
 * x is incremented before the first pass and then decremented before the second pass
 * to ensure that the result doesn't fall into the range [P, 2^256). */
uint64_t x = (t4 >> 48) + 1; t4 &= 0x0FFFFFFFFFFFFULL;

/* The first pass ensures the magnitude is 1, ... */
t0 += x * 0x1000003D1ULL;
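/* (Not part of the quoted diff: for context, a sketch of how the first pass
 * presumably continues, following the existing 5x52 normalization code:
 * carries are propagated limb by limb and each limb is masked back to 52 bits.) */
t1 += (t0 >> 52); t0 &= 0xFFFFFFFFFFFFFULL;
t2 += (t1 >> 52); t1 &= 0xFFFFFFFFFFFFFULL;
t3 += (t2 >> 52); t2 &= 0xFFFFFFFFFFFFFULL;
t4 += (t3 >> 52); t3 &= 0xFFFFFFFFFFFFFULL;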
Given that x is bigger than before, how do we know that t0 += x * 0x1000003D1ULL won't overflow the uint64_t?
Most, if not all, callers use field elements with a maximum magnitude of 8, so there's plenty of room. The practical limit is actually around 31 (because field_10x26 can only represent magnitude 32 and several methods assume there's a free bit available for carries on entry). I think it would be a good idea to actually enforce some exact magnitude limit on input to field methods though (probably 16).
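A rough way to see the headroom (my own back-of-the-envelope sketch, with an assumed limb bound): t4 is a 64-bit word, so x = (t4 >> 48) + 1 can never exceed 2^16, which caps the addend x * 0x1000003D1 below 2^49; a magnitude-8 element keeps t0 around 2^56 or less, so the sum stays far from 2^64.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x_max      = 1ULL << 16;             /* absolute cap: t4 >> 48 <= 2^16 - 1, so x <= 2^16 */
    uint64_t addend_max = x_max * 0x1000003D1ULL; /* worst-case x * 0x1000003D1, below 2^49 */
    uint64_t t0_assumed = 8ULL << 53;             /* assumed magnitude-8 bound on t0 (about 2^56) */
    printf("worst-case addend: 0x%" PRIx64 "\n", addend_max);
    printf("assumed t0 bound:  0x%" PRIx64 ", sum: 0x%" PRIx64 " (well below 2^64)\n",
           t0_assumed, t0_assumed + addend_max);
    return 0;
}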
ACK 4bbb661

My only concern from before was magnitude / overflow. I agree it'd be a good idea to enforce some exact magnitude limit and/or more carefully audit maximum possible magnitudes, but I don't think that should block this PR, especially since I also tried running the tests with VERIFY_CHECK(r->magnitude <= 16) at the top of fe_normalize and they passed.
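For reference, a check along those lines would presumably sit at the top of the normalization routine and only be meaningful in VERIFY builds, where the fe struct carries its magnitude; a rough sketch of the placement (hypothetical, not code from this PR):

static void secp256k1_fe_normalize(secp256k1_fe *r) {
#ifdef VERIFY
    /* Hypothetical magnitude cap, as discussed above; the magnitude field
     * only exists in VERIFY builds. */
    VERIFY_CHECK(r->magnitude <= 16);
#endif
    /* ... existing normalization code ... */
}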
Use a "borrowing" trick in _fe_normalize to simplify the handling of values in [P, 2^256). This is significantly faster too (according to bench_internal), though performance of _fe_normalize isn't that important.
Applying a similar idea to _fe_normalizes_to_zero also results in a performance boost, and in this case it appears to boost signing speed by 2% or more (also ECDH), due to three calls in _gej_add_ge.