Try a non-uniform group law (e.g., for ecmult_gen)? #1051
ISTR trying "a better" group law from EFD, many years ago, which turned out to be slower because of normalization even though it technically required fewer ops. May be worth revisiting this after the recent Dettman normalization-bounds improvements, though?
Maybe what you're remembering is https://github.com/bitcoin-core/secp256k1/blob/master/src/group_impl.h#L274-L280 ? But this is for […]
Doubtless it will be faster. The most obvious use-case to me is for precomputed tables of small multiples. Back in the long-long-ago I PR'd some CoZ formulae for precomputed tables (#41 - LOL at "That's the GMP inversion[...]; it will be hard to match it's performance.") and I will be looking to resurrect that as a private implementation for a separated-out "precomp_table" (or some such) module.

However @sipa also noted there "I don't like relying on relatively complex reasoning to show that a code path shouldn't be reachable, even if it just contains an assert", and that was in relation to this issue of assuming P != +/-Q. #1032 is an attempt to begin putting in place some VERIFY metadata for group elements. I think proposals that rely on, shall we say, "non-local reasoning" for correctness would ideally be accompanied by a plan for using VERIFY metadata to validate that reasoning.

One example is "small multiples reasoning". The metadata could hold e.g. an actual ge_storage value of the original point (or some other unique identifier for the overall "operation"), and an int indicating the current multiple (or e.g. 0 if it's not participating in such a scheme). Then any methods operating on these points can have suitable checks.

I think another example of the kind of reasoning people sometimes use is "odd OR even multiple of P, not exceeding half the order", and then non-uniform additions of "odd" points might be used for a ladder as long as there is an intervening double between them, with maybe one or two uniform additions at the end when "half the order" becomes a problem. That last example is one where you'd need to be very careful in the presence of the endomorphism, but we could probably just say "Beware the endomorphism" in general here. Also notably some kinds of blinding generally don't play well with non-uniform formulae.
@apoelstra I seem to recall you like endomorphism math :), so a holiday puzzle for you in relation to the above: If I'm enumerating some non-infinity point on secp256k1, 1P, 2P, 3P..., at what point (haha) nP do I need to start worrying that I may run into nP == +/- lambda(mP) where 1 <= m <= n?
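The puzzle's premise can be checked empirically with a small standalone Python sketch (not library code; the curve, beta and lambda constants are the standard secp256k1 ones): first confirm that the endomorphism (x, y) -> (beta*x, y) really is multiplication by lambda, then brute-force small n, m for x-coordinate collisions.

```python
# Exploratory sketch (plain Python, not library code): check that the secp256k1
# endomorphism phi(x, y) = (beta*x, y) agrees with scalar multiplication by
# lambda, then brute-force the "holiday puzzle": for small n, is nP ever equal
# to +/- lambda(mP) with 1 <= m <= n?

p = 2**256 - 2**32 - 977                      # field order
Gx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
Gy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
beta = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee
lam  = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72

def add(P, Q):
    """Affine point addition (None = infinity); handles doubling/cancellation."""
    if P is None: return Q
    if Q is None: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0: return None
    if P == Q:
        s = 3 * P[0] * P[0] * pow(2 * P[1], -1, p) % p
    else:
        s = (Q[1] - P[1]) * pow(Q[0] - P[0], -1, p) % p
    x = (s * s - P[0] - Q[0]) % p
    return (x, (s * (P[0] - x) - P[1]) % p)

def mul(k, P):
    """Simple double-and-add scalar multiplication."""
    R = None
    while k:
        if k & 1: R = add(R, P)
        P, k = add(P, P), k >> 1
    return R

G = (Gx, Gy)
# Sanity: the endomorphism really is multiplication by lambda.
assert mul(lam, G) == (beta * Gx % p, Gy)

# Scan: nG vs +/- lambda(mG) for small n, m; nP == +/- lambda(mP) iff the
# x-coordinate of nP equals beta * x(mP). Since lambda*m mod the order is a
# ~256-bit scalar for any small m, no collision should appear at this scale.
multiples = [None]
for n in range(1, 41):
    multiples.append(add(multiples[-1], G))
collisions = [(n, m) for n in range(1, 41) for m in range(1, n + 1)
              if multiples[n][0] == beta * multiples[m][0] % p]
print(collisions)   # -> []
```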
I spent some time thinking about using non-uniform addition (which I assume means a rule that cannot handle doubling or cancellation correctly) in combination with signed-digit multicomb, because there seems to be a natural overlap: in every step SDMC adds a nonzero point, with a multiple of G whose scalar has new bits set. Turns out, it is not that simple (I simulated it in Python for a small group), but it is close. I suspect that for almost all SDMC configurations, only the very last point addition can result in doubling or cancelling. I think this can be proven, but doing so generically for any configuration may be nontrivial. Perhaps we'll need some whitelisted precomputed configurations for which this is proven to hold.

A downside of this approach is that it cannot start with a blinded starting point, as obviously nothing can be guaranteed about intermediary results when starting from a fully random point. So if blinding is done on the scalar, it should be counteracted by an additional (uniform) point addition at the very end. That's not actually an extra cost, if the starting point can be made infinity (so the first table lookup is just a fetch rather than an add).
@sipa Generally the signed-digit representation extends past the order (e.g. 260 bits). You could have a rule that only the highest block can cross that threshold, then special-case the additions for the top block? Edit: 260, not 250!
Hmm, not that simple either probably, since it's top to bottom.
@peterdettman I don't think it's that simple. At every point, your intermediary result can be written as a sum of signed powers of two, sum(d_i * 2^i) with d_i in {-1, 1}. Even if you were to do a reduction mod N on the sign-recoded scalar (guaranteeing only 256 d values), you may get collisions between the existing pattern and what is being added, because the 0s in the pattern count as 1/2 in the signed recoding, and (1/2 mod N) isn't a perfectly nice number.
@sipa Recall that if you just shift the SD bits back "up" 1, then they're all now representing 0 or 1 again, and each table entry is handling strictly distinct bits, so problems should only happen once the high bits start getting added in, no?
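The shift trick can be illustrated with a toy recoding (a plain B-bit signed-digit recoding, which is a simplification of the full blocks/teeth/spacing comb): mapping each digit b_i in {-1, +1} to (b_i + 1)/2 in {0, 1} recovers the ordinary binary digits of t = (d + 2^B - 1)/2, so each position still carries one distinct bit.

```python
# Toy illustration (assumes a plain B-bit signed-digit recoding; the actual
# comb layers blocks/teeth/spacing on top of this). An odd scalar d is written
# as sum(b_i * 2^i) with b_i in {-1, +1}; shifting every digit "up" by 1,
# i.e. mapping b_i -> (b_i + 1) / 2, recovers the ordinary 0/1 bits of
# t = (d + 2^B - 1) / 2.

B = 8

def sd_recode(d):
    """Return b[0..B-1], each in {-1, +1}, with sum(b[i]*2^i) == d (d odd)."""
    assert d % 2 == 1 and 0 < d < 2**B
    t = (d + (2**B - 1)) // 2
    return [1 if (t >> i) & 1 else -1 for i in range(B)]

for d in range(1, 2**B, 2):
    b = sd_recode(d)
    assert sum(b[i] << i for i in range(B)) == d          # digits represent d
    t = (d + (2**B - 1)) // 2
    # shifting digits "up" by 1 gives exactly the binary digits of t:
    assert [(x + 1) // 2 for x in b] == [(t >> i) & 1 for i in range(B)]
print("ok")
```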
@peterdettman I'm not convinced, because the pattern of zero/nonzero bits differs between the intermediary value and the table entry being added, and the difference between these patterns translates to less nice patterns when "projected" onto pure 1,-1 values. I think it is indeed the case that you don't hit collisions until you've added almost all bits, but I believe this is mostly due to information-theoretical reasons (the table entries are drawn from a small set, the intermediary values are a small fraction of the total space until the very end, collisions are just very rare).
@sipa In particular, if COMB_SPACING is larger than the excess bits of the SD recoding, then each top-level pass in _ecmult_gen (i.e. the loop over the blocks) covers a less-than-256-bit range and could be accumulated individually with non-uniform addition, and then added (uniform) to the overall accumulator, just before the doubling for each pass. Not saying it's ideal, but that should be correct, ignoring the blinding etc.
I conjecture that the a1,b1 and a2,b2 constants from Lines 86 to 89 in 423b6d1 are the smallest values satisfying a1 = -lambda*b1 (mod n), for some definition of smallest. And that you will get almost identical values for lambda and +/-lambda^2.
My handwavy argument is that if smaller values existed, we would be using them instead.
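The conjectured relation is easy to check numerically. The constants below are transcribed from the library's lambda-splitting code (re-verify them against scalar_impl.h before relying on this): lambda is a nontrivial cube root of unity mod the group order n, and each basis vector (ai, bi) satisfies ai + bi*lambda == 0 (mod n), i.e. ai = -lambda*bi as conjectured.

```python
# Quick check of the conjectured relation (constants as transcribed from
# libsecp256k1's lambda-splitting code; verify against scalar_impl.h before
# relying on them). lambda is a nontrivial cube root of unity mod the group
# order n, and each basis vector (ai, bi) satisfies ai + bi*lambda == 0 mod n.

n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141
lam = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72
a1 = 0x3086d221a7d46bcde86c90e49284eb15
b1 = -0xe4437ed6010e88286f547fa90abfe4c3
a2 = 0x114ca50f7a8e2f3f657c1108d9d44cfd8
b2 = 0x3086d221a7d46bcde86c90e49284eb15

assert pow(lam, 3, n) == 1 and lam != 1        # lambda^3 == 1, lambda != 1
assert (lam * lam + lam + 1) % n == 0          # root of x^2 + x + 1 mod n
assert (a1 + b1 * lam) % n == 0                # a1 == -lambda*b1 (mod n)
assert (a2 + b2 * lam) % n == 0                # a2 == -lambda*b2 (mod n)
# Structural identities: a2 = a1 - b1 and b2 = a1, which put the second
# vector in the same lattice automatically (using lambda^2 + lambda + 1 == 0).
assert a2 == a1 - b1 and b2 == a1
# Both vectors are "short": roughly sqrt(n)-sized (n is ~2^256).
assert max(abs(a1), abs(b1), abs(a2), abs(b2)) < 2**129
print("all relations hold")
```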
@peterdettman I wrote a simulator for the existing code that can determine for every addition in the SDMC algorithm which ones risk triggering doubling/cancellation. For all the choices on the efficiency frontier from #693 (comment), only very small ones (1 block, 1-2 teeth) have more than 1 addition that needs to deal with doubling/cancelling. Assuming my code is right, of course. Choices with teeth=1 or spacing=1 seem often bad in this regard (e.g. 32 blocks of 8 teeth has 16 possibly-doubling additions), but excluding those two, no configuration has more than 2 possibly-doubling ones.
Okay, sometimes it helps to read the literature.
Renes, Costello and Batina (EUROCRYPT 2016) provide a complete, exception-free addition law with 11M + 0S (Algorithm 8 and page 12). It may still be the case that add-2007-bl is faster, but it's not clear. This algorithm is also highly relevant to #1033. And it's even more relevant to verification speed. See the discussion on gej_add_var on page 4... We can save 4S, which means we go from 12M+4S to 12M+0S (!!!). @sipa Didn't they contact you back then?
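For a concrete feel, here is the a=0 complete addition law in homogeneous coordinates as I read it from the paper, transcribed into plain Python (verify against the published algorithm before relying on it, and note it is written for clarity, not to match the paper's operation count or schedule), cross-checked against naive affine arithmetic on secp256k1:

```python
# Sketch of the Renes-Costello-Batina complete addition law for y^2 = x^3 + b
# in homogeneous coordinates (X:Y:Z), as transcribed from the paper (verify
# before use). One branch-free formula handles P+Q, P+P and P+(-P) on a
# prime-order curve; infinity is (0:1:0).

p = 2**256 - 2**32 - 977
b3 = 3 * 7                                        # 3*b for secp256k1 (b = 7)

def rcb_add(P, Q):
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    t0 = X1 * X2 % p
    t1 = Y1 * Y2 % p
    t2 = Z1 * Z2 % p
    t3 = ((X1 + Y1) * (X2 + Y2) - t0 - t1) % p    # X1*Y2 + X2*Y1
    t4 = ((Y1 + Z1) * (Y2 + Z2) - t1 - t2) % p    # Y1*Z2 + Y2*Z1
    t5 = ((X1 + Z1) * (X2 + Z2) - t0 - t2) % p    # X1*Z2 + X2*Z1
    s = (t1 - b3 * t2) % p                        # Y1*Y2 - 3b*Z1*Z2
    u = (t1 + b3 * t2) % p                        # Y1*Y2 + 3b*Z1*Z2
    X3 = (t3 * s - b3 * t4 * t5) % p
    Y3 = (u * s + 3 * b3 * t0 * t5) % p           # 9b = 3*b3
    Z3 = (u * t4 + 3 * t0 * t3) % p
    return (X3, Y3, Z3)

def to_affine(P):
    X, Y, Z = P
    if Z == 0:
        return None                               # point at infinity
    zi = pow(Z, -1, p)
    return (X * zi % p, Y * zi % p)

def affine_add(P, Q):
    """Naive affine chord-and-tangent addition, for cross-checking."""
    if P is None: return Q
    if Q is None: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0:
        return None
    if P == Q:
        s = 3 * P[0] * P[0] * pow(2 * P[1], -1, p) % p
    else:
        s = (Q[1] - P[1]) * pow(Q[0] - P[0], -1, p) % p
    x = (s * s - P[0] - Q[0]) % p
    return (x, (s * (P[0] - x) - P[1]) % p)

G = (0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798,
     0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8)
G1 = (G[0], G[1], 1)

# One formula handles distinct points, doubling, and cancellation:
assert to_affine(rcb_add(G1, rcb_add(G1, G1))) == affine_add(affine_add(G, G), G)
assert to_affine(rcb_add(G1, G1)) == affine_add(G, G)
assert to_affine(rcb_add(G1, (G[0], (-G[1]) % p, 1))) is None
print("RCB law matches affine arithmetic")
```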
@real-or-random Woah! No, that's news to me. Just skimmed the paper, and it seems it is using homogeneous coordinates, not Jacobian ones? That may require rewriting all of the group laws...
@real-or-random I'm sure I've seen this paper before, and I would have thought it was already discussed (perhaps @gmaxwell brought it to our attention?), but am not sure. Anyway, some notes to curb the enthusiasm a little: […]
Haven't (re-)read in detail yet, so it's quite possible there's still gold there that I haven't noticed. I think it would still be informative to put together a branch where we add secp256k1_geh and try to use it in the […]
Oh indeed, I missed that they use homogeneous coordinates (I just saw the Z coord and assumed Jacobian). (I had this feeling that this is too good to be true but I didn't see the issue. :D) And yes, sure, when it comes to variable-time ecmults, it's relevant only to Pippenger's algorithm. But in this case it could be beneficial if we compute the doublings in Jacobian coordinates (a bunch of doublings) and then do the main part in homogeneous coordinates (additions). See also Remark 3 (p. 16) in the paper. (edit: Hmmm, maybe we'd need to keep the buckets in homogeneous coordinates too.... It's not clear right now to me, we will need to try.)
Right, but we use the tricks only in Strauss' algorithm. But yeah, this is also bad for the exhaustive tests, I guess.
It sounds promising. I'm not quite clear on how the 32 blocks/8 teeth thing has so many danger points, but my mind is kind of on other stuff at the moment. Do we see any way that the endomorphism can be useful in signing? I see at least that you could split the scalar and use half the blocks e.g. 2B * 5T * 13S to cover 130 bits (unfortunately this probably needs skew-handling like ecdh). This gains nothing if you need 2 separate sets of tables anyway, but it halves memory usage if you construct the endo-points on-the-fly instead of the second set of tables (in this 2/5/13 config that's an extra 26 field muls). Probably that doesn't gain much given that you can probably just tweak the SDMC config to use less memory at a similar performance cost anyway. I guess if a given config has lots of doublings (big spacing), then half-ish of them could be saved instead. Anything else?
@peterdettman I'm sure I've had discussions about this, but I can't find it on the issue tracker. Yes, I think using the endomorphism may give another trade-off dimension to ecmult_gen. As you say, it won't speed things up in absolute terms, but you could roughly halve the table size, split the computation in two, and do lambda-multiplies on table fetches in the second part.
Fwiw, I sketched a modified variant of Pippenger's algorithm that uses homogeneous coordinates in the main loop (without actually implementing the group law): I think it's promising... |
@real-or-random Looks good, yes. I think the final few lines are a bit off though: buckets[0] is missing a conversion, and also running_sum, or maybe it's as easy to add buckets[0] after running_sum is converted.
https://github.com/sipa/secp256k1/commits/202112_combine_ecmult_gen (=master + #1058 + #1056 + #1033 + #1028 + incomplete addition formula): 12.3 us per schnorrsig (master is 17.9 us).
Which law did you try? The one I mentioned in the initial comment needs 7M+4S but the one you implemented in your branch needs 8M+3S. |
I just copied our vartime code, without the var parts. It's easy to plug in something else for testing.
The 7M + 4S one requires calculating 2.Z1.H as (Z1+H)^2 - Z1^2 - H^2, which is a well-known trick, but it hasn't tended to speed things up for us (especially if we only want Z1.H because of halving). For me S ~ 0.8M, and the alternative formula requires extra additions, sometimes an extra negation, sometimes a normalization, sometimes extra locals or longer lifetimes etc. However, if it even breaks even for 64-bit, it could easily be a speedup for 32-bit, where linear operations are significantly cheaper in relation to mul/sqr.
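The identity behind that trade is easy to sanity-check numerically: one field mul is replaced by one sqr plus additions, using squares (Z1^2, H^2) the mixed-addition formula already computes.

```python
# Numeric sanity check of the identity behind the trade mentioned above:
# (Z1+H)^2 - Z1^2 - H^2 == 2*Z1*H, i.e. one mul replaced by one sqr plus
# additions, reusing squares that the formula already has on hand.

import random

p = 2**256 - 2**32 - 977
for _ in range(100):
    z1, h = random.randrange(p), random.randrange(p)
    assert ((z1 + h) ** 2 - z1 * z1 - h * h) % p == 2 * z1 * h % p
print("(Z1+H)^2 - Z1^2 - H^2 == 2*Z1*H (mod p)")
```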
Questions like that may become more complicated with #967 too (as it introduces a new field approach with very different add/mul/normalize tradeoffs, and an independent magnitude-like dimension).
Renes–Costello–Batina seems like a great security improvement since those are exception-free / remove […]
WIP simulation code to determine which point additions inside the SDMC algorithm risk triggering doubling (P+P) or cancellation (P + -P): https://gist.github.com/sipa/868c39e29af9e8baf22845c8af2d316d. It would appear that for really all feasible (blocks,teeth,spacing) combinations, either only the very last addition, or the last two additions, risk running into doubling/cancellation.
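A stripped-down version of this kind of simulation (toy parameters and a plain left-to-right signed-digit ladder rather than the full blocks/teeth/spacing comb; this is not the linked gist): work in the additive group Z_N so "point" arithmetic is exact, and flag every addition step where the accumulator can equal the table entry (doubling), its negation (cancellation), or 0 (infinity input). Even in this simplified setting, only the last couple of additions are ever risky.

```python
# Minimal stand-in for the linked simulator (toy parameters, plain
# signed-digit ladder instead of the full comb): work in Z_N so arithmetic on
# multiples of G is exact, and record at which addition steps the accumulator
# can hit the table entry (doubling), its negation (cancellation), or 0
# (infinity input).

N = 29   # toy odd group order
B = 6    # recoding length; needs 2**B > 2*N so every odd scalar fits

def recode(d):
    """b[i] in {-1,+1} with sum(b[i]*2^i) == d, for odd 0 < d < 2^B."""
    assert d % 2 == 1 and 0 < d < 2**B
    t = (d + (2**B - 1)) // 2
    return [1 if (t >> i) & 1 else -1 for i in range(B)]

risky = set()
for d in range(1, N):
    dp = d if d % 2 == 1 else d + N       # force the scalar odd (N is odd)
    b = recode(dp)
    acc = b[B - 1]                        # first "addition" is a plain fetch
    for i in range(B - 2, -1, -1):
        acc = 2 * acc                     # doubling step: always safe
        if (acc - b[i]) % N == 0 or (acc + b[i]) % N == 0 or acc % N == 0:
            risky.add(i)                  # this addition could misbehave
        acc += b[i]
    assert acc % N == d                   # ladder computed d*G, as expected

print(sorted(risky))   # only the lowest (= last) addition steps: [0, 1]
```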
I believe that our current […] During the algorithm, the scalar […] And analogously for […] The algorithm for computing […]
It's easy to see that the result of both point additions in the loop above is of the form […] Up to […] To show that no non-zero […]
The group law in secp256k1_gej_add_ge is uniform, i.e., it works for P + Q no matter if P != Q or P = Q. (It still needs to handle infinity and has another special case.) It needs 7 fe mul and 5 fe sqr; the best non-uniform group law in EFD needs one fewer sqr (https://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-madd-2007-bl). Maybe it's faster also in practice. We could simply implement it and benchmark it.
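For reference, the linked madd-2007-bl (mixed Jacobian + affine addition, 7M + 4S) transcribed into plain Python for illustration and cross-checked against naive affine arithmetic; note the non-uniformity: H == 0 whenever the points are equal or opposite, so the caller must guarantee P != +/-Q.

```python
# EFD madd-2007-bl: mixed Jacobian + affine addition, 7M + 4S. Illustrative
# plain-Python transcription; it is non-uniform (H == 0 when the points are
# equal or opposite), so the caller must guarantee P != +/-Q.

p = 2**256 - 2**32 - 977

def madd_2007_bl(P, Q):
    """(X1:Y1:Z1) Jacobian + (x2, y2) affine -> Jacobian sum. Requires P != +/-Q."""
    X1, Y1, Z1 = P
    x2, y2 = Q
    Z1Z1 = Z1 * Z1 % p                     # 1S
    U2 = x2 * Z1Z1 % p                     # 1M
    S2 = y2 * Z1 % p * Z1Z1 % p            # 2M
    H = (U2 - X1) % p
    HH = H * H % p                         # 1S
    I = 4 * HH % p
    J = H * I % p                          # 1M
    r = 2 * (S2 - Y1) % p
    V = X1 * I % p                         # 1M
    X3 = (r * r - J - 2 * V) % p           # 1S
    Y3 = (r * (V - X3) - 2 * Y1 * J) % p   # 2M
    Z3 = ((Z1 + H) ** 2 - Z1Z1 - HH) % p   # 1S; the trick above: == 2*Z1*H
    return (X3, Y3, Z3)

def to_affine(P):
    X, Y, Z = P
    zi = pow(Z, -1, p)
    zi2 = zi * zi % p
    return (X * zi2 % p, Y * zi2 * zi % p)

def affine_add(P, Q):
    """Naive affine addition for cross-checking (handles equal points too)."""
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0:
        return None
    if P == Q:
        s = 3 * P[0] * P[0] * pow(2 * P[1], -1, p) % p
    else:
        s = (Q[1] - P[1]) * pow(Q[0] - P[0], -1, p) % p
    x = (s * s - P[0] - Q[0]) % p
    return (x, (s * (P[0] - x) - P[1]) % p)

G = (0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798,
     0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8)
G2 = affine_add(G, G)
G2j = (G2[0] * 4 % p, G2[1] * 8 % p, 2)    # 2G in Jacobian coords with Z = 2
assert to_affine(madd_2007_bl(G2j, G)) == affine_add(G2, G)   # 2G + G == 3G
print("madd-2007-bl matches affine arithmetic")
```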
Depending on the specifics of our ecmult_gen algorithm, it may be the case that this algorithm never adds P + Q with P = Q (for example, the ecmult algorithm within the MuSig-DN circuit has a similar property). Then, we could get away with the non-uniform law in ecmult_gen.
I think we shouldn't really think about this before #693 has been done.