-
Notifications
You must be signed in to change notification settings - Fork 572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add EC_Scalar and EC_AffinePoint types #4042
Conversation
03b58cd
to
bae2ce8
Compare
a70db6d
to
b09a762
Compare
4db015d
to
b594b6b
Compare
d5fa5d9
to
94c81d6
Compare
9d2a5c2
to
a66bee9
Compare
I will go through this in the coming days. Now with the underlying PR merged. 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first review pass. Looks really good! Especially the simplified algorithm implementations!
Left some nits and suggestions. There's one finding that might point to a missing test, though.
|
||
BufferStuffer stuffer(bytes); | ||
stuffer.append(y_is_odd ? 0x03 : 0x02); | ||
stuffer.append(std::span{m_xy}.last(fe_bytes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compressed encoding is oddness-header + x-value, no? Surprised that it didn't fail any tests.
stuffer.append(std::span{m_xy}.last(fe_bytes)); | |
stuffer.append(std::span{m_xy}.first(fe_bytes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also: #4042 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOL WTF that definitely should have caused test failures, need to look at this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK the reason this was missed is that currently this serialization logic in unused, because keys/etc are still using EC_Point
.
We're in for a somewhat awkward situation in the short term as we'll have 3 distinct elliptic curve point types - original flavor EC_Point
, this EC_AffinePoint
, plus the (internal only) pcurves AffinePoint
, with a lot of redundant/duplicated logic.
For 3.6.0 onward the situation improves
-
The various pubkey algorithms switch to storing their points as
EC_AffinePoint
. At that pointEC_Point
becomes unused except for testing; unfortunately we can't remove it completely until Botan 4. Maybe we'll be able to segregate it into a deprecated submodule ofec_group
though. -
EC_AffinePoint
gains a bridge to pcurve. This affords removing some redundancies between pcurves and what we actually need in practice. For example pcurves hasscalar_from_bits_with_trunc
but this is just modular reduction plus some shifts. It'll likely be simpler, once the bridge exists, to remove that and do the shifting in some higher point in the call stack so the logic is shared between the pcurves and BigInt based approaches.
src/lib/pubkey/rfc6979/rfc6979.cpp
Outdated
const BigInt& RFC6979_Nonce_Generator::nonce_for(const BigInt& m) { | ||
m.serialize_to(std::span{m_rng_in}.subspan(m_rlen)); | ||
std::vector<uint8_t> m_bytes(m_rlen); | ||
m.serialize_to(m_bytes); | ||
return this->nonce_for(m_bytes); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure its worth omitting the extra allocation, but you could serialize straight into m_rng_in
and then establish an internal method that just assumes m_rng_in
is properly set up. Similarly, for the std::span<>
overload, that could just copy_mem
into m_rng_in
and then also call this private method.
Like so:
m.serialize_to(std::span{m_rng_in}.subspan(m_rlen));
return this->generate_nonce();
While we're on it: it might be worthwhile to keep the span "std::span{m_rng_in}.subspan(m_rlen)
" as a member variable to avoid misuse (by malforming the subspan()
invocation) here and in the other overload
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I think the original version of this was written before we finalized on the write-to-span idiom for BigInt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually looked at this more and it doesn't seem worth bothering with.
There are two uses of RFC 6979 right now, ECDSA and DSA.
- ECDSA uses the bytes variant (or will, with this PR merged)
- DSA actually sets up a complete new
RFC6979_Nonce_Generator
object for each signature! Saving one extra allocation is the least of our worries 🙈
Any future users of RFC 6979 (eg ECGDSA or etc) will use the bytes interface, so this second overload of nonce_for
really is (will become) DSA specific at this point.
I think this can be cleaned up further - in particular in the long run I'd like to avoid the multiple conversions between BigInt
and EC_Scalar
implied by
const auto k = EC_Scalar::from_bigint(m_group, m_rfc6979->nonce_for(m_bytes: m.serialize()));
It's interesting that as code becomes faster, tiny optimizations become more meaningful. We're already under 250K cycles for ECDSA signatures for secp256r1 (for pcurves). At this point saving even a thousand cycles can be noticeable in the overall throughput.
I left a todo regarding this in #4027
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was somewhat surprised in #4143 that ECDSA once bridged was slower than our "demo" ECDSA directly using pcurves. I initially ascribed it to overhead of the abstractions but it turns out almost all of the overhead is RFC 6979. Almost 50K cycles on my laptop. I think, while implementing this in terms of HMAC_DRBG
was convenient, it's not very fast...
2450e74
to
5f46292
Compare
BTW I should mention in terms of reviewing, the overall structure behind the new types isn't that important since it's going to change quite a bit to handle the possibility of different backend implementations. That part is literally just the first thing that came to mind that works. Whereas |
08980de
to
eb4f1bd
Compare
OK with eb4f1bd in place, previous comment can be disregarded. I think this is more or less the "final" [*] approach for implementing the new scalar and point types. You can see in #4143 what the bridge looks like for pcurves. [*] Not ideal in any way, I'd like there to be fewer allocations, [**] Not technically true - we could get away with a third internal EC implementation - classic BigInt approach, pcurves, plus a third generic curve but bounded size impl. But it just doesn't seem worth it from a complexity perspective. I expect that in Botan 4, with the restrictions on curves in place, we'll transition away from BigInt entirely within EC. |
Deferring to 3.6 - it's better that all of this cook in master for a few months before we commit to anything wrt public API |
src/lib/pubkey/rfc6979/rfc6979.cpp
Outdated
uint8_t carry = 0; | ||
for(size_t i = 0; i != m_rng_out.size(); ++i) { | ||
const uint8_t w = m_rng_out[i]; | ||
m_rng_out[i] = (w >> shift) | carry; | ||
carry = w << (8 - shift); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all sure this is any worth it, performance wise. But we could do the shifting on 64-bit words (for instance) as long as possible (i'm assuming that m_rlen
isn't necessarily divisible by 8.
uint8_t carry = 0; | |
for(size_t i = 0; i != m_rng_out.size(); ++i) { | |
const uint8_t w = m_rng_out[i]; | |
m_rng_out[i] = (w >> shift) | carry; | |
carry = w << (8 - shift); | |
} | |
uint8_t carry = 0; | |
BufferSlicer slicer(m_rng_out); | |
BufferStuffer stuffer(m_rng_out); | |
while(slicer.remaining() >= 8) { | |
const auto w = load_be(slicer.take<8>()); | |
stuffer.append(store_be((w >> shift) | carry)); | |
carry = w << (64 - shift); | |
} | |
while(!slicer.empty()) { | |
const uint8_t w = slicer.take_byte(); | |
stuffer.append((w >> shift) | carry); | |
carry = w << (8 - shift); | |
} |
The BufferTransformer
proposed in #3942 could slim this down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In more general terms: this tiered transformation could even be a feature of the BufferTransformer
, where one passes a lambda and the BufferTransformer
takes care of the load/stores and the optimal strides. Like so:
BufferTransformer bt(m_rng_out);
bt.transform([carry = 0, shift](std::unsigned_integral auto i) mutable -> decltype(i) {
const auto r = (i >> shift) | carry;
carry = i << (sizeof(i) * 8 - shift);
return r;
});
... just thinking out loud, I'm aware that this might be a vast over-engineering. However, I believe, in #3883 I could use this for the bitvector<>
implementation that needs something similar for its bit operations.
Also, e.g. for block ciphers BT::transform
could be overloaded with a statically-sized span to factor out the loop and buffer handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this specific case, this shift only occurs for curves where the group order is not a multiple of 8, which consists of a set of weird curves (P239, etc almost all of which are deprecated already) plus P-521. So here I don’t think it matters much, especially when HMAC_DRBG is so expensive.
However I do think this general concept is worth exploring. A related (but possibly dissimilar enough that it’s not worth attempting to handle in the same abstraction) issue is in the mp code where we have explict unrolling for 8 and sometimes 4 words for various algorithms followed by a loop that handles the tail one word at a time. If we did this systematically for 16,8,4,2,1 word increments, that would likely lead to some nice performance improvements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: This draft of BufferTransformer
actually demonstates how we could systematically do the tiered block processing. Along those lines:
BufferTransformer bt(m_rng_out);
uint8_t carry = 0;
bt.process_blocks_of<8, 1>([&]<size_t s>(std::span<const uint8_t, s> in, std::span<uint8_t, s> out) {
const auto i = load_be(in);
using block_type = decltype(i);
const auto r = static_cast<block_type>((i >> shift) | static_cast<block_type>(carry));
carry = i << (sizeof(block_type) * 8 - shift);
store_be(out, r);
});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@reneme Was not sure if you were planning further review of this |
I believe I wasn't fully done but no need to block the merge. |
f78faf5
to
2ada5e5
Compare
OK I'm going to go ahead and merge. Feel free to do a retro-review anytime, we have plenty of time between now and 3.6 |
These have a much more restrictive interface, as compared to our existing EC_Point and BigInt Co-authored-by: René Meusel <[email protected]>
Co-Authored-By: René Meusel <[email protected]>
Co-authored-by: René Meusel <[email protected]>
Co-authored-by: René Meusel <[email protected]>
Co-authored-by: René Meusel <[email protected]>
This will allow switching in pcurves later
2ada5e5
to
8042e9d
Compare
This is the other side of #3979 where we push the interface that's exposed to the rest of the library to a non-BigInt oriented model.
#4027