-
Notifications
You must be signed in to change notification settings - Fork 574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor: Have a common base for Kyber/Dilithium structures #4024
Refactor: Have a common base for Kyber/Dilithium structures #4024
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as outdated.
This comment was marked as outdated.
64f9db8
to
066fc6d
Compare
dd45e61
to
c30b9fe
Compare
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below are a few highlights and potential conversion starters of things I saw along the way. :)
// This is a mitigation for a potential side channel called "KyberSlash". | ||
// It implements the division by Q using a multiplication and a shift. Most | ||
// compilers would generate similar code for such a division by a constant. | ||
// Though, in some cases, compilers might use a variable-time int division, | ||
// resulting in a potential side channel. | ||
// | ||
// See "Hacker's Delight" (Second Edition) by Henry S. Warren, Jr. | ||
// Chapter 10-9 "Unsigned Division by Divisors >= 1" | ||
auto divide_by_q = [](uint32_t n) -> unsigned_T { | ||
static_assert(KyberConstants::Q == 3329); | ||
BOTAN_DEBUG_ASSERT(n < (1 << 23)); | ||
|
||
// These constants work for all values that appear in Kyber with the | ||
// greatest being 3328 * 2^11 + Q // 2 = 6,817,408 < 2**23 = 8,388,608. | ||
constexpr uint64_t m = 2580335; | ||
constexpr size_t p = 33; | ||
return static_cast<unsigned_T>((n * m) >> p); | ||
}; | ||
|
||
constexpr unsigned_T mask = (1 << d) - 1; | ||
return divide_by_q((static_cast<uint32_t>(x) << d) + KyberConstants::Q / 2) & mask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've adapted the KyberSlash countermeasure to also work for the compression instances with d = 10
and d = 11
. Not sure it's necessary but doesn't hurt either. I'd definitely be happy about some other sharp pair of eyes.
// TODO: perhaps secure vector | ||
std::vector<T> m_coeffs_storage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Polynomials store their coefficients as a std::vector<>
. Some of those store secret values and should therefore use secure_vector
. Also, it might be beneficial to add first-level support for Polynomials
(and friends) as Strong types. (Most likely this would be another pull request, though.)
namespace detail { | ||
|
||
constexpr auto as_byte_source(BufferSlicer& slicer) { | ||
return [&](std::span<uint8_t> out) { slicer.copy_into(out); }; | ||
} | ||
|
||
constexpr auto as_byte_source(Botan::XOF& xof) { | ||
return [&](std::span<uint8_t> out) { xof.output(out); }; | ||
} | ||
|
||
} // namespace detail | ||
|
||
template <typename T> | ||
concept byte_source = | ||
requires(T& t) { requires std::invocable<decltype(detail::as_byte_source(t)), std::span<uint8_t>>; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is a top-level concept we might want to establish deeper in the library. Perhaps as an (internal) replacement for the DataSource
class.
Namely: A thing that can provide a concrete number of bytes where I don't really know how many bytes will be needed. The source may be infinite or not.
Things that could fall in this abstraction may be:
- the
RandomNumberGenerator
- a
XOF
- read from some buffer (think
BufferSlicer
) - from some I/O source (e.g. the network)
In Kyber/Dithithium this is used to either sample some random polynomial from a XOF or to unmarshal it from a bitpacked encoding.
Other use cases may be:
- parsing in general (TLS, X.509 and friends)
- data loading, e.g. like:
load_be<uint64_t>(rng)
to get a random 64 bit integer. (See also Internal: load_be/le should accept a BufferSlicer #4068) - among other things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a technical perspective, one might even implement it like shown here: As a C++20 concept that tries to build a thin wrapper around "the source". That way, new data sources could just hook themselves into this "helper" and decide for themselves to be a "data source" or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool idea 👍. I'm not sure if we want to commit to a pure concept construction, though. In some cases, it may be more convenient to have a class like DataSource. We could still extend the DataSource class to fulfill the respective concept if one wants to juggle with templates.
// TODO: Remove the need for the montgomery transformation | ||
// | ||
// -> When calculating A*s below, A is not in montgomery form, but s is. The | ||
// operation uses fqmul internally, which performs a montgomery reduction. | ||
auto A = montgomery(kyber_sample_matrix(rho, false /* not transposed */, mode)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was different in the reference implementation. @FAlbertDev we did look into this a while ago, I feel we should take this up again and get to the bottom of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just investigated the usage of Montgomery in Kyber. First, a short recap on Montgomery:
The main observation is that every Montgomery multiplication
In our Kyber and Dilithium code, the following operations add or remove the Montgomery factor.
- the function
montgomery
adds a factor$R$ - the function
inv_ntt
adds a factor$R$ . Note thatntt
doesn't add a Montgomery factor. - As already said, a Montgomery multiplication adds a factor
$R^{-1}$ to each poly, vector, or matrix element. This holds for all multiplication types, i.e., polynomial-polynomial, vector-vector, matrix-vector multiplication, etc.
For encryption and decryption, this setup works well since there is always one multiplication (adding factor inv_ntt
(adding factor inv_ntt
. We have a factor of montgomery
. Instead, I propose to add this factor to t
after the multiplication. The fewer elements multiplied by
auto A = kyber_sample_matrix(rho, false /* not transposed */, mode);
auto s = ntt(ps.sample_polynomial_vector_cbd_eta1());
const auto e = ntt(ps.sample_polynomial_vector_cbd_eta1());
auto t = montgomery(A * s);
t += e;
t.reduce();
// End Algorithm 12 ---------------------------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for double-checking and the detailed write-up! It would probably be really nice to track the montgomery parameter in the type system as well, but it does feel like overkill.
Also: by applying montgomery()
to the t
vector, we can remove the overload for PolynomialMatrix
altogether. :)
using KyberPolyNTT = Botan::CRYSTALS::Polynomial<KyberPolyTraits, Botan::CRYSTALS::Domain::NTT>; | ||
using KyberPolyVecNTT = Botan::CRYSTALS::PolynomialVector<KyberPolyTraits, Botan::CRYSTALS::Domain::NTT>; | ||
using KyberPolyMat = Botan::CRYSTALS::PolynomialMatrix<KyberPolyTraits>; | ||
|
||
using KyberPoly = Botan::CRYSTALS::Polynomial<KyberPolyTraits, Botan::CRYSTALS::Domain::Normal>; | ||
using KyberPolyVec = Botan::CRYSTALS::PolynomialVector<KyberPolyTraits, Botan::CRYSTALS::Domain::Normal>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned before: it would be good if these could be the basis for strong types to disambiguate them in interfaces E.g. to replace the multiple poly_pack_t0
, ..._t1
, ... methods in dilithium_algos.cpp
with an overloaded poly_pack()
that just "does the right thing" ™ based on the strong type.
enum DilithiumTau : uint32_t { _39 = 39, _49 = 49, _60 = 60 }; | ||
|
||
enum DilithiumLambda : uint32_t { _128 = 128, _192 = 192, _256 = 256 }; | ||
|
||
enum DilithiumGamma1 : uint32_t { ToThe17th = (1 << 17), ToThe19th = (1 << 19) }; | ||
|
||
enum DilithiumGamma2 : uint32_t { Qminus1DevidedBy88 = (Q - 1) / 88, Qminus1DevidedBy32 = (Q - 1) / 32 }; | ||
|
||
enum DilithiumEta : uint32_t { _2 = 2, _4 = 4 }; | ||
|
||
enum DilithiumBeta : uint32_t { _78 = 78, _196 = 196, _120 = 120 }; | ||
|
||
enum DilithiumOmega : uint32_t { _80 = 80, _55 = 55, _75 = 75 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dilithium has many integer constants that change depending on the instantiation of the algorithm. I found it worthwhile to restrict them with explicit enums. I'm okay with this concept, also given the restricted (algorithm-internal) scope. But I feel there must be a better way for this. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like this idea. While the implied switch-cases are annoying, they guarantee we do not forget a value.
dilithium_poly_pack_t0(p, stuffer); | ||
} | ||
|
||
BOTAN_ASSERT_NOMSG(stuffer.full()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those became ubiquitous and at this point I do start wondering whether we should just put that in the destructors of Stuffer
and Slicer
(perhaps explicitly disengagable). What do y'all think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do that after 3.5.0 is cut, to avoid unwanted surprises.
cf16bf3
to
3fc00ec
Compare
3fc00ec
to
8dcd72b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only looked into dilithium and the crystals module so far. The other sections (including Kyber) will come tomorrow.
For algorithms of this high complexity, the new structure looks very well-organized and clear :)
I've only found some minor nits so far.
(Btw sorry if I also marked lots of removed code. The VSCode extension for GitHub caused this)
1a3ac57
to
ae4298c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final review iteration.
I really like your work. IMO it's the better reference implementation ;)
// TODO: Remove the need for the montgomery transformation | ||
// | ||
// -> When calculating A*s below, A is not in montgomery form, but s is. The | ||
// operation uses fqmul internally, which performs a montgomery reduction. | ||
auto A = montgomery(kyber_sample_matrix(rho, false /* not transposed */, mode)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just investigated the usage of Montgomery in Kyber. First, a short recap on Montgomery:
The main observation is that every Montgomery multiplication
In our Kyber and Dilithium code, the following operations add or remove the Montgomery factor.
- the function
montgomery
adds a factor$R$ - the function
inv_ntt
adds a factor$R$ . Note thatntt
doesn't add a Montgomery factor. - As already said, a Montgomery multiplication adds a factor
$R^{-1}$ to each poly, vector, or matrix element. This holds for all multiplication types, i.e., polynomial-polynomial, vector-vector, matrix-vector multiplication, etc.
For encryption and decryption, this setup works well since there is always one multiplication (adding factor inv_ntt
(adding factor inv_ntt
. We have a factor of montgomery
. Instead, I propose to add this factor to t
after the multiplication. The fewer elements multiplied by
auto A = kyber_sample_matrix(rho, false /* not transposed */, mode);
auto s = ntt(ps.sample_polynomial_vector_cbd_eta1());
const auto e = ntt(ps.sample_polynomial_vector_cbd_eta1());
auto t = montgomery(A * s);
t += e;
t.reduce();
// End Algorithm 12 ---------------------------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review still need to look at changes in dilithium
and kyber
e6a97fd
to
e9863c1
Compare
Everything so far looks good but I'm not enthused about dropping big refactors just before a release so pushing this to 3.6.0 |
See: randombit/botan#3887 This is a fairly minimal adaption, once the full refactoring is merged, we'll have to rewrite this more substantially. That won't happen before Botan 3.5.0, though. See also: randombit/botan#4024
template <int32_t range, crystals_trait PolyTrait, Domain D, coeff_map_fn<typename PolyTrait::T> MapFnT> | ||
constexpr void pack(const Polynomial<PolyTrait, D>& p, BufferStuffer& stuffer, MapFnT map) { | ||
using trait = BitPackingTrait<range, PolyTrait>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the gist of this be useful beyond the CRYSTALS scope? E.g. FrodoKEM has a pack/unpack routine that -- on first glance at least -- looks like it could use it as well. @randombit perhaps there are more examples?
botan/src/lib/pubkey/frodokem/frodokem_common/frodo_matrix.cpp
Lines 340 to 384 in 3657a72
void FrodoMatrix::pack(const FrodoKEMConstants& constants, StrongSpan<FrodoPackedMatrix> out) const { | |
const size_t outlen = packed_size(constants); | |
BOTAN_ASSERT_NOMSG(out.size() == outlen); | |
size_t i = 0; // whole bytes already filled in | |
size_t j = 0; // whole uint16_t already copied | |
uint16_t w = 0; // the leftover, not yet copied | |
uint8_t bits = 0; // the number of lsb in w | |
while(i < outlen && (j < element_count() || ((j == element_count()) && (bits > 0)))) { | |
/* | |
in: | | |********|********| | |
^ | |
j | |
w : | ****| | |
^ | |
bits | |
out:|**|**|**|**|**|**|**|**|* | | |
^^ | |
ib | |
*/ | |
uint8_t b = 0; // bits in out[i] already filled in | |
while(b < 8) { | |
const uint8_t nbits = std::min(static_cast<uint8_t>(8 - b), bits); | |
const uint16_t mask = static_cast<uint16_t>(1 << nbits) - 1; | |
const auto t = static_cast<uint8_t>((w >> (bits - nbits)) & mask); // the bits to copy from w to out | |
out[i] = out[i] + static_cast<uint8_t>(t << (8 - b - nbits)); | |
b += nbits; | |
bits -= nbits; | |
if(bits == 0) { | |
if(j < element_count()) { | |
w = m_elements.at(j); | |
bits = static_cast<uint8_t>(constants.d()); | |
j++; | |
} else { | |
break; // the input vector is exhausted | |
} | |
} | |
} | |
if(b == 8) { // out[i] is filled in | |
i++; | |
} | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very good. I left one bikeshed suggestion that to me feels a little cleaner - but it's marginal, feel free to ignore if you disagree.
Final comment, could use some history cleanup before merge |
Most notably, this is an abstract implementation to handle CRYSTALS polynomials, poly vectors and poly matrices. Also, a generic implementation for bit-packed encoding/decoding of coefficients useful for both Kyber and Dilithium. Co-Authored-By: Fabian Albert <[email protected]>
t1 is part of the public key and thus independent of any other verification input. Precomputing it saves about 20% of verification runtime when performing more than a single verification with the same Botan::PK_Verifier. Co-Authored-By: Fabian Albert <[email protected]>
For Kyber those are {512,768,1024} and for Dilithium {44,65,87} depending on the respective choice of parameter set.
a476620
to
8bd26c7
Compare
Rebased to master, cleaned up history and built the bikeshed. Waiting for CI, then finally drawing a line here. 🥳 |
This is essentially a complete overhaul of the Kyber and Dilithium implementations aiming at as much shared code as possible between the two algorithms. The rough structure is outlined below. I split-up the changes into three commits, that could be separate pull requests, if needed.
Module:
pqcrystals
This is a templated base for the polynomials (as well as poly vectors/matrices) used in both implementations. The base implementation shares math where possible and is pluggable where necessary. It also exposes its domain (NTT or "normal") in the type system. Also, it makes some initial optimization effort (mostly memory layout), but more work could be done here.
Both algorithms contain bit-packing algorithms to efficiently encode polynomial coefficients of various known value ranges. Both Kyber and Dilithium now share a generic and pluggable implementation of this.
Module:
kyber
This now makes use of the
pqcrystals
module. The polynomial instantiation happens inkyber_polynomial.h
(replacing the oldkyber_structures.h
). Most low-level "algorithms" (pseudo-codes of the spec) live inkyber_algos.cpp
. The indcpa "K-PKE" scheme is implemented inkyber_keys.cpp
and the cca2-secure wrapper can be found inkyber.cpp
. Auxiliary stuff like constants and strong types are in other headers.Note that this now contains almost no verbatim code from the reference implementation anymore.
Module:
dilithium
Likewise, this uses the
pqcrystals
module (instantiating the polynomials indilithium_polynomial.h
(replacing the almost verbatim reference implementation code). Thedilithium_algos.cpp
is home to most low-level "algorithms" (as seen in the spec).dilithium.cpp
houses the actual signature/verification algorihtms that bind it all together..Care has been taken to have a consistent implementation structure between Kyber and Dilithium. Note, though, that Dilithium's flexibility might require some love before a clean integration of ML-KEM (similar to what happened to Kyber in #3887).
Benchmark
Despite not explicitly aiming to increase speed,./botan speed
shows nice gains for Kyber and no losses for Dilithium.It might be worth looking into why Kyber is so much faster, though. Feels suspicious.Edit: The new implementation pre-computes and stores the expanded public matrix A. So as long as one re-uses thePK_KEM_Encryptor
for multiple encapsulations, one saves a lot of invocations to the XOF. That explains. 😏GCC 13.2 vs. Clang 18.1 on Linux (in WSL)