Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation-defined behaviour in montgomery_reduce() #77

Open
rod-chapman opened this issue Jun 6, 2024 · 11 comments
Open

Implementation-defined behaviour in montgomery_reduce() #77

rod-chapman opened this issue Jun 6, 2024 · 11 comments

Comments

@rod-chapman
Copy link

Our attempts at formal verification of the montgomery_reduce() function in ref/reduce.c reveal that this function depends on a subtle implementation-defined behaviour - namely a cast from int32_t to int16_t when the valued being converted lies outside the range of int16_t. (See C17 or C23 standards, para 6.3.1.3 for details.)

This is unfortunate from a portability point of view. It also confounds understanding of the code, since the actual semantics typically implemented by gcc and clang is rather non-obvious.

For example, line 20 of reduce.c:
t = (int16_t)a * QINV;
the cast of a to int16_t exhibits implementation-defined behaviour. There is also a second (invisible, implicit) cast there to int16_t that follows the evaluation of the multiplication, just before the assignment to t takes place.

I have developed an alternative implementation that is free from such non-portable behaviour.

@hanno-becker
Copy link

hanno-becker commented Jun 7, 2024

Conceptually, a * QINV is an unsigned computation mod 2^16, so I think the conversion int32_t -> int16_t should be replaced by int32_t -> uint16_t, which is well-defined. However, the result must then be lifted to a signed canonical representative, and I have not come across a C implementation not implementing uint16_t -> int16_t as exactly that (anyone else?). Still, @rod-chapman's right that this is strictly speaking implementation-defined.

I'd be inclined to merely document this limitation (in the code and at the top-level stating the C language requirements for this repository) and change the code to something like this:

int16_t montgomery_reduce(int32_t a)
{
  uint16_t u;
  int16_t t;
  // Compute a*q^{-1} mod 2^16 in unsigned representatives
  u = (uint16_t)a * QINV;
  // Lift to signed canonical representative mod 2^16.
  // PORTABILITY: This relies on uint16_t -> int16_t
  // being implemented as the inverse of int16_t -> uint16_t,
  // which is not mandated by the standard.
  t = (int16_t) u;
  // By construction, the LHS is divisible by 2^16
  t = (a - (int32_t)t*KYBER_Q) >> 16;
  return t;
}

(NB: IIUC, semantically the arithmetic (uint16_t) a * QINV would still be signed (if sizeof(int) > 2) due to integer promotion, but that doesn't matter, again since cast to an unsigned type is always well-defined)

@cryptojedi
Copy link
Contributor

@rod-chapman Sorry that I didn't follow up on this for so long. Can you share the version you wrote that avoids the implementation-defined behavior?

@rod-chapman
Copy link
Author

I'll have a look...

@rod-chapman
Copy link
Author

Let's start at the beginning - can we arrive at a post-condition for montgomery_reduce(), expressed as a legal and well-defined C expression? I'm struggling a bit with this. R**-1 (mod Q) where R==2**16 is 169 as far as I can tell, so for
r = montgomery_reduce(x);
we require
r % Q == (x * 169) % Q
but that doesn't seem to work because C's % operator doesn't work that way for negative left operand.

Any ideas?

@cryptojedi
Copy link
Contributor

The two post-conditions are
(r*(1<<16)-x)%KYBER_Q == 0
and
r >= -(KYBER_Q+1) && r <= KYBER_Q-1

@rod-chapman
Copy link
Author

The comment in kyber/ref/reduce.c says
Returns: integer in {-q+1,...,q-1} congruent to a * R^-1 modulo q.
so is it really supposed to be
r >= -(KYBER_Q+1) // as you have above
or is it
r >= (-KYBER_Q)+1

??

@rod-chapman
Copy link
Author

OK... I have an implementation which I think is well-defined, portable, and proves correct with CBMC.
BUT... it's slow - 14 instructions on AArch64 compared to 7 for the original, using gcc 13.1.0 -O3
It also depends on a conditional subtraction that is not guaranteed 100% to be constant-time.

@cryptojedi
Copy link
Contributor

cryptojedi commented Sep 8, 2024 via email

@rod-chapman
Copy link
Author

rod-chapman commented Sep 9, 2024

Yes... I too am concerned about the conditional subtraction being compiled to a branch. The "right" way to do it would be to generate a "mask" value using Dan Bernstein's library, but that will further erode performance.

And yes... I am assuming that % and / by a power of two will be implemented as a mask or shift reglardless of optimization level.

I am sympathetic that leaving the code alone and documenting the assumptions properly is better for now.

hanno-becker added a commit to pq-code-package/mlkem-native that referenced this issue Oct 11, 2024
hanno-becker added a commit to pq-code-package/mlkem-native that referenced this issue Oct 11, 2024
hanno-becker added a commit to pq-code-package/mlkem-native that referenced this issue Oct 11, 2024
hanno-becker added a commit to pq-code-package/mlkem-native that referenced this issue Oct 11, 2024
* Remove redundant `poly_reduce()` calls

The typical flow of arithmetic data is as follows:
- Forward NTT
- Basemul
- One of:
  * Remove Montgomery factor + invNTT
  * Remove Montgomery factor
- Reduction
- Serialization

At present, `poly_reduce()` features in multple places in this flow,
serving multiple purposes:

First, it is used after the fwd NTT and after the base mul. The purpose
of those `poly_reduce()` calls is to merely keep the data small, with
details of canonicity and sign being irrelevant.

Second, `poly_reduce()` is used ahead of serialization:
Here, we do need _unsigned_ canonical representatives. The C and AArch64
versions of `poly_reduce()` produce _signed_ canonical Barret reduction,
forcing the serialization functions to complete the normalization through a
conditional addition. The AVX2 code uses an _unsigned_ 'almost-canonical'
`poly_reduce()`, giving a representative in [0,q] (inclusive upper bound!)
which are subject to conditional _subtraction_ during serialization.

In a nutshell, conceptually we'd like an unsigned canonical reduction
ahead of serialization, and non-strict (faster) reduction everywhere else.

Looking closer, the uses of `poly_reduce()` after the forward NTT and after
the basemul appear unnecessary.

Reduction after basemul:
- Basemul output fed into `poly_tomont()` is reduced through the Montgomery
  multiplication that's done in `poly_tomont()`.
- Basemul output fed into inv NTT is potentially problematic if not reduced:
  The invNTT does currently _not_ reduce its input. _However_, the invNTT
  does need a scaling by `mont^2/128`. This scaling is currently happening
  at the _end_ of the invNTT. Instead moving it to the _start_ of the invNTT
  reduces the input in the same way as `poly_tomont()` does.
  This change affects both the C reference NTT as well as the AArch64 NTT.

Reduction after fwd NTT: A reduction after the forward NTT is not needed
since base multiplication does not overflow provided _one_ factor is bound
by q, which is always the case in MLKEM.

Signed-off-by: Hanno Becker <[email protected]>

* Document and runtime-check bounds for arithmetic data

This commit documents bounds on arithmetic data as it flows
through common operations like NTT, invNTT and basemul.

Moreover, it introduces debug macros like POLY_BOUND or
POLYVEC_BOUND which can be used in debug builds to check
the documented bounds at runtime. Moreover, at a later
point those assertions should be proved by converting them
to CBMC assertions.

Finally, the bounds show that canonicity of the result of
the NTT is not needed, and this commit removes the Barrett
reduction at the end of the AArch64 ASM NTT.

Signed-off-by: Hanno Becker <[email protected]>

* Reduce pk & sk after unpacking

It's not standards-required for the unpacked sk to be be reduced
(its coefficients can be up to 2^12 in size), but our bounds
reasoning assumes it to be.

For pk, it must be checked at the top-level that the byte stream
unpacks to coefficients in bound, which however has not yet been
implemented. Until that's done, reduce the pk after unpacking, so
that the lower level routines are only called with canonical pk.

Signed-off-by: Hanno Becker <[email protected]>

* Add Python script confirming Barrett/Montgomery relation and bound

Signed-off-by: Hanno Becker <[email protected]>

* Remove TODO in AArch64 intt_clean.S

Signed-off-by: Hanno Becker <[email protected]>

* Add bound for fwd NTT and static non-overflow assertion in keygen

Signed-off-by: Hanno Becker <[email protected]>

* Fix formatting

Signed-off-by: Hanno Becker <[email protected]>

* Address further review feedback

Signed-off-by: Hanno Becker <[email protected]>

* Introduce single contractual bound for [inv]NTT output

Reasoning about safety of the C 'frontend' should not depend
on details of the arithmetic backend (ref/native). We thus
introduce single bounds NTT_BOUND and INVNTT_BOUND on the
absolute values of the output of the [inv]NTT that any
implementation needs to satisfy.

For every specific implementation, we still define and check
(in debug builds) for tighter bounds, plus add a static assertion
that the implementation-specific bound is smaller than the
contractual bound.

Signed-off-by: Hanno Becker <[email protected]>

* Rewrite bounds-checking macros to work with poly and poly_mulcache

Signed-off-by: Hanno Becker <[email protected]>

* Fix typo in ntt.c

Signed-off-by: Hanno Becker <[email protected]>

* Document output bounds for poly_mulcache_compute()

Signed-off-by: Hanno Becker <[email protected]>

* Document+Check input bound for polyvec_basemul_acc_montgomery_cached

Signed-off-by: Hanno Becker <[email protected]>

* Add input bounds check for NTT

Signed-off-by: Hanno Becker <[email protected]>

* Make wording of bounds estimates more accessible

Signed-off-by: Hanno Becker <[email protected]>

* Document implementation-defined C behaviour in montgomery_reduce()

See pq-crystals/kyber#77

Signed-off-by: Hanno Becker <[email protected]>

* Run functional CI in debugging mode

Signed-off-by: Hanno Becker <[email protected]>

---------

Signed-off-by: Hanno Becker <[email protected]>
@hanno-becker
Copy link

There is similar implementation-defined behaviour in barrett_reduce(). We merged code with suitable PORTABILITY sections into mlkem-c-aarch64, see https://github.com/pq-code-package/mlkem-c-aarch64/blob/main/mlkem/reduce.c. Let us know if you'd like an upstream patch along those lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants