Internal API: in-place vs result #21

mratsim · 2020-04-11T12:09:32Z

Currently Constantine is using the following internal API for add, sub, mul:

func `+=`*(a: var Fp, b: Fp)
func sum*(r: var Fp, a, b: Fp)
func `-=`*(a: var Fp, b: Fp)
func diff*(r: var Fp, a, b: Fp)
func prod*(r: var Fp, a, b: Fp)

However, should we instead use the following?

func `+=`*(a: var Fp, b: Fp)
func `+`*(a, b: Fp): Fp
func `-=`*(a: var Fp, b: Fp)
func `-`*(a, b: Fp): Fp
func `*`*(r: var Fp, a, b: Fp)

Tradeoffs

Ergonomics

While those APIs are internal, they are used in some quite long formulas and it's easier to implement them without mistakes if we closely follow the mathematical notation, for example for Algorithm 7 below

Performance

There are a couple potential performance issue.
Note that Fp is often Fp12 in pairing cases, which on the BLS12-381 curve (6 64-bit limbs) is of size 12 * 6 * 8 bytes = 576 bytes and so a huge parameter to pass by value semantics.

Parameter-passing of arrays

To be investigated, if we have a static array as result values, are they passed by stack? Or are they preallocated in the caller context and passed by EAX/RAX register.

Register vs memory

Key operations like add-with-carry behave differently if the destination is a memory location or a register. The latency can be up to 6 cycles instead of 1 cycle, see Agner Fog tables: https://www.agner.org/optimize/instruction_tables.pdf

Nehalem (2008)

Haswell (2013)

Skylake (2015)

Skylake-X (2015 - latest for consumer because Intel cannot produce new architectures)

Left highlighted column is latency: how many cycles do you have to wait if you depend on this result for further operation (like chained in-place additions or depending on a carry).

Right highlighted column is reciprocal throughput, how many cycles do you need to issue independent instructions. A value of 0.25 means 4 independent additions per cycle while a value of 2 means 1 every 2 cycles.

The last case ADC SBB m,r/i which is add-with-carry / sub-with-borrow: memory <- register, must absolutely be avoided due to a low throughput (once every 2 cycles) and high latency (5 cycles to wait) while the other order register <- memory has thoughput and latency of 1 cycle.

(Note: we can avoid the carry case all together with lazy carry but even though throughput is 4 per cycles instead of 1 per cycle, overall it is very likely to be slower due to more memory used (and multiplication costs growing in n² with n the number of limbs) and costly reduction operations in the general case see #15 (comment))

If the API is using var parameter that means we need a temporary variable in register anyway to store the result and then copy it to the destination.

In the second case, the compiler knows that the result is already a temporary variable on the stack.

The cost is potentially more stack usage due to those temporaries as the compiler does not directly construct the result in the destination buffer (i.e. we don't have copy-elision)

{.noInit.}

Nim zero-initializes every variable before use. The C compiler can usually detect and discard redundant initializations, for example the following:

let a = 5

gets compiled into

NI a;
a = (NI)0;
a = (NI)5;

While this avoids a complete bug class, this when working with value object of size over 100 bytes (or even 576 bytes for FP12 elements in pairing) we have the following problems:

Does the compiler recognize and elide the zero-initialization?
If not, can we elide it manually?

For manual elision this can be done with {.noInit.}:
With in-place sum

var r{.noInit.}: FP12[BLS12_381]
r.sum(a, b)

However with a result value

proc `+`(a, b: Fp12) Fp12 {.noInit.} =
  discard "implementation"

let r = a + b

The {.noInit.} applies to the function implicit result value, but does the final assignation to r, zero-initialize r under-the-hood and so do we need

proc `+`(a, b: Fp12) Fp12 {.noInit.} =
  discard "implementation"

let r {.noInit.} = a + b

The text was updated successfully, but these errors were encountered:

mratsim · 2020-04-13T19:28:58Z

Note: this requires concepts to work for result:

either the natural way

func `+`*(a, b: CubicExtAddGroup): CubicExtAddGroup =
  result.sum(a, b)

func `-`*(a, b: CubicExtAddGroup): CubicExtAddGroup =
  result.sum(a, b)

or with a workaround in the vein of

func `+`*(a, b: CubicExtAddGroup): typeof(a) =
  result.sum(a, b)

func `-`*(a, b: CubicExtAddGroup): typeof(a) =
  result.sum(a, b)

Or alternatively implement the add/neg/sub/sum/diff proc for each of the extension fields ...

mratsim added performance 🏁 ergonomics 💯 labels Jun 6, 2020

mratsim closed this as completed Jan 1, 2022

mratsim mentioned this issue May 10, 2022

Routines for working with threshold signatures status-im/nim-blscurve#138

Merged

This was referenced Jul 27, 2024

Crandall primes #445

Merged

Low-level: discrepancy between field arithmetic performance and elliptic curve performance #446

Open

mratsim mentioned this issue Dec 12, 2024

Add ECDSA over secp256k1 signatures and verification #490

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Internal API: in-place vs result #21

Internal API: in-place vs result #21

mratsim commented Apr 11, 2020

mratsim commented Apr 13, 2020

Internal API: in-place vs result #21

Internal API: in-place vs result #21

Comments

mratsim commented Apr 11, 2020

Tradeoffs

Ergonomics

Performance

Parameter-passing of arrays

Register vs memory

{.noInit.}

mratsim commented Apr 13, 2020