You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently Constantine is using the following internal API for add, sub, mul:
func `+=`*(a: var Fp, b: Fp)
func sum*(r: var Fp, a, b: Fp)
func `-=`*(a: var Fp, b: Fp)
func diff*(r: var Fp, a, b: Fp)
func prod*(r: var Fp, a, b: Fp)
However, should we instead use the following?
func `+=`*(a: var Fp, b: Fp)
func `+`*(a, b: Fp): Fp
func `-=`*(a: var Fp, b: Fp)
func `-`*(a, b: Fp): Fp
func `*`*(r: var Fp, a, b: Fp)
Tradeoffs
Ergonomics
While those APIs are internal, they are used in some quite long formulas and it's easier to implement them without mistakes if we closely follow the mathematical notation, for example for Algorithm 7 below
Performance
There are a couple potential performance issue.
Note that Fp is often Fp12 in pairing cases, which on the BLS12-381 curve (6 64-bit limbs) is of size 12 * 6 * 8 bytes = 576 bytes and so a huge parameter to pass by value semantics.
Parameter-passing of arrays
To be investigated, if we have a static array as result values, are they passed by stack? Or are they preallocated in the caller context and passed by EAX/RAX register.
Register vs memory
Key operations like add-with-carry behave differently if the destination is a memory location or a register. The latency can be up to 6 cycles instead of 1 cycle, see Agner Fog tables: https://www.agner.org/optimize/instruction_tables.pdf
Nehalem (2008)
Haswell (2013)
Skylake (2015)
Skylake-X (2015 - latest for consumer because Intel cannot produce new architectures)
Left highlighted column is latency: how many cycles do you have to wait if you depend on this result for further operation (like chained in-place additions or depending on a carry).
Right highlighted column is reciprocal throughput, how many cycles do you need to issue independent instructions. A value of 0.25 means 4 independent additions per cycle while a value of 2 means 1 every 2 cycles.
The last case ADC SBB m,r/i which is add-with-carry / sub-with-borrow: memory <- register, must absolutely be avoided due to a low throughput (once every 2 cycles) and high latency (5 cycles to wait) while the other order register <- memory has thoughput and latency of 1 cycle.
(Note: we can avoid the carry case all together with lazy carry but even though throughput is 4 per cycles instead of 1 per cycle, overall it is very likely to be slower due to more memory used (and multiplication costs growing in n² with n the number of limbs) and costly reduction operations in the general case see #15 (comment))
If the API is using var parameter that means we need a temporary variable in register anyway to store the result and then copy it to the destination.
In the second case, the compiler knows that the result is already a temporary variable on the stack.
The cost is potentially more stack usage due to those temporaries as the compiler does not directly construct the result in the destination buffer (i.e. we don't have copy-elision)
{.noInit.}
Nim zero-initializes every variable before use. The C compiler can usually detect and discard redundant initializations, for example the following:
let a =5
gets compiled into
NIa;
a= (NI)0;
a= (NI)5;
While this avoids a complete bug class, this when working with value object of size over 100 bytes (or even 576 bytes for FP12 elements in pairing) we have the following problems:
Does the compiler recognize and elide the zero-initialization?
If not, can we elide it manually?
For manual elision this can be done with {.noInit.}:
With in-place sum
var r{.noInit.}: FP12[BLS12_381]
r.sum(a, b)
However with a result value
proc`+`(a, b: Fp12) Fp12 {.noInit.} =discard"implementation"let r = a + b
The {.noInit.} applies to the function implicit result value, but does the final assignation to r, zero-initialize r under-the-hood and so do we need
proc`+`(a, b: Fp12) Fp12 {.noInit.} =discard"implementation"let r {.noInit.} = a + b
The text was updated successfully, but these errors were encountered:
Currently Constantine is using the following internal API for add, sub, mul:
However, should we instead use the following?
Tradeoffs
Ergonomics
While those APIs are internal, they are used in some quite long formulas and it's easier to implement them without mistakes if we closely follow the mathematical notation, for example for Algorithm 7 below
![image](https://user-images.githubusercontent.com/22738317/79041890-91b44000-7bf3-11ea-89bb-5e17cebf41dc.png)
Performance
There are a couple potential performance issue.
Note that
Fp
is oftenFp12
in pairing cases, which on the BLS12-381 curve (6 64-bit limbs) is of size12 * 6 * 8 bytes = 576 bytes
and so a huge parameter to pass by value semantics.Parameter-passing of arrays
To be investigated, if we have a static array as result values, are they passed by stack? Or are they preallocated in the caller context and passed by EAX/RAX register.
Register vs memory
Key operations like add-with-carry behave differently if the destination is a memory location or a register. The latency can be up to 6 cycles instead of 1 cycle, see Agner Fog tables: https://www.agner.org/optimize/instruction_tables.pdf
Nehalem (2008)
![image](https://user-images.githubusercontent.com/22738317/79042468-68e27980-7bf8-11ea-944b-e18e89a2a1c3.png)
![image](https://user-images.githubusercontent.com/22738317/79042485-8a436580-7bf8-11ea-924b-00811cf713ec.png)
![image](https://user-images.githubusercontent.com/22738317/79042598-66ccea80-7bf9-11ea-8e33-04ac93e130ee.png)
![image](https://user-images.githubusercontent.com/22738317/79042633-bf03ec80-7bf9-11ea-9e06-19abc8afb50a.png)
Haswell (2013)
Skylake (2015)
Skylake-X (2015 - latest for consumer because Intel cannot produce new architectures)
Left highlighted column is latency: how many cycles do you have to wait if you depend on this result for further operation (like chained in-place additions or depending on a carry).
Right highlighted column is reciprocal throughput, how many cycles do you need to issue independent instructions. A value of 0.25 means 4 independent additions per cycle while a value of 2 means 1 every 2 cycles.
The last case
ADC SBB m,r/i
which is add-with-carry / sub-with-borrow: memory <- register, must absolutely be avoided due to a low throughput (once every 2 cycles) and high latency (5 cycles to wait) while the other order register <- memory has thoughput and latency of 1 cycle.(Note: we can avoid the carry case all together with lazy carry but even though throughput is 4 per cycles instead of 1 per cycle, overall it is very likely to be slower due to more memory used (and multiplication costs growing in n² with n the number of limbs) and costly reduction operations in the general case see #15 (comment))
If the API is using var parameter that means we need a temporary variable in register anyway to store the result and then copy it to the destination.
In the second case, the compiler knows that the result is already a temporary variable on the stack.
The cost is potentially more stack usage due to those temporaries as the compiler does not directly construct the result in the destination buffer (i.e. we don't have copy-elision)
{.noInit.}
Nim zero-initializes every variable before use. The C compiler can usually detect and discard redundant initializations, for example the following:
gets compiled into
While this avoids a complete bug class, this when working with value object of size over 100 bytes (or even 576 bytes for FP12 elements in pairing) we have the following problems:
For manual elision this can be done with {.noInit.}:
With in-place sum
However with a result value
The {.noInit.} applies to the function implicit
result
value, but does the final assignation tor
, zero-initialize r under-the-hood and so do we needThe text was updated successfully, but these errors were encountered: