FpDbl revisited #144

Merged: 18 commits merged into master from fpdbl-revisited on Feb 1, 2021


mratsim (Owner) commented on Jan 31, 2021

For fast Fp2, Fp4, and Fp12 implementations, we should take advantage of lazy reductions; see #15 (comment).
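As a rough illustration of what lazy reduction buys in Fp2, here is a minimal sketch on a toy field. Everything in it is an assumption made for illustration and not code from this repository: the 31-bit modulus `P`, the `Fp2` tuple type, the `reduce` counter, and the choice of Fp2 = Fp[u]/(u² + 1). A plain `mod` stands in for Montgomery reduction, and a `uint64` plays the role of the double-width (FpDbl) accumulator.

```nim
# Toy model: a 31-bit prime so that unreduced "double-width" products fit in uint64.
const
  P = 2147483647'u64   # 2^31 - 1, hypothetical toy modulus (P mod 4 == 3, so u^2 = -1 works)
  P2 = P * P           # used to keep lazy subtractions non-negative

type Fp2 = tuple[c0, c1: uint64]   # c0 + c1*u with u^2 = -1

var reductions = 0                 # counts double-width -> single-width reductions

proc reduce(x: uint64): uint64 =
  ## Stand-in for a Montgomery reduction of a double-width value.
  inc reductions
  x mod P

proc addMod(a, b: uint64): uint64 =
  let s = a + b
  if s >= P: s - P else: s

proc subMod(a, b: uint64): uint64 =
  if a >= b: a - b else: a + P - b

proc mulEager(a, b: Fp2): Fp2 =
  ## Karatsuba with a reduction after each of the 3 multiplications.
  let t0 = reduce(a.c0 * b.c0)
  let t1 = reduce(a.c1 * b.c1)
  let t2 = reduce(addMod(a.c0, a.c1) * addMod(b.c0, b.c1))
  result.c0 = subMod(t0, t1)
  result.c1 = subMod(subMod(t2, t0), t1)

proc mulLazy(a, b: Fp2): Fp2 =
  ## Same Karatsuba, but sums are kept double-width and reduced once per coordinate.
  let t0 = a.c0 * b.c0                     # unreduced, < P^2
  let t1 = a.c1 * b.c1                     # unreduced, < P^2
  let t2 = (a.c0 + a.c1) * (b.c0 + b.c1)   # unreduced, < 4*P^2 < 2^64
  result.c0 = reduce(t0 + P2 - t1)         # a0*b0 - a1*b1, offset by P^2
  result.c1 = reduce(t2 - t0 - t1)         # a0*b1 + a1*b0, never negative

when isMainModule:
  let a: Fp2 = (c0: 123456789'u64, c1: 987654321'u64)
  let b: Fp2 = (c0: P - 1, c1: 42'u64)
  reductions = 0
  let eager = mulEager(a, b)
  echo "eager reductions: ", reductions
  reductions = 0
  let lazy = mulLazy(a, b)
  echo "lazy reductions:  ", reductions
  echo "results agree:    ", eager == lazy
```

In this sketch the eager version performs 3 reductions per Fp2 multiplication and the lazy version 2, which matches the "we do 2 reductions" count used in the discussion of the cycle budget.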

However, this was put on hold due to an unexplained 50-cycle difference between theory and practice, as noted here:

# Single-width [3 Mul, 2 Add, 3 Sub]
# 3*81 + 2*14 + 3*12 = 307 theoretical cycles
# 330 measured
# Double-Width
# 316 theoretical cycles
# 365 measured
# Reductions can be 2x10 faster using MCL algorithm
# but there are still unexplained 50 cycles diff between theo and measured
# and unexplained 30 cycles between Clang and GCC
# - Function calls?
# - push/pop stack?

Since we do 2 reductions, and my CPU is now running at 3.9GHz compared to 4.1GHz, we have now found the source of the difference between the theoretical cycle count and the measured one.

The cause is nim-lang/Nim#16887, which made reduction 20 cycles slower than necessary, and reduction is used twice in Fp2 multiplication.
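The arithmetic behind these figures can be tallied in a few lines. This is only a sketch that restates the numbers quoted above; the constant names are made up for illustration.

```nim
const
  mulCycles = 81
  addCycles = 14
  subCycles = 12

  # Single-width Fp2 multiplication: 3 Mul, 2 Add, 3 Sub
  singleWidthTheory = 3*mulCycles + 2*addCycles + 3*subCycles
  singleWidthMeasured = 330

  # Double-width (lazy reduction) figures quoted above
  doubleWidthTheory = 316
  doubleWidthMeasured = 365
  doubleWidthGap = doubleWidthMeasured - doubleWidthTheory

  # Penalty attributed to nim-lang/Nim#16887, paid once per reduction
  reductionPenalty = 20
  reductionsPerFp2Mul = 2
  explainedByReduction = reductionPenalty * reductionsPerFp2Mul

when isMainModule:
  echo "single-width theoretical cycles: ", singleWidthTheory
  echo "single-width gap:                ", singleWidthMeasured - singleWidthTheory
  echo "double-width gap:                ", doubleWidthGap
  echo "explained by slow reduction:     ", explainedByReduction
```

With the quoted numbers, the two slow reductions account for 40 of the roughly 50 cycles separating the double-width theoretical and measured counts.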

This PR:

  • Fixes Montgomery reduction performance issue
  • Implements a slower Comba Montgomery reduction (scalar code only)
  • Implements specialized squaring, in scalar code and assembly. There is no MULX/ADCX/ADOX code, as it requires a different algorithm.
    Assembly squaring is as fast as ADX multiplication, so we can expect ADX squaring to bring a conservative extra 15% performance boost (up to 40%, as squaring almost halves the number of operations).
  • Accelerates Fp2 Mul by 10% and all G1 operations by about 7% by removing copies that lead to bad codegen in FpAdd and FpSub:
    when UseASM_X86_64 and a.mres.limbs.len <= 6: # TODO: handle spilling
      r = a
      addmod_asm(r.mres.limbs, b.mres.limbs, FF.fieldMod().limbs)
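To make the "almost halves the number of operations" remark about squaring concrete, here is a small product-scanning (Comba-style) sketch. It is not the squaring code from this PR: the 16-bit limbs stored in `uint64`, the `Base` constant, and the `mulComba`/`sqrComba` names are all assumptions chosen so the example stays short and needs no 128-bit arithmetic. It only counts limb multiplications, which is not the same as cycles.

```nim
const Base = 1'u64 shl 16   # toy limb size: 16-bit limbs held in uint64 (illustration only)

proc mulComba(a, b: openArray[uint64]): tuple[limbs: seq[uint64], muls: int] =
  ## Product scanning: accumulate one output column at a time.
  result.limbs = newSeq[uint64](a.len + b.len)
  var acc = 0'u64
  for k in 0 ..< a.len + b.len - 1:
    for i in max(0, k - b.len + 1) .. min(k, a.len - 1):
      acc += a[i] * b[k - i]
      inc result.muls
    result.limbs[k] = acc mod Base
    acc = acc div Base
  result.limbs[a.len + b.len - 1] = acc

proc sqrComba(a: openArray[uint64]): tuple[limbs: seq[uint64], muls: int] =
  ## Squaring: each cross product a[i]*a[j] (i < j) is computed once and doubled.
  let n = a.len
  result.limbs = newSeq[uint64](2*n)
  var acc = 0'u64
  for k in 0 ..< 2*n - 1:
    var cross = 0'u64
    var i = max(0, k - n + 1)
    var j = k - i
    while i < j:
      cross += a[i] * a[j]
      inc result.muls
      inc i
      dec j
    acc += 2'u64 * cross
    if i == j:                 # diagonal term a[i]^2, present when k is even
      acc += a[i] * a[i]
      inc result.muls
    result.limbs[k] = acc mod Base
    acc = acc div Base
  result.limbs[2*n - 1] = acc

when isMainModule:
  let x = [0xFFFF'u64, 0x1234'u64, 0xABCD'u64, 0x0001'u64, 0xFFFF'u64, 0x8000'u64]
  let m = mulComba(x, x)
  let s = sqrComba(x)
  echo "same result:      ", m.limbs == s.limbs
  echo "mul limb products: ", m.muls
  echo "sqr limb products: ", s.muls
```

For 6 limbs the generic multiplication performs 36 limb products and the squaring 21, that is n(n+1)/2 instead of n², which is where the "almost halves" estimate comes from.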

mratsim merged commit 83dcd98 into master on Feb 1, 2021
mratsim deleted the fpdbl-revisited branch on February 1, 2021 at 02:52