FpDbl revisited #144

mratsim · 2021-01-31T16:50:01Z

For fast Fp2, Fp4, and Fp12 implementations, we should take advantage of lazy reductions; see #15 (comment).

However, this was put on hold due to an unexplained 50-cycle difference between theory and practice, as mentioned here:

constantine/constantine/tower_field_extensions/quadratic_extensions.nim

Lines 129 to 139 in d12d5fa

    
    # Single-width [3 Mul, 2 Add, 3 Sub]
    #    3*81 + 2*14 + 3*12 = 307 theoretical cycles
    #    330 measured
    # Double-Width
    #    316 theoretical cycles
    #    365 measured
    #    Reductions can be 2x10 faster using MCL algorithm
    #    but there are still unexplained 50 cycles diff between theo and measured
    #    and unexplained 30 cycles between Clang and GCC
    #    - Function calls?
    #    - push/pop stack?
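The cost model in that comment can be replayed as plain arithmetic. The sketch below is in Python purely for illustration. It assumes the per-operation costs map positionally onto the `[3 Mul, 2 Add, 3 Sub]` list (81 cycles per Mul, 14 per Add, 12 per Sub); the constant and function names are hypothetical.

```python
# Per-operation cycle costs, assumed from the comment "3*81 + 2*14 + 3*12".
MUL_CYCLES = 81
ADD_CYCLES = 14
SUB_CYCLES = 12


def single_width_cost(muls=3, adds=2, subs=3):
    """Theoretical cycle count for the single-width Fp2 multiplication."""
    return muls * MUL_CYCLES + adds * ADD_CYCLES + subs * SUB_CYCLES


# Theoretical and measured figures quoted in the comment.
THEORETICAL = {"single": single_width_cost(), "double": 316}
MEASURED = {"single": 330, "double": 365}

# Gap between measurement and theory for each variant.
GAP = {name: MEASURED[name] - THEORETICAL[name] for name in THEORETICAL}

if __name__ == "__main__":
    for name in ("single", "double"):
        print(name, THEORETICAL[name], MEASURED[name], GAP[name])
```

The double-width gap computed this way is the "unexplained 50 cycles" the comment refers to.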

Since we do 2 reductions, and my CPU is now running at 3.9 GHz compared to 4.1 GHz, we have now found the source of the difference between the theoretical cycle count and practice.

The cause is nim-lang/Nim#16887, which made reduction 20 cycles slower than necessary; reduction is used twice in Fp2 multiplication.
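To make "reduction is used twice" concrete, here is a minimal Python model of a lazily reduced Fp2 multiplication. It is a sketch under stated assumptions, not the Nim implementation in this PR: it assumes Fp2 is built as Fp[i] with i² = -1, uses a toy prime `P`, and lets Python's `%` stand in for Montgomery reduction of a double-width value. All names are hypothetical.

```python
# Toy prime with P % 4 == 3, so x^2 + 1 is irreducible over Fp (assumption).
P = 103

# Counts how many modular reductions were performed.
stats = {"reductions": 0}


def reduce(x):
    """Stand-in for a double-width Montgomery reduction."""
    stats["reductions"] += 1
    return x % P


def fp2_mul_lazy(a, b):
    """Karatsuba Fp2 multiplication: 3 Mul, 2 Add, 3 Sub, 2 reductions.

    The three products stay unreduced ("double-width"); only the two
    output coordinates are reduced. Real limb-based code must keep the
    subtractions non-negative (e.g. by adding a multiple of P); Python
    integers and % hide that detail here.
    """
    a0, a1 = a
    b0, b1 = b
    v0 = a0 * b0                      # Mul 1
    v1 = a1 * b1                      # Mul 2
    v2 = (a0 + a1) * (b0 + b1)        # Add 1, Add 2, Mul 3
    c0 = reduce(v0 - v1)              # Sub 1, reduction 1
    c1 = reduce(v2 - v0 - v1)         # Sub 2, Sub 3, reduction 2
    return (c0, c1)


def fp2_mul_reference(a, b):
    """Schoolbook formula (a0 + a1*i)(b0 + b1*i) with i^2 = -1."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 - a1 * b1) % P, (a0 * b1 + a1 * b0) % P)
```

With two reductions per multiplication, a fixed per-reduction overhead is paid twice, which is why a 20-cycle slowdown in reduction shows up so clearly at the Fp2 level.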

This PR:

Fixes the Montgomery reduction performance issue
Implements a slower Comba Montgomery reduction (scalar code only)
Implements specialized squaring, in scalar code and assembly. No MULX/ADCX/ADOX code, as it requires a different algorithm.
Assembly squaring is as fast as ADX multiplication, so we can expect ADX squaring to bring a conservative extra 15% performance boost (up to 40%, since squaring almost halves the number of operations).

Accelerates Fp2 Mul by 10% and all G1 operations by about 7% by removing copies that lead to bad codegen in FpAdd and FpSub:

constantine/constantine/arithmetic/finite_fields.nim

Lines 164 to 166 in d12d5fa

    
    when UseASM_X86_64 and a.mres.limbs.len <= 6: # TODO: handle spilling
      r = a
      addmod_asm(r.mres.limbs, b.mres.limbs, FF.fieldMod().limbs)
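On the squaring item above, the "almost halves the number of operations" remark can be checked by counting word multiplications. The Python sketch below compares schoolbook multiplication with a squaring that computes each off-diagonal product once and doubles it. It is a counting model only, not the assembly in this PR; the 64-bit limb size and all helper names are assumptions.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 64-bit limbs."""
    return [(x >> (WORD_BITS * i)) & WORD_MASK for i in range(n)]


def mul_schoolbook(a, b):
    """Full product of two limb arrays; returns (value, word multiplications)."""
    acc, count = 0, 0
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            acc += (ai * bj) << (WORD_BITS * (i + j))
            count += 1
    return acc, count


def square_symmetric(a):
    """Square using a[i]*a[j] == a[j]*a[i]: off-diagonal products are
    computed once and doubled; returns (value, word multiplications)."""
    acc, count = 0, 0
    n = len(a)
    for i in range(n):
        acc += (a[i] * a[i]) << (WORD_BITS * 2 * i)      # diagonal term
        count += 1
        for j in range(i + 1, n):
            acc += (2 * a[i] * a[j]) << (WORD_BITS * (i + j))  # doubled term
            count += 1
    return acc, count
```

For n limbs this is n² word multiplications against n(n+1)/2, so 36 against 21 for 6 limbs. The doubling and carry handling are not free, which is consistent with the more conservative 15% estimate.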

mratsim added 18 commits January 30, 2021 16:06

reorg mul -> limbs_double_width, ConstantineASM CttASM

5328a9f

Implement squaring specialized scalar path (22% faster than mul)

7e5a706

Implement "portable" assembly for squaring

c7c0e0e

stash part of the changes

4b51126

Reorg montgomery reduction - prepare to introduce Comba optimization

63cae2f

Implement comba Montgomery reduce (but it's slower!)

051a373

rename t -> a

3044624

30% performance improvement by avoiding toOpenArray!

73493e5

variable renaming

25bb0e0

Fix 32-bit imports

7d258aa

slightly better assembly for sub2x

f192e0a

There is an annoying bottleneck

4f326b2

use out-of-place Fp assembly instead of in-place

8dd6be8

diffAlias is unneeded now

7db08c4

cosmetic

468b4d4

speedup fpDbl sub by 20%

d4d2813

Fix Fp2 -> Fp6 -> Fp12 towering. It seems 5% faster

440bd9a

Stash ADCX/ADOX squaring

a5f76dc

mratsim merged commit 83dcd98 into master Feb 1, 2021

mratsim deleted the fpdbl-revisited branch February 1, 2021 02:52

This was referenced Feb 6, 2021

Reactivate fast squaring algorithms #68

Closed

Double-Precision towering #155

Merged

Double-precision cubic towering + pairing #158

Merged
