Performance: old school slicing should generate inline code or do constant-folding #16887

mratsim · 2021-01-31T16:37:56Z

I was trying to optimize extremely performance sensitive big integer / cryptographic code where somehow I couldn't reach the speed of other Assembly implementations.

It turned out I was initializing an array with this line

`scratchSym`[0 .. `N`-1] = `a_MR`.toOpenArray(0, `N`-1)

Using this line instead, with staticFor being a compile-time loop unroller improved performance by 30%, previously my code took 70 cycles and now it takes 50 cycles.

staticFor i, 0, `N`: # Do NOT use Nim slice/toOpenArray, they are not inlined
    `scratchSym`[i] = `a_MR`[i]

(state-of-the art C++ JIT-ed code takes 55 cycles on my machine).

Profiling:

C code

N_LIB_PRIVATE N_NIMCALL(void, montRed_asm_adx_bmi2__ENBcQkN09aE9aDZ9c67ohtYkg)(NU64* r, tyArray__QpmKhiiBZ9b9bMXYThob0NHQ t, tyArray__4RmROn7lE6QlXejY1MVycQ M, NU64 m0ninv) {
	NU64 hi;
	NU64 lo;
	NU64 rdx;
	tyArray__BoAqkBLQz0ZcxF5EXnmcOA scratch;
	tyObject_HSlice__EE5dzjqoOrHT6HJhIPXAvA T1_;
	tyArray__T9bTwSavBMQqAy6Syjcf55Q tsub;
	nimfr_("montRed_asm_adx_bmi2", "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	nimln_(126, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	rdx = m0ninv;
	nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	nimln_(127, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	T1_ = dotdot___BokNSDrKN1xmV1nA01G9brAsystem(((NI) 0), ((NI) 5));
	if (((NI) 5)-((NI) 0) != -1 && (((NI) 5)-((NI) 0) < -1 || ((NI) 0) < 0 || ((NI) 0) > 11 || ((NI) 5) < 0 || ((NI) 5) > 11)){ raiseIndexError(); }
	nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	X5BX5Deq___1R7esKTmlP09aog9cZijvMMg(scratch, T1_, (NU64*)((t)+(((NI) 0))), (((NI) 5))-(((NI) 0))+1);
	nimln_(175, "/home/beta/Programming/Nim/constantine/constantine/primitives/macro_assembler_x86.nim");
	asm volatile(
"# \n"
"imulq %[scratch0], %[rdx]\n"
"# ---- Reduction 0\n"
"xorq %[scratch6], %[scratch6]\n"
"# \n"
"mulxq (%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch0]\n"
"adoxq %[hi], %[scratch1]\n"
"# \n"
"mulxq 8(%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch1]\n"
"adoxq %[hi], %[scratch2]\n"
// .............

The text was updated successfully, but these errors were encountered:

Araq · 2021-02-01T12:39:48Z

Slicing via scratchSym[0 .. N-1] is the problem here, not toOpenarray.

mratsim · 2021-02-01T12:43:28Z

Ah indeed, so this should be inlined: https://github.com/nim-lang/Nim/blob/version-1-4/lib/system.nim#L2528-L2540

Nim/lib/system.nim

Lines 2528 to 2540 in 1d8b7aa

    
           proc `[]=`*[Idx, T, U, V](a: var array[Idx, T], x: HSlice[U, V], b: openArray[T]) = 
        
             ## Slice assignment for arrays. 
        
             ## 
        
             ## .. code-block:: Nim 
        
             ##   var a = [10, 20, 30, 40, 50] 
        
             ##   a[1..2] = @[99, 88] 
        
             ##   assert a == [10, 99, 88, 40, 50] 
        
             let xa = a ^^ x.a 
        
             let L = (a ^^ x.b) - xa + 1 
        
             if L == b.len: 
        
               for i in 0..<L: a[Idx(i + xa)] = b[i] 
        
             else: 
        
               sysFatal(RangeDefect, "different lengths for slice assignment")

mratsim added the Performance label Jan 31, 2021

mratsim mentioned this issue Jan 31, 2021

FpDbl revisited mratsim/constantine#144

Merged

4 tasks

Araq changed the title ~~Performance: toOpenArray should generate inline code or do constant-folding~~ Performance: old school slicing should generate inline code or do constant-folding Feb 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance: old school slicing should generate inline code or do constant-folding #16887

Performance: old school slicing should generate inline code or do constant-folding #16887

mratsim commented Jan 31, 2021 •

edited

Loading

Araq commented Feb 1, 2021

mratsim commented Feb 1, 2021 •

edited

Loading

Performance: old school slicing should generate inline code or do constant-folding #16887

Performance: old school slicing should generate inline code or do constant-folding #16887

Comments

mratsim commented Jan 31, 2021 • edited Loading

Araq commented Feb 1, 2021

mratsim commented Feb 1, 2021 • edited Loading

mratsim commented Jan 31, 2021 •

edited

Loading

mratsim commented Feb 1, 2021 •

edited

Loading