Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: old school slicing should generate inline code or do constant-folding #16887

Open
mratsim opened this issue Jan 31, 2021 · 2 comments

Comments

@mratsim
Copy link
Collaborator

mratsim commented Jan 31, 2021

I was trying to optimize extremely performance sensitive big integer / cryptographic code where somehow I couldn't reach the speed of other Assembly implementations.

It turned out I was initializing an array with this line

`scratchSym`[0 .. `N`-1] = `a_MR`.toOpenArray(0, `N`-1)

Using this line instead, with staticFor being a compile-time loop unroller improved performance by 30%, previously my code took 70 cycles and now it takes 50 cycles.

staticFor i, 0, `N`: # Do NOT use Nim slice/toOpenArray, they are not inlined
    `scratchSym`[i] = `a_MR`[i]

(state-of-the art C++ JIT-ed code takes 55 cycles on my machine).

Profiling:
image

image

C code

N_LIB_PRIVATE N_NIMCALL(void, montRed_asm_adx_bmi2__ENBcQkN09aE9aDZ9c67ohtYkg)(NU64* r, tyArray__QpmKhiiBZ9b9bMXYThob0NHQ t, tyArray__4RmROn7lE6QlXejY1MVycQ M, NU64 m0ninv) {
	NU64 hi;
	NU64 lo;
	NU64 rdx;
	tyArray__BoAqkBLQz0ZcxF5EXnmcOA scratch;
	tyObject_HSlice__EE5dzjqoOrHT6HJhIPXAvA T1_;
	tyArray__T9bTwSavBMQqAy6Syjcf55Q tsub;
	nimfr_("montRed_asm_adx_bmi2", "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	nimln_(126, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	rdx = m0ninv;
	nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	nimln_(127, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	T1_ = dotdot___BokNSDrKN1xmV1nA01G9brAsystem(((NI) 0), ((NI) 5));
	if (((NI) 5)-((NI) 0) != -1 && (((NI) 5)-((NI) 0) < -1 || ((NI) 0) < 0 || ((NI) 0) > 11 || ((NI) 5) < 0 || ((NI) 5) > 11)){ raiseIndexError(); }
	nimln_(125, "/home/beta/Programming/Nim/constantine/constantine/arithmetic/assembly/limbs_asm_montred_x86_adx_bmi2.nim");
	X5BX5Deq___1R7esKTmlP09aog9cZijvMMg(scratch, T1_, (NU64*)((t)+(((NI) 0))), (((NI) 5))-(((NI) 0))+1);
	nimln_(175, "/home/beta/Programming/Nim/constantine/constantine/primitives/macro_assembler_x86.nim");
	asm volatile(
"# \n"
"imulq %[scratch0], %[rdx]\n"
"# ---- Reduction 0\n"
"xorq %[scratch6], %[scratch6]\n"
"# \n"
"mulxq (%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch0]\n"
"adoxq %[hi], %[scratch1]\n"
"# \n"
"mulxq 8(%[M]), %[lo], %[hi]\n"
"adcxq %[lo], %[scratch1]\n"
"adoxq %[hi], %[scratch2]\n"
// .............
@Araq
Copy link
Member

Araq commented Feb 1, 2021

Slicing via scratchSym[0 .. N-1] is the problem here, not toOpenarray.

@Araq Araq changed the title Performance: toOpenArray should generate inline code or do constant-folding Performance: old school slicing should generate inline code or do constant-folding Feb 1, 2021
@mratsim
Copy link
Collaborator Author

mratsim commented Feb 1, 2021

Ah indeed, so this should be inlined: https://github.com/nim-lang/Nim/blob/version-1-4/lib/system.nim#L2528-L2540

Nim/lib/system.nim

Lines 2528 to 2540 in 1d8b7aa

proc `[]=`*[Idx, T, U, V](a: var array[Idx, T], x: HSlice[U, V], b: openArray[T]) =
## Slice assignment for arrays.
##
## .. code-block:: Nim
## var a = [10, 20, 30, 40, 50]
## a[1..2] = @[99, 88]
## assert a == [10, 99, 88, 40, 50]
let xa = a ^^ x.a
let L = (a ^^ x.b) - xa + 1
if L == b.len:
for i in 0..<L: a[Idx(i + xa)] = b[i]
else:
sysFatal(RangeDefect, "different lengths for slice assignment")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants