
Performance problem with memory loads/stores #10301

Closed
epn opened this issue Feb 23, 2015 · 23 comments
Labels
performance Must go faster

Comments

@epn

epn commented Feb 23, 2015

Julia has the following performance problem with array accesses.
Consider the following code snippets.

# c and x are arrays of size n
function optimized(c, x)
  n = size(x, 1)
  x_ = x[n]
  for i = n-1:-1:1
    x_ = x[i] = (x[i] - c[i + 1] * x_)
  end
end

function unoptimized(c, x)
  n = size(x, 1)
  for i = n-1:-1:1
    x[i] = (x[i] - c[i + 1] * x[i + 1])
  end
end

Both functions perform the same computation. There is a loop-carried dependency in this code, that is, x[i] depends on the value of x[i+1].

While "optimized" reads x[i+1] from the local variable x_, "unoptimized" reloads x[i+1] from the array.
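
Unrolled by hand, the dependency chain is explicit: the value stored in one iteration is exactly the value loaded in the next.

x[n-1] = x[n-1] - c[n]   * x[n]
x[n-2] = x[n-2] - c[n-1] * x[n-1]   # reads the value stored on the line above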

For n = 100 million and 100 runs of each function on my MacBook Air, I get the following runtimes (in seconds):

optimized : min 0.307929443 max 0.448113027 mean 0.31504216288000003 variance 0.00036038772082227653
unoptimized : min 0.508484855 max 0.674296101 mean 0.54576100111 variance 0.001543397115546402

Similar runtimes were obtained with julia -O.
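
For reference, the statistics can be gathered with a harness along these lines (a sketch; the exact script wasn't posted, and the name bench is illustrative):

function bench(f, c, x, runs)
  times = Float64[]
  for k in 1:runs
    t = @elapsed f(c, x)   # wall-clock time of one call, in seconds
    push!(times, t)
  end
  minimum(times), maximum(times), mean(times), var(times)
end

bench(optimized, c, x, 100)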

I haven't looked at the assembly code, but it seems that the unoptimized version loads x[i+1] from memory even though x[i+1] was just computed in the previous iteration.

@JeffBezanson added the performance Must go faster label Feb 23, 2015
@JeffBezanson
Member

Acknowledged. Your guess about the generated code is probably right.

This seems like a very difficult optimization to obtain. @ArchRobison any idea how realistic this is?

@andreasnoack
Member

We realized this while working on a slightly more complicated problem. In a C implementation of the more complicated problem compiled with Clang, the extra load from memory was optimized away whenever we used an optimization level greater than one. Therefore, we saw the same difference as @epn mentioned above when compiling with cc -O1, but with cc -O3 both versions ran at the speed of optimized.

Unfortunately, Clang optimizes the examples in this issue slightly differently from the more complicated problem, and it is actually not able to remove the extra load from memory here. Instead the function is inlined, and the speed gain is the same. In contrast, gcc-4.9 optimizes the extra memory load away.

Assembly inner loops

Julia unoptimized

L116:   lea RSI, QWORD PTR [8*RAX]
    mov RDI, R9
    sub RDI, RSI
    mov RBX, R8
    sub RBX, RSI
    vmovsd  XMM0, QWORD PTR [RBX]
    vmulsd  XMM0, XMM0, QWORD PTR [RDI]
    vmovsd  XMM1, QWORD PTR [RDI - 8]
    vsubsd  XMM0, XMM1, XMM0
    vmovsd  QWORD PTR [RDI - 8], XMM0
    sub RAX, RDX
    cmp RCX, RAX
    jne L116

Clang -O1 and -O3

## BB#1:                                ## %.lr.ph.preheader
Ltmp6:
    ##DEBUG_VALUE: test:n <- RDI
    ##DEBUG_VALUE: test:c <- RSI
    ##DEBUG_VALUE: test:x <- RDX
    .loc    1 11 0                  ## test.c:11:0
    dec rdi
Ltmp7:
    .align  4, 0x90
LBB0_2:                                 ## %.lr.ph
                                        ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: test:c <- RSI
    ##DEBUG_VALUE: test:x <- RDX
    movsd   xmm0, qword ptr [rsi + 8*rdi]
    mulsd   xmm0, qword ptr [rdx + 8*rdi]
    movsd   xmm1, qword ptr [rdx + 8*rdi - 8]
    subsd   xmm1, xmm0
    movsd   qword ptr [rdx + 8*rdi - 8], xmm1
Ltmp8:
    .loc    1 9 0                   ## test.c:9:0
    dec rdi
    test    rdi, rdi
    jg  LBB0_2

GCC 4.9 with -O3 (in AT&T syntax because my gcc doesn't support Intel)

L3:
LM3:
    mulsd   8(%rsi,%rax,8), %xmm0
    movsd   (%rdx,%rax,8), %xmm1
    subsd   %xmm0, %xmm1
    movsd   %xmm1, (%rdx,%rax,8)
LM4:
    subq    $1, %rax
LVL2:
LM5:
    movapd  %xmm1, %xmm0
LM6:
    cmpq    $-1, %rax
    jne L3

I can post the example where Clang with -O3 optimized the load away, but the optimization is identical to the one GCC makes in the example from this issue. The performance gain from hand optimizing the function by explicitly saving the value in a local variable was at least 30%, so being able to optimize such code matters a lot for numerical computations.

@JeffBezanson
Member

Interesting; perhaps there is another LLVM pass we can add in -O mode.

@timholy
Member

timholy commented Feb 23, 2015

LLVM Polly comes to mind. One of its principal developers is interested in helping bring it to Julia, see https://groups.google.com/forum/?fromgroups=#!searchin/julia-dev/polly/julia-dev/lNPeqD6uZrQ/wv3TKzYutQsJ. It might just need someone to help him write a C version of popmeta! (on my plate, but there are so many other priorities...).

@jakebolewski
Member

@timholy Clang is able to do this without Polly, so there is some pass (or ordering of passes) in Clang that enables this optimization (it seems to kick in at -O2).

@ArchRobison
Contributor

Julia 0.3.5 does this optimization just fine. All you have to do is write the loop the right way 😄 Regrettably, this is a sign that we have a more widespread performance bug 😦.

Here is the LLVM output for the loop when written the right way. There are only two loads per iteration.

L:                                                ; preds = %L.preheader, %L
  %19 = phi double [ %26, %L ], [ %.pre, %L.preheader ], !dbg !188
  %i.0 = phi i64 [ %20, %L ], [ %8, %L.preheader ]
  %20 = add i64 %i.0, -1, !dbg !188
  %21 = getelementptr double* %14, i64 %20, !dbg !188
  %22 = load double* %21, align 8, !dbg !188, !tbaa %jtbaa_user
  %23 = getelementptr double* %18, i64 %i.0, !dbg !188
  %24 = load double* %23, align 8, !dbg !188, !tbaa %jtbaa_user
  %25 = fmul double %24, %19, !dbg !188
  %26 = fsub double %22, %25, !dbg !188
  store double %26, double* %21, align 8, !dbg !188, !tbaa %jtbaa_user
  %27 = icmp sgt i64 %20, 0, !dbg !191
  br i1 %27, label %L, label %L3, !dbg !191

Here is the code that generated it:

function unoptimized(c, x)
  n = size(x, 1)
  i = n-1
  @inbounds while i>0
    x[i] = (x[i] - c[i + 1] * x[i + 1])
    i -= 1
  end
end

code_llvm(unoptimized,(Vector{Float64},Vector{Float64}))

The construct in the original example that is causing the problem is the range n - 1 : -1 : 1. The constructor for it is not being inlined, and so when LLVM extracts the step, it treats it as a variable. Maybe there's an inliner tweak to fix this? Or should we look into implementing some trivial inter-procedural analysis to catch this sort of case? Constant-stride loops with stride!=1 seem worth optimizing.
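
For concreteness, the loop header is lowered to a constructor call, schematically:

r = colon(n - 1, -1, 1)   # builds the StepRange that `for i = n-1:-1:1` iterates over

When that call isn't inlined, the loop body only sees an opaque load of r.step, so LLVM treats the stride as a runtime variable rather than the constant -1.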

@JeffBezanson
Member

Fully generic StepRanges are unfortunately really complicated to handle. A lot of the logic is for handling overflow, so it can't be specialized away without knowing actual values. The quickest way would be some kind of UnsafeStepRange type, but of course that's pretty ugly.
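
For concreteness, a minimal sketch of what such a type might look like (the name and all methods are hypothetical, and it skips StepRange's overflow handling entirely):

immutable UnsafeStepRange
  start::Int
  step::Int
  stop::Int   # taken as given; no normalization or overflow checks
end
Base.start(r::UnsafeStepRange) = r.start
Base.next(r::UnsafeStepRange, i) = (i, i + r.step)
Base.done(r::UnsafeStepRange, i) = r.step > 0 ? i > r.stop : i < r.stop

Every method is trivial, so the constructor and the iteration protocol should inline completely, leaving LLVM a constant step in a loop like for i = UnsafeStepRange(n-1, -1, 1).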

@ArchRobison
Contributor

Question: is there a way to dump the LLVM code for the constructor StepRange?

After looking at the constructor for StepRange, I concur that complete inlining is not the answer. What about partial inlining? Both start and step just pass through. It's the computation of StepRange::stop that requires so much code. Maybe the computation of stop could be put in a separate function, so that the rest could be inlined?

I'm wondering about creating interprocedural analysis that could spot that step just passes through. Are there other examples of constructors where this optimization would be useful?

Allowing Julia-level rewrite rules, e.g. rewriting StepRange(a,b,c).step -> b, might also work, but that is obviously a major feature.

@andreasnoack
Member

Okay, here is the slightly more complicated code, to make sure that the discussion here covers the issue that started this. I don't really know what level of optimization to expect from Julia. The results below suggest that we can improve slightly, since Clang's -O3 is doing better than we are, but it also looks like the safest bet is to hand optimize rather than rely on compiler optimizations.

Here is the code

function lsolve1(a, b, c, d, l, x)
    @inbounds begin
        n = size(x,1)
        d[1] = a[1]
        x[1] = b[1]
        f = c[1]
        l[1] = f
        r = f / d[1]
        s1 = f * r
        x1 = x[1] * r
        for i = 2:n - 1
            l_ = c[i] / d[i - 1]
            d[i] = a[i] - c[i] * l_
            x[i] = b[i] - x[i - 1] * l_
            f *= -l_
            l[i] = f
            r = f / d[i]
            s1 += f * r
            x1 += x[i] * r
        end
        l_ = c[n] / d[n - 1]
        d[n] = a[n] - c[n] * l_
        x[n] = b[n] - x[n - 1] * l_
        fp = - r * c[n]
    end
    s1, x1, fp, d[n], x[n]
end

where all arguments are n-vectors except c, which has length n+1. This implementation can be hand optimized so that the loop body loads from memory only three times instead of eight. That is

function lsolve2(a, b, c, d, l, x)
    @inbounds begin
        n = size(x,1)
        di = a[1]
        d[1] = di
        xi = b[1]
        x[1] = xi
        f = c[1]
        l[1] = f
        r = f / di
        s1 = f * r
        x1 = x[1] * r
        for i = 2:n - 1
            ci = c[i]
            l_ = ci / di
            di = a[i] - ci * l_
            xi = b[i] - xi * l_
            f *= -l_
            l[i] = f
            r = f / di
            s1 += f * r
            x1 += xi * r
            x[i] = xi
            d[i] = di
        end
        l_ = c[n] / d[n - 1]
        d[n] = a[n] - c[n] * l_
        x[n] = b[n] - x[n - 1] * l_
        fp = - r * c[n]
    end
    s1, x1, fp, d[n], x[n]
end

and I get that this is approximately 15% faster. (The 30% I mentioned earlier wasn't a fair comparison, as it was against a version with in-place updates and therefore only three vectors to traverse.) The assembly from lsolve1 shows that only one of the memory loads is optimized away.

Again, I wrote C implementations to see what Clang could make of it. With -O1 Clang appears to eliminate a single load, and with -O3 two loads are eliminated, but that gives more than a 30% speedup (so maybe other things have also changed). However, the hand optimized C version of lsolve2 compiled with -O1 is still 20% faster than lsolve1 with -O3.

@ArchRobison In the Julia implementation, I tried to use a while loop similar to the one you used, but I couldn't get it to optimize more loads away. Maybe that is because this example loops forward and therefore uses a UnitRange rather than a StepRange.
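
Concretely, the variant I tried replaces the for header in lsolve1 with manual index bookkeeping, along these lines (a sketch; the loop body is unchanged):

i = 2
@inbounds while i <= n - 1
    l_ = c[i] / d[i - 1]
    d[i] = a[i] - c[i] * l_
    x[i] = b[i] - x[i - 1] * l_
    f *= -l_
    l[i] = f
    r = f / d[i]
    s1 += f * r
    x1 += x[i] * r
    i += 1
end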

Assembly of loop

Julia lsolve1 (seven memory loads)

L193:   lea R14, QWORD PTR [8*R10]
Source line: 75
    mov RAX, RDX
    sub RAX, R14
Source line: 81
    mov R15, QWORD PTR [RBP - 48]
    sub R15, R14
Source line: 74
    mov RDI, R11
    sub RDI, R14
    vmovsd  XMM5, QWORD PTR [RDI + 8]
    vdivsd  XMM3, XMM5, QWORD PTR [R15]
Source line: 75
    vmovsd  XMM6, QWORD PTR [RAX + 8]
Source line: 81
    dec R10
    mov R12, R8
    sub R12, R14
Source line: 78
    mov R9, RCX
    sub R9, R14
Source line: 76
    mov RAX, R13
    sub RAX, R14
Source line: 81
    cmp RBX, R10
Source line: 75
    vmulsd  XMM5, XMM3, XMM5
    vsubsd  XMM5, XMM6, XMM5
    vmovsd  QWORD PTR [R15 + 8], XMM5
Source line: 76
    vmulsd  XMM5, XMM3, QWORD PTR [R12]
    vmovsd  XMM6, QWORD PTR [RAX + 8]
    vsubsd  XMM5, XMM6, XMM5
    vmovsd  QWORD PTR [R12 + 8], XMM5
Source line: 77
    vxorpd  XMM3, XMM3, XMM4
    vmulsd  XMM2, XMM2, XMM3
Source line: 78
    vmovsd  QWORD PTR [R9 + 8], XMM2
Source line: 79
    vdivsd  XMM3, XMM2, QWORD PTR [R15 + 8]
Source line: 80
    vmulsd  XMM5, XMM2, XMM3
    vaddsd  XMM1, XMM1, XMM5
Source line: 81
    vmulsd  XMM5, XMM3, QWORD PTR [R12 + 8]
    vaddsd  XMM0, XMM0, XMM5
    jne L193

Clang lsolve1 with -O1 (seven memory loads)

LBB0_2:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: lsolve:n <- RDI
    ##DEBUG_VALUE: lsolve:a <- R13
    ##DEBUG_VALUE: lsolve:b <- R12
    ##DEBUG_VALUE: lsolve:c <- RSI
    ##DEBUG_VALUE: lsolve:d <- RBX
    ##DEBUG_VALUE: lsolve:l <- R14
    ##DEBUG_VALUE: lsolve:x <- R15
    ##DEBUG_VALUE: lsolve:i <- 1
    .loc    1 23 0                  ## eka1.c:23:0
    movsd   xmm2, qword ptr [rsi + 8*rdx + 8]
    movaps  xmm4, xmm2
    divsd   xmm4, qword ptr [rbx + 8*rdx]
Ltmp25:
    ##DEBUG_VALUE: lsolve:l_ <- XMM4
    .loc    1 24 0                  ## eka1.c:24:0
    movsd   xmm5, qword ptr [r13 + 8*rdx + 8]
    mulsd   xmm2, xmm4
    subsd   xmm5, xmm2
    movsd   qword ptr [rbx + 8*rdx + 8], xmm5
    .loc    1 25 0                  ## eka1.c:25:0
    movsd   xmm2, qword ptr [r12 + 8*rdx + 8]
    .loc    1 26 0                  ## eka1.c:26:0
    mulsd   xmm1, xmm4
    .loc    1 25 0                  ## eka1.c:25:0
    mulsd   xmm4, qword ptr [r15 + 8*rdx]
Ltmp26:
    subsd   xmm2, xmm4
    movsd   qword ptr [r15 + 8*rdx + 8], xmm2
    .loc    1 26 0                  ## eka1.c:26:0
    xorpd   xmm1, xmm3
Ltmp27:
    ##DEBUG_VALUE: lsolve:f <- XMM1
    .loc    1 27 0                  ## eka1.c:27:0
    movsd   qword ptr [r14 + 8*rdx + 8], xmm1
    .loc    1 28 0                  ## eka1.c:28:0
    movapd  xmm2, xmm1
    divsd   xmm2, qword ptr [rbx + 8*rdx + 8]
Ltmp28:
    ##DEBUG_VALUE: lsolve:r <- XMM2
    .loc    1 29 0                  ## eka1.c:29:0
    pshufd  xmm4, xmm2, 68          ## xmm4 = xmm2[0,1,0,1]
    movapd  xmm5, xmm1
    movhpd  xmm5, qword ptr [r15 + 8*rdx + 8]
    mulpd   xmm5, xmm4
    addpd   xmm0, xmm5
Ltmp29:
    .loc    1 21 0                  ## eka1.c:21:0
    inc rdx
    cmp rcx, rdx
    jne LBB0_2

Clang lsolve1 with -O3 (six memory loads)

LBB0_2:                                 ## %.lr.ph
                                        ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: lsolve:n <- RDI
    ##DEBUG_VALUE: lsolve:a <- R13
    ##DEBUG_VALUE: lsolve:b <- R12
    ##DEBUG_VALUE: lsolve:c <- RSI
    ##DEBUG_VALUE: lsolve:d <- RBX
    ##DEBUG_VALUE: lsolve:l <- R14
    ##DEBUG_VALUE: lsolve:x <- R15
    ##DEBUG_VALUE: lsolve:i <- 1
    movsd   xmm2, qword ptr [rsi + 8*rdx + 8]
    movaps  xmm5, xmm2
    divsd   xmm5, xmm4
Ltmp25:
    ##DEBUG_VALUE: lsolve:l_ <- XMM5
    .loc    1 24 0                  ## eka1.c:24:0
    movsd   xmm4, qword ptr [r13 + 8*rdx + 8]
    mulsd   xmm2, xmm5
    subsd   xmm4, xmm2
    movsd   qword ptr [rbx + 8*rdx + 8], xmm4
    .loc    1 25 0                  ## eka1.c:25:0
    movsd   xmm2, qword ptr [r12 + 8*rdx + 8]
    .loc    1 26 0                  ## eka1.c:26:0
    mulsd   xmm1, xmm5
    .loc    1 25 0                  ## eka1.c:25:0
    mulsd   xmm5, qword ptr [r15 + 8*rdx]
Ltmp26:
    subsd   xmm2, xmm5
    movsd   qword ptr [r15 + 8*rdx + 8], xmm2
    .loc    1 26 0                  ## eka1.c:26:0
    xorpd   xmm1, xmm3
Ltmp27:
    ##DEBUG_VALUE: lsolve:f <- XMM1
    .loc    1 27 0                  ## eka1.c:27:0
    movsd   qword ptr [r14 + 8*rdx + 8], xmm1
    .loc    1 28 0                  ## eka1.c:28:0
    movsd   xmm4, qword ptr [rbx + 8*rdx + 8]
    movapd  xmm2, xmm1
    divsd   xmm2, xmm4
Ltmp28:
    ##DEBUG_VALUE: lsolve:r <- XMM2
    .loc    1 29 0                  ## eka1.c:29:0
    pshufd  xmm5, xmm2, 68          ## xmm5 = xmm2[0,1,0,1]
    movapd  xmm6, xmm1
    movhpd  xmm6, qword ptr [r15 + 8*rdx + 8]
    mulpd   xmm6, xmm5
    addpd   xmm0, xmm6
Ltmp29:
    .loc    1 21 0                  ## eka1.c:21:0
    inc rdx
    cmp rcx, rdx
    jne LBB0_2

Clang of lsolve2 with -O1 (three memory loads)

LBB0_2:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: lsolve:n <- RSI
    ##DEBUG_VALUE: lsolve:a <- RBX
    ##DEBUG_VALUE: lsolve:b <- R12
    ##DEBUG_VALUE: lsolve:c <- R15
    ##DEBUG_VALUE: lsolve:d <- R13
    ##DEBUG_VALUE: lsolve:l <- R14
    ##DEBUG_VALUE: lsolve:x <- RDI
    ##DEBUG_VALUE: lsolve:i <- 1
    .loc    1 25 0                  ## eka2.c:25:0
    movsd   xmm3, qword ptr [r15 + 8*rdx + 8]
Ltmp29:
    ##DEBUG_VALUE: lsolve:ci <- XMM3
    .loc    1 26 0                  ## eka2.c:26:0
    movaps  xmm5, xmm3
    divsd   xmm5, xmm1
Ltmp30:
    ##DEBUG_VALUE: lsolve:l_ <- XMM5
    .loc    1 27 0                  ## eka2.c:27:0
    movsd   xmm1, qword ptr [rbx + 8*rdx + 8]
    mulsd   xmm3, xmm5
Ltmp31:
    subsd   xmm1, xmm3
Ltmp32:
    ##DEBUG_VALUE: lsolve:di <- XMM1
    .loc    1 29 0                  ## eka2.c:29:0
    shufpd  xmm5, xmm5, 0           ## xmm5 = xmm5[0,0]
Ltmp33:
    mulpd   xmm5, xmm2
    movapd  xmm2, xmm4
    movhpd  xmm2, qword ptr [r12 + 8*rdx + 8]
    subpd   xmm2, xmm5
    .loc    1 30 0                  ## eka2.c:30:0
    movlpd  qword ptr [r14 + 8*rdx + 8], xmm2
    .loc    1 31 0                  ## eka2.c:31:0
    movapd  xmm3, xmm2
    divsd   xmm3, xmm1
Ltmp34:
    ##DEBUG_VALUE: lsolve:r <- XMM3
    .loc    1 32 0                  ## eka2.c:32:0
    pshufd  xmm5, xmm3, 68          ## xmm5 = xmm3[0,1,0,1]
    mulpd   xmm5, xmm2
    addpd   xmm0, xmm5
    .loc    1 34 0                  ## eka2.c:34:0
    movsd   qword ptr [r13 + 8*rdx + 8], xmm1
    .loc    1 35 0                  ## eka2.c:35:0
    movhpd  qword ptr [rdi + 8*rdx + 8], xmm2
Ltmp35:
    .loc    1 23 0                  ## eka2.c:23:0
    inc rdx
    cmp rcx, rdx
    jne LBB0_2

@simonster
Member

@andreasnoack's example looks like pointer aliasing. There are no stores between the loads of c[i] or between the load of d[i-1] late in the loop and the load of d[i] at the top, so I believe those are the only loads that can be removed. It looks like both Julia and clang remove the former, but only clang -O3 removes the latter. You could try using the restrict keyword and see if clang can optimize away the rest of the loads.
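
A tiny illustration of the hazard (a hypothetical function, not from the thread): when the destination and source are the same array, each store feeds the next load, so a value cached in a register from the previous iteration would be stale.

function scale_shift!(dst, src)
    @inbounds for i = 2:length(dst)
        dst[i] = 2 * src[i - 1]
    end
    dst
end

v = [1.0, 2.0, 3.0, 4.0]
scale_shift!(v, copy(v))   # disjoint arrays: v becomes [1.0, 2.0, 4.0, 6.0]
w = [1.0, 2.0, 3.0, 4.0]
scale_shift!(w, w)         # aliased arrays:  w becomes [1.0, 2.0, 4.0, 8.0]

Without proof that the arrays are disjoint, the compiler must reload src[i - 1] on every iteration, just as it must reload d[i] at the top of the loop in lsolve1.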

It also looks like we're doing more work to compute memory addresses than clang (all the movs and subs, one of which appears to contribute an additional load because we spilled to the stack). That may be worth investigating.

@andreasnoack
Member

@simonster Thanks, and yes, you are right about the aliasing. With restrict and -O3, Clang is able to optimize down to three loads. That is actually more restrictive than necessary, because the algorithm is valid even when x=b, d=a, and l=c, but that is hard to annotate in the program.

Even with the extra loads optimized away, the hand optimized version compiled with -O1 is almost 20% faster than lsolve1 with restrict and -O3. I can't say why.

Clang lsolve1 with restrict and -O3

LBB0_2:                                 ## %.lr.ph
                                        ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: lsolve:n <- RSI
    ##DEBUG_VALUE: lsolve:a <- R13
    ##DEBUG_VALUE: lsolve:b <- R15
    ##DEBUG_VALUE: lsolve:c <- RBX
    ##DEBUG_VALUE: lsolve:d <- R12
    ##DEBUG_VALUE: lsolve:l <- R14
    ##DEBUG_VALUE: lsolve:x <- RDI
    ##DEBUG_VALUE: lsolve:i <- 1
    movsd   xmm6, qword ptr [rbx + 8*rdx + 8]
    movaps  xmm5, xmm6
    divsd   xmm5, xmm1
Ltmp25:
    ##DEBUG_VALUE: lsolve:l_ <- XMM5
    .loc    1 24 0                  ## eka1.c:24:0
    movsd   xmm1, qword ptr [r13 + 8*rdx + 8]
    mulsd   xmm6, xmm5
    subsd   xmm1, xmm6
    movsd   qword ptr [r12 + 8*rdx + 8], xmm1
    .loc    1 25 0                  ## eka1.c:25:0
    movsd   xmm6, qword ptr [r15 + 8*rdx + 8]
    mulsd   xmm4, xmm5
    subsd   xmm6, xmm4
    movsd   qword ptr [rdi + 8*rdx + 8], xmm6
    .loc    1 26 0                  ## eka1.c:26:0
    mulsd   xmm5, xmm3
Ltmp26:
    xorpd   xmm5, xmm2
Ltmp27:
    ##DEBUG_VALUE: lsolve:f <- XMM5
    .loc    1 27 0                  ## eka1.c:27:0
    movsd   qword ptr [r14 + 8*rdx + 8], xmm5
    .loc    1 29 0                  ## eka1.c:29:0
    movapd  xmm4, xmm5
    movapd  xmm3, xmm5
Ltmp28:
    ##DEBUG_VALUE: lsolve:f <- XMM3
    .loc    1 28 0                  ## eka1.c:28:0
    divsd   xmm5, xmm1
Ltmp29:
    ##DEBUG_VALUE: lsolve:r <- XMM5
    .loc    1 29 0                  ## eka1.c:29:0
    unpcklpd    xmm4, xmm6      ## xmm4 = xmm4[0],xmm6[0]
    pshufd  xmm7, xmm5, 68          ## xmm7 = xmm5[0,1,0,1]
    mulpd   xmm7, xmm4
    addpd   xmm0, xmm7
Ltmp30:
    .loc    1 21 0                  ## eka1.c:21:0
    inc rdx
    cmp rcx, rdx
    .loc    1 23 0                  ## eka1.c:23:0
Ltmp31:
    movapd  xmm4, xmm6
Ltmp32:
    .loc    1 21 0                  ## eka1.c:21:0
    jne LBB0_2

@ArchRobison
Contributor

Seems like we need the equivalent of restrict. For a dynamic language, a run-time assertion that says "these two arrays are disjoint" seems like the right thing.

@JeffBezanson
Member

@ArchRobison that's a good observation about the StepRange constructor. I'll do that refactoring.

@StefanKarpinski
Member

@ArchRobison – we can currently only have arrays that are identical or disjoint, which may make things easier if we can figure out a way to take advantage of it. One option would be to generate code for both cases and branch on a test at the top. Not sure what the best way to do that would be.
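
At the Julia level, that branch can be written by hand today; the proposal is for codegen to do it automatically. A sketch using the functions from the top post (all names hypothetical):

function sweep!(c, x)
    if c === x
        sweep_generic!(c, x)   # identical arrays: every store may clobber c
    else
        sweep_noalias!(c, x)   # disjoint arrays: free to cache x[i+1] in a register
    end
end

function sweep_noalias!(c, x)
    n = size(x, 1)
    x_ = x[n]
    @inbounds for i = n-1:-1:1
        x_ = x[i] = x[i] - c[i + 1] * x_
    end
end

function sweep_generic!(c, x)
    n = size(x, 1)
    @inbounds for i = n-1:-1:1
        x[i] = x[i] - c[i + 1] * x[i + 1]
    end
end

Compilers sometimes do the analogous thing themselves, emitting two versions of a loop guarded by a runtime overlap test.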

@JeffBezanson
Member

With the StepRange constructor changed to make it inlineable, plus @inbounds, the example at the top gets optimized.

@simonster
Member

@andreasnoack It may be worth benchmarking against julia with LLVM 3.6. It looks like LLVM 3.6 can optimize away the extraneous index computation instructions, and with julia -O it can also partially vectorize some operations (which clang is already doing).

@andreasnoack
Member

@simonster Thanks for reporting this. I've just upgraded to LLVM 3.6, but @code_native seems broken. I get

julia> @code_native lsolve1(randn(2), randn(2), randn(2), randn(2), randn(2), randn(2))
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
    .section    __TEXT,__text,regular,pure_instructions
    push    rbp
    push    rbp
    push    rbp
    push    rbp
    push    rbp

and it continues. This is on Mac. Are you using Linux?

@simonster
Member

code_native works for me on Mac, but didn't before 1d0b067.

@andreasnoack
Member

Great. Now it works, but the assembly syntax has changed. We used to load from right to left, but now it seems to be left to right.

I can still count seven memory loads in the assembly.

@simonster
Member

With julia -O (which enables the SLP vectorizer) I get 6 loads and the code looks very similar to the clang -O3 code without restrict. Beating clang with the same algorithm would be awesome, but at present we can really only hope to match it.

@andreasnoack
Member

Great. I have just tried with the latest master, LLVM 3.6, and julia -O, and Julia is indeed able to optimize the loads similarly to clang -O3 now.

@tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label Jan 10, 2016
jrevels added a commit to JuliaCI/BaseBenchmarks.jl that referenced this issue Jan 25, 2016
jrevels added a commit to JuliaCI/BaseBenchmarks.jl that referenced this issue Jan 25, 2016
@jrevels
Member

jrevels commented Jan 25, 2016

I've just added some benchmarks based on this issue to BaseBenchmarks.jl. It seems that #14623 does not solve the original issue here (though, FWIW, I'm not sure what level of optimization we should expect to achieve before closing this issue).

With x = rand(10^8) and c = rand(10^8) (fresh arrays were used each time):

# with -O flag
@time perf_rev_load_slow!(x);       # 0.205904 seconds (4 allocations: 160 bytes)
@time perf_rev_load_fast!(x);       # 0.089029 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_slow!(x, c); # 0.369422 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_fast!(x, c); # 0.180176 seconds (4 allocations: 160 bytes)
# without -O flag
@time perf_rev_load_slow!(x);       # 0.210275 seconds (4 allocations: 160 bytes)
@time perf_rev_load_fast!(x);       # 0.109113 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_slow!(x, c); # 0.380331 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_fast!(x, c); # 0.193872 seconds (4 allocations: 160 bytes)
Julia Version 0.5.0-dev+2237
Commit b9815f3 (2016-01-21 15:32 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1

@jrevels removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Jan 25, 2016
@KristofferC
Copy link
Member

Adding @inbounds to the loop, we now have

julia> @btime optimized(c, x)
  25.954 ms (0 allocations: 0 bytes)

julia> @btime unoptimized(c, x)
  26.116 ms (0 allocations: 0 bytes)
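
(@btime is from the BenchmarkTools.jl package.) For reference, the variant being timed is along these lines (a sketch; definitions otherwise as in the top post):

function unoptimized(c, x)
  n = size(x, 1)
  @inbounds for i = n-1:-1:1
    x[i] = x[i] - c[i + 1] * x[i + 1]
  end
end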

I believe we can close this issue.
