Performance problem with memory loads/stores #10301
Acknowledged. Your guess about the generated code is probably right. This seems like a very difficult optimization to obtain. @ArchRobison, any idea how realistic this is?
We realized this while working on a slightly more complicated problem. In a C implementation of the more complicated problem compiled with Clang, the extra load from memory was optimized away whenever we used an optimization level greater than one. Therefore, we saw the same difference as @epn mentioned above whenever we compiled with -O1. Unfortunately, Clang optimizes the examples in this issue slightly differently from the more complicated problem, and Clang is actually not able to remove the extra load from memory here. Instead the function is inlined and the speed gain is the same. In contrast, here is the Julia assembly for the inner loops:

(Assembly snippet collapsed in the original.)
Interesting; perhaps there is another LLVM pass we can add in …
LLVM Polly comes to mind. One of its principal developers is interested in helping bring it to Julia; see https://groups.google.com/forum/?fromgroups=#!searchin/julia-dev/polly/julia-dev/lNPeqD6uZrQ/wv3TKzYutQsJ. It might just need someone to help him write a C version of …
@timholy Clang is able to do this without Polly, so Clang must have some pass (or ordering of passes) that we lack which enables this optimization (it seems to kick in at -O2).
Julia 0.3.5 does this optimization just fine. All you have to do is write the loop the right way 😄. Regrettably, this is a sign that we have a more widespread performance bug 😦. Here is the LLVM output for the loop, if written the right way; there are only 2 loads per iteration:

(LLVM IR snippet collapsed in the original.)

Here is the code that generated it:
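The Julia snippet itself was lost in extraction. As a hedged guess at what "the right way" looks like (the next comment blames the stepped range in the loop header), here is the reverse traversal rewritten over a plain unit range; the function name and body are assumptions based on the issue description at the bottom of this thread:

```julia
# Hypothetical reconstruction: same computation as the issue's "unoptimized"
# loop, but the header uses a unit range instead of an n-1:-1:1 StepRange,
# which lets LLVM forward the previous iteration's store.
function rev_update!(x)
    n = length(x)
    @inbounds for k = 1:n-1
        i = n - k              # descending index computed by hand
        x[i] = x[i] + x[i+1]
    end
    return x
end
```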
The construct in the original example that is causing the problem is the range in the loop header (a StepRange).
Fully generic StepRanges are unfortunately really complicated to handle. A lot of the logic is for handling overflow, so it can't be specialized away without knowing actual values. The quickest way would be some kind of …
Question: is there a way to dump the LLVM code for the constructor? After looking at the constructor for StepRange, I'm wondering about creating an interprocedural analysis that could spot that … Allowing Julia-level rewrite rules, e.g. …, is another possibility.
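On the question of dumping the constructor's LLVM code, the @code_llvm macro can do this; a minimal sketch (module availability varies by Julia version):

```julia
# Dump the LLVM IR for the StepRange constructor call behind 100:-1:1.
# On modern Julia the macro lives in the InteractiveUtils stdlib; on the
# 0.4-era builds discussed in this thread it was available from Base.
using InteractiveUtils
@code_llvm StepRange(100, -1, 1)
```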
Okay, here is the slightly more complicated code, to make sure that the discussion here covers the issue that started this. I don't really know what level of optimization to expect from Julia. The results below might suggest that we can improve slightly, because Clang's … Here is the code:

```julia
function lsolve1(a, b, c, d, l, x)
    @inbounds begin
        n = size(x, 1)
        d[1] = a[1]
        x[1] = b[1]
        f = c[1]
        l[1] = f
        r = f / d[1]
        s1 = f * r
        x1 = x[1] * r
        for i = 2:n - 1
            l_ = c[i] / d[i - 1]
            d[i] = a[i] - c[i] * l_
            x[i] = b[i] - x[i - 1] * l_
            f *= -l_
            l[i] = f
            r = f / d[i]
            s1 += f * r
            x1 += x[i] * r
        end
        l_ = c[n] / d[n - 1]
        d[n] = a[n] - c[n] * l_
        x[n] = b[n] - x[n - 1] * l_
        fp = -r * c[n]
    end
    s1, x1, fp, d[n], x[n]
end
```

where all arguments are `Vector{Float64}`, and

```julia
function lsolve2(a, b, c, d, l, x)
    @inbounds begin
        n = size(x, 1)
        di = a[1]
        d[1] = di
        xi = b[1]
        x[1] = xi
        f = c[1]
        l[1] = f
        r = f / di
        s1 = f * r
        x1 = x[1] * r
        for i = 2:n - 1
            ci = c[i]
            l_ = ci / di
            di = a[i] - ci * l_
            xi = b[i] - xi * l_
            f *= -l_
            l[i] = f
            r = f / di
            s1 += f * r
            x1 += xi * r
            x[i] = xi
            d[i] = di
        end
        l_ = c[n] / d[n - 1]
        d[n] = a[n] - c[n] * l_
        x[n] = b[n] - x[n - 1] * l_
        fp = -r * c[n]
    end
    s1, x1, fp, d[n], x[n]
end
```

and I get that this is approximately 15 pct. faster. (The 30 pct. I mentioned earlier wasn't fair, as I compared to a version with in-place updates and therefore only three vectors to traverse through.) The assembly from … Again, I wrote C implementations to see what Clang could make out of it. With … @ArchRobison, in the Julia implementation I tried to use a …

(Assembly of loop, Julia; snippet collapsed in the original.)
@andreasnoack's example looks like pointer aliasing. There are no stores between the loads of … It also looks like we're doing more work to compute memory addresses than Clang (all the …).
@simonster Thanks, and yes, you are right about the aliasing. With … Even with the extra loads optimized away, the hand-optimized version compiled with Clang …
Seems like we need the equivalent of restrict. For a dynamic language, a run-time assertion that says "these two arrays are disjoint" seems like the right thing.
@ArchRobison that's a good observation about the StepRange constructor. I'll do that refactoring.
@ArchRobison – we can currently only have arrays that are identical or disjoint, which may make things easier if we can figure out a way to take advantage of it. One option would be to generate code for both cases and branch on a test at the top. Not sure what the best way to do that would be.
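To illustrate the branch-on-a-test idea (a minimal sketch, not anything in Base; the function name is made up): because two Julia Arrays are either the same object or fully disjoint, a simple identity check can select a no-aliasing fast path.

```julia
# Sketch: dispatch on array identity so the fast path may assume no aliasing.
function scale_add!(y::Vector{Float64}, x::Vector{Float64}, a::Float64)
    if y === x
        # Aliased case: y and x are the same array.
        @inbounds for i in eachindex(y)
            y[i] = (1 + a) * y[i]
        end
    else
        # Disjoint case: stores to y cannot clobber x, so the compiler is
        # free to keep values of x in registers across iterations.
        @inbounds for i in eachindex(x, y)
            y[i] += a * x[i]
        end
    end
    return y
end
```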
With the StepRange constructor changed to make it inlineable, plus …
@andreasnoack It may be worth benchmarking against Julia with LLVM 3.6. It looks like LLVM 3.6 can optimize away the extraneous index computation instructions, and with …
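As a side note for anyone reproducing these comparisons: the LLVM version a given Julia binary links against is shown by versioninfo(), as in the report later in this thread.

```julia
# Prints build information, including a line such as "LLVM: libLLVM-3.7.1".
versioninfo()
```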
@simonster Thanks for reporting this. I've just upgraded to LLVM 3.6, but

```julia
julia> @code_native lsolve1(randn(2), randn(2), randn(2), randn(2), randn(2), randn(2))
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
error: failed to compute relocation: X86_64_RELOC_UNSIGNED
	.section	__TEXT,__text,regular,pure_instructions
	push	rbp
	push	rbp
	push	rbp
	push	rbp
	push	rbp
```

and it continues. This is on a Mac. Are you using Linux?
Great. Now it works, but the assembly syntax has changed: we used to load from right to left, but now it seems to be left to right. I can still count seven memory loads in the assembly.
With …
Great. I have just tried with latest master, LLVM 3.6, and …
I've just added some benchmarks based on this issue to BaseBenchmarks.jl. It seems that #14623 does not solve the original issue here (though, FWIW, I'm not sure what level of optimization we should expect to achieve before closing this issue). With …:

```julia
# with -O flag
@time perf_rev_load_slow!(x);       # 0.205904 seconds (4 allocations: 160 bytes)
@time perf_rev_load_fast!(x);       # 0.089029 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_slow!(x, c); # 0.369422 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_fast!(x, c); # 0.180176 seconds (4 allocations: 160 bytes)

# without -O flag
@time perf_rev_load_slow!(x);       # 0.210275 seconds (4 allocations: 160 bytes)
@time perf_rev_load_fast!(x);       # 0.109113 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_slow!(x, c); # 0.380331 seconds (4 allocations: 160 bytes)
@time perf_rev_loadmul_fast!(x, c); # 0.193872 seconds (4 allocations: 160 bytes)
```

```
Julia Version 0.5.0-dev+2237
Commit b9815f3 (2016-01-21 15:32 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1
```
Adding …

I believe that can close the issue.
Julia has the following performance problem with array accesses. Consider the following two code snippets (a reconstruction is sketched below). Both functions do the same computation. There is a loop-carried dependency in this code, that is, x[i] depends on the value of x[i+1]. While "optimized" forces reading x[i+1] from the local variable x_, "unoptimized" simply accesses x[i+1].
For n = 100 million and 100 runs of each function on my MacBook Air, I get the following runtimes (in seconds):

```
optimized:   min 0.307929443  max 0.448113027  mean 0.31504216288000003  variance 0.00036038772082227653
unoptimized: min 0.508484855  max 0.674296101  mean 0.54576100111        variance 0.001543397115546402
```

Similar runtimes were obtained using julia -O. I haven't looked at the assembly code, but it seems that the unoptimized version loads x[i+1] from memory even though x[i+1] was just computed in the previous iteration.