
Propagate array dimensions from allocations #43487

Conversation

@pchintalapudi (Member) commented Dec 20, 2021

2D and 3D array dimension lengths cannot change, so loads of the array length and of individual dimension sizes can be replaced with the values recorded when the array was allocated. A 1D array's length can change (e.g. via push! or resize!), but escape analysis can determine when it does not.
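A minimal sketch of that invariant in plain Julia (the variable names are just for illustration):

```julia
# Dimensions of 2-D/3-D arrays are fixed at allocation, so size() and
# length() can be folded to the allocation arguments. Only Vector (1-D)
# supports in-place resizing.
v = zeros(Int, 3)          # 1-D: length may change later
push!(v, 4)                # legal; length(v) is now 4

m = zeros(Int, 2, 3)       # 2-D: size is fixed at (2, 3) for its lifetime
# push!(m, 5)              # would throw a MethodError: resizing is only
                           # defined for Vector

@assert length(v) == 4
@assert size(m) == (2, 3)
```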

Currently, the mechanism for identifying array allocations does not work on Windows: a trampoline function inserted there prevents identification by function-pointer equivalence from working as intended (this data was collected from #43107). Fixing this for Windows would require overhauling how we handle ccalls during codegen, which is out of scope for this PR.

Identification of arrays is now done through metadata set during codegen, which should theoretically get around the Windows issue.

Depends on #43057, which creates the llvm-alloc-helpers.cpp file.

Performance benchmark:

Function:

function f()
    s = 0
    a = zeros(Int, 100, 100, 100)
    for i in eachindex(a)
        s += a[i]
    end
    s
end

Master:
Godbolt: https://godbolt.org/z/cj7h3e3Gr
Benchmark:

julia> @benchmark f()
BenchmarkTools.Trial: 3179 samples with 1 evaluation.
 Range (min … max):  917.762 μs …  13.191 ms  ┊ GC (min … max):  0.00% … 92.57%
 Time  (median):       1.413 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.566 ms ± 802.684 μs  ┊ GC (mean ± σ):  10.83% ± 15.99%

  ▅▆█▄        ▂▇▆▃                  ▂▂▃▂  ▃▂    ▁▁  ▂   ▁▃▂   ▂ ▁
  ████▅▅▇█▄▁▁▁████▅▄▆▆▅▁▁▃▄▄▃▁▁▁▁▁▃█████▇███▇▄▄███▇███▅▆███▇▄██ █
  918 μs        Histogram: log(frequency) by time       3.24 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

PR:
Godbolt: https://godbolt.org/z/zKWz8dxev
Benchmark:

julia> @benchmark f()
BenchmarkTools.Trial: 6350 samples with 1 evaluation.
 Range (min … max):  687.484 μs …   4.911 ms  ┊ GC (min … max): 0.00% …  0.00%
 Time  (median):     702.864 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   782.345 μs ± 188.419 μs  ┊ GC (mean ± σ):  9.65% ± 14.78%

   █
  ▅█▇▃▄▂▃▃▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▂▂▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▄▃▃▂▂▂ ▂
  687 μs           Histogram: frequency by time         1.17 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

@pchintalapudi (Member, Author)

Since my branches are part of a fork, I've created a PR comparing this one and the allocation hoisting PR immediately before it here: pchintalapudi#4

@pchintalapudi added the arrays, compiler:codegen (Generation of LLVM IR and native code), and performance (Must go faster) labels Dec 29, 2021
@pchintalapudi pchintalapudi force-pushed the pc/multi-d-array-propagation branch 2 times, most recently from a351c5e to 6e3b743 Compare January 7, 2022 06:45
@pchintalapudi (Member, Author)

Now that #43057 has been merged, this PR is ready to be reviewed for rebasing onto the master branch.

@pchintalapudi (Member, Author)

Compile time benchmarks:

Master:

Base  ───────────── 23.796904 seconds
ArgTools  ─────────  4.456852 seconds
Artifacts  ────────  0.102586 seconds
Base64  ───────────  0.105645 seconds
CRC32c  ───────────  0.008005 seconds
FileWatching  ─────  0.108983 seconds
Libdl  ────────────  0.001518 seconds
Logging  ──────────  0.036208 seconds
Mmap  ─────────────  0.085615 seconds
NetworkOptions  ───  0.117357 seconds
SHA  ──────────────  0.181987 seconds
Serialization  ────  0.270087 seconds
Sockets  ──────────  0.321430 seconds
Unicode  ──────────  0.077732 seconds
DelimitedFiles  ───  0.104025 seconds
LinearAlgebra  ────  7.762580 seconds
Markdown  ─────────  0.754207 seconds
Printf  ───────────  0.168311 seconds
Random  ───────────  1.242392 seconds
Tar  ──────────────  0.282602 seconds
Dates  ────────────  1.536529 seconds
Distributed  ──────  0.764613 seconds
Future  ───────────  0.005194 seconds
InteractiveUtils  ─  0.500062 seconds
LibGit2  ──────────  1.366151 seconds
Profile  ──────────  0.257939 seconds
SparseArrays  ─────  2.678179 seconds
UUIDs  ────────────  0.015961 seconds
REPL  ─────────────  3.456523 seconds
SharedArrays  ─────  0.500804 seconds
Statistics  ───────  0.188637 seconds
SuiteSparse  ──────  1.582334 seconds
TOML  ─────────────  0.067595 seconds
Test  ─────────────  0.315075 seconds
LibCURL  ──────────  0.438543 seconds
Downloads  ────────  0.335553 seconds
Pkg  ──────────────  4.502768 seconds
LazyArtifacts  ────  0.001633 seconds
Stdlibs total  ──── 34.712985 seconds
Sysimage built. Summary:
Total ───────  58.511190 seconds
Base: ───────  23.796904 seconds 40.6707%
Stdlibs: ────  34.712985 seconds 59.3271%
    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 36/36
Executing precompile statements... 1308/1339
Precompilation complete. Summary:
Total ─────── 109.996850 seconds
Generation ──  82.400907 seconds 74.9121%
Execution ───  27.595943 seconds 25.0879%

PR:

Base  ───────────── 23.928553 seconds
ArgTools  ─────────  4.422442 seconds
Artifacts  ────────  0.103806 seconds
Base64  ───────────  0.106303 seconds
CRC32c  ───────────  0.008324 seconds
FileWatching  ─────  0.110224 seconds
Libdl  ────────────  0.001559 seconds
Logging  ──────────  0.036711 seconds
Mmap  ─────────────  0.086969 seconds
NetworkOptions  ───  0.117836 seconds
SHA  ──────────────  0.184373 seconds
Serialization  ────  0.271228 seconds
Sockets  ──────────  0.322758 seconds
Unicode  ──────────  0.078247 seconds
DelimitedFiles  ───  0.105199 seconds
LinearAlgebra  ────  7.820788 seconds
Markdown  ─────────  0.758388 seconds
Printf  ───────────  0.151605 seconds
Random  ───────────  1.259426 seconds
Tar  ──────────────  0.285620 seconds
Dates  ────────────  1.510449 seconds
Distributed  ──────  0.776262 seconds
Future  ───────────  0.005317 seconds
InteractiveUtils  ─  0.503558 seconds
LibGit2  ──────────  1.377215 seconds
Profile  ──────────  0.258898 seconds
SparseArrays  ─────  2.681444 seconds
UUIDs  ────────────  0.016312 seconds
REPL  ─────────────  3.456168 seconds
SharedArrays  ─────  0.501174 seconds
Statistics  ───────  0.191785 seconds
SuiteSparse  ──────  1.591476 seconds
TOML  ─────────────  0.068235 seconds
Test  ─────────────  0.320033 seconds
LibCURL  ──────────  0.441098 seconds
Downloads  ────────  0.336550 seconds
Pkg  ──────────────  4.571783 seconds
LazyArtifacts  ────  0.001639 seconds
Stdlibs total  ──── 34.851874 seconds
Sysimage built. Summary:
Total ───────  58.781741 seconds
Base: ───────  23.928553 seconds 40.7075%
Stdlibs: ────  34.851874 seconds 59.2903%
    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 36/36
Executing precompile statements... 1308/1339
Precompilation complete. Summary:
Total ─────── 109.750784 seconds
Generation ──  81.986599 seconds 74.7025%
Execution ───  27.764185 seconds 25.2975%

@pchintalapudi (Member, Author)

Other notable examples of optimization improvement:


Example 2: doc/src/manual/performance-tips.md:952-963 loopinc

Master:

Godbolt: https://godbolt.org/z/oT9Wa4784

PR:

Godbolt: https://godbolt.org/z/r1or75rzK

Example 3: doc/src/manual/performance-tips.md:969-984 loopinc_prealloc

Master:

Godbolt: https://godbolt.org/z/7Ed4xd3o9

PR:

Godbolt: https://godbolt.org/z/G6adnc69f


Both examples show complete removal of the bounds check due to length propagation; LLVM then recognizes that all of the loads from the array can be forwarded and eliminates them. When combined with #43548 (which can remove arrays that are never loaded from), and provided xinc/xinc! is not marked noinline, LLVM will either vectorize loopinc (a massive speedup) or outright precompute the sum in loopinc_prealloc (another O(n) -> O(1) transformation).
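For context, a sketch of the loopinc pattern from the performance-tips section (paraphrased from memory, not the manual's exact code): xinc allocates a fresh small array on every call, and once the array's length is known from the allocation, the bounds check on ret[2] can be dropped and the load forwarded.

```julia
function xinc(x)
    return [x, x + 1, x + 2]   # fresh 3-element allocation on every call
end

function loopinc()
    y = 0
    for i in 1:10^7
        ret = xinc(i)          # length(ret) == 3 is known from the allocation,
        y += ret[2]            # so this bounds check can be eliminated
    end
    return y
end
```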

@vchuravy (Member) commented Jan 8, 2022

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vchuravy (Member)

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vchuravy (Member)

CI failed with:

julia: /buildworker/worker/package_linuxaarch64/build/src/llvm-array-opt.cpp:253: bool {anonymous}::Optimizer::optimizeCmpInsts(): Assertion `inst->getOperand(!idx) == inst && "Expected instruction's operands to include dim!"' failed.
signal (6): Aborted
in expression starting at none:0
gsignal at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
Allocations: 173605932 (Pool: 173531052; Big: 74880); GC: 143
Aborted (core dumped)

@vchuravy (Member)

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@pchintalapudi pchintalapudi force-pushed the pc/multi-d-array-propagation branch from 4a18af9 to cc79187 Compare January 14, 2022 01:31
@vchuravy (Member)

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@vchuravy (Member)

@nanosoldier runtests(ALL, vs = ":master")

@vchuravy (Member)

If nanosoldier comes back clean I am going to merge this.

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vtjnash (Member) commented Jan 14, 2022

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@vtjnash (Member) commented Jan 14, 2022

I am not certain if we want to move forward with tricks like this. The goal for Array was to replace it with something simpler implemented in Julia, which would hopefully allow the compiler to "see" this, without needing to make it a special case again in codegen (we had some code like this many years ago).

@vchuravy (Member)

> The goal for Array was to replace it with something simpler implemented in Julia, which would hopefully allow the compiler to "see" this

I think both Prem and I agree with that notion. But it is also the case that the move from Array to Buffer hasn't happened yet (and we are adding things like ImmutableArrays), so this may be a case of the perfect being the enemy of the good.

@pchintalapudi (Member, Author)

I think there's another point to consider, which is that if Array is implemented in Julia as something like:

struct Array
    ptr::Buffer
    dims::Tuple
end

we cannot stack-allocate the internal buffer of the array even with interprocedural escape analysis. The reasoning behind that is linked to the possible reshape method below:

function reshape(a::Array, dims::Tuple)
    # the result shares a.ptr, so the buffer outlives `a`
    return Array(a.ptr, dims)
end

Here, the lifetime of a.ptr outlives the lifetime of a itself. If reshape is called from another method and not inlined, that method could not stack-allocate a.ptr unless it knew a.ptr wasn't captured in that way. Right now this is not an issue, because whenever we extend the lifetime of the internal data pointer beyond the outer object, we also perform a ccall, which escapes the outer object as well. If a ccall were no longer needed to extend that pointer's lifetime, then to fully stack-allocate arrays we would need to know both whether a method captures the outer object and whether it captures a pointer inside the outer object, which could become infeasible for larger objects or larger numbers of arguments. The only way I see around this would be to make the Array type completely opaque to Julia code; that is, no Julia code could see or modify the internal ptr field. Unfortunately, I suspect this brings us back to Array being special-cased, because with an opaque type the Julia compiler couldn't see into the type to do constant propagation.

@vtjnash (Member) commented Jan 14, 2022

That is not really a new point, but is already a problem: just replace "struct Array" with any existing type, such as SubArray, Ref, Tuple, etc., and you have provided a rough definition for escape analysis already.

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@pchintalapudi (Member, Author)

I think the crux of stack allocating the buffer data relies on the lifetime of the data pointer always being <= the lifetime of the containing object. If we had a way of communicating that sort of relationship within the language to the compiler, we could extend stack allocation to subfields of generic objects as well. That seems fairly risky though, as there isn't really a clear way to warn someone when they're about to violate that assumption, and it's really really easy to violate that assumption.
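A hypothetical illustration of that lifetime relationship (Wrapper and the function names here are invented for this sketch): stack-allocating a contained buffer is only sound when the buffer cannot outlive its container.

```julia
mutable struct Wrapper
    data::Vector{Int}
end

# Sound to stack-allocate w.data here: the buffer never escapes `w`,
# so its lifetime is bounded by the wrapper's.
sum_wrapped(w::Wrapper) = sum(w.data)

# Not sound for that optimization: the returned reference to w.data
# outlives `w` itself, the same capture pattern as the reshape example.
leak(w::Wrapper) = w.data
```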

Of course, this is also a pretty niche case that relies on a lot of different conditions and analyses being present (we have neither stack allocation for arrays nor IPO escape analysis yet, although I believe both are coming soon), so it's hard to say whether such an optimization is even worth the effort. That said, as Valentin says, this may be more worthwhile as an optimization now, while we don't have a full-Julia implementation of Array, especially since basic array stack allocation is 4-5 PRs down the line from this one (#43573) and depends on the array identification machinery proposed here. In the future, if Array is specified entirely in Julia, we may end up purging the arraysize/arraylen codegen functions anyway, and the codegen modifications here can be dropped as those are deleted.

@PallHaraldsson (Contributor)

This has been approved by 1/4 reviewers. Is it OK to merge? Are the "3 failing checks" false alarms? This seems like important work that is blocking the very important stack-allocation PR.

5 participants