
Propagate array dimensions from allocations #43487

Conversation

@pchintalapudi (Member) commented Dec 20, 2021

2D and 3D array dimension lengths cannot change, so loads of the array length and of individual dimension sizes can be replaced with the values recorded when the array was allocated. A 1D array's length can change (e.g. via push! or resize!), but escape analysis can determine when it does not.
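A minimal sketch of that invariant in plain Julia (the variable names are just for illustration):

```julia
# Dimensions of 2-D/3-D arrays are fixed at allocation, so size() and
# length() can be folded to the allocation arguments. Only Vector (1-D)
# supports in-place resizing.
v = zeros(Int, 3)          # 1-D: length may change later
push!(v, 4)                # legal; length(v) is now 4

m = zeros(Int, 2, 3)       # 2-D: size is fixed at (2, 3) for its lifetime
# push!(m, 5)              # would throw a MethodError: resizing is only
                           # defined for Vector

@assert length(v) == 4
@assert size(m) == (2, 3)
```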

Currently, the mechanism for identifying array allocations does not work on Windows: a trampoline function inserted there prevents identification by function-pointer equivalence from working as intended (this data was collected from #43107). Fixing this for Windows would require overhauling how we handle ccalls during codegen, which is out of scope for this PR.

Identification of arrays is now done through metadata set during codegen, which should theoretically get around the Windows issue.

Depends on #43057, which creates the llvm-alloc-helpers.cpp file.

Performance benchmark:

Function:

function f()
    s = 0
    a = zeros(Int, 100, 100, 100)
    for i in eachindex(a)
        s += a[i]
    end
    s
end

Master:
Godbolt: https://godbolt.org/z/cj7h3e3Gr
Benchmark:

julia> @benchmark f()
BenchmarkTools.Trial: 3179 samples with 1 evaluation.
 Range (min … max):  917.762 μs …  13.191 ms  ┊ GC (min … max):  0.00% … 92.57%
 Time  (median):       1.413 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.566 ms ± 802.684 μs  ┊ GC (mean ± σ):  10.83% ± 15.99%

  ▅▆█▄        ▂▇▆▃                  ▂▂▃▂  ▃▂    ▁▁  ▂   ▁▃▂   ▂ ▁
  ████▅▅▇█▄▁▁▁████▅▄▆▆▅▁▁▃▄▄▃▁▁▁▁▁▃█████▇███▇▄▄███▇███▅▆███▇▄██ █
  918 μs        Histogram: log(frequency) by time       3.24 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

PR:
Godbolt: https://godbolt.org/z/zKWz8dxev
Benchmark:

julia> @benchmark f()
BenchmarkTools.Trial: 6350 samples with 1 evaluation.
 Range (min … max):  687.484 μs …   4.911 ms  ┊ GC (min … max): 0.00% …  0.00%
 Time  (median):     702.864 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   782.345 μs ± 188.419 μs  ┊ GC (mean ± σ):  9.65% ± 14.78%

   █
  ▅█▇▃▄▂▃▃▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▂▂▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▄▃▃▂▂▂ ▂
  687 μs           Histogram: frequency by time         1.17 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

@pchintalapudi (Member, Author)

Since my branches are part of a fork, I've created a PR comparing this one and the allocation hoisting PR immediately before it here: pchintalapudi#4

@pchintalapudi added the arrays, compiler:codegen (Generation of LLVM IR and native code), and performance (Must go faster) labels Dec 29, 2021
@pchintalapudi pchintalapudi force-pushed the pc/multi-d-array-propagation branch 2 times, most recently from a351c5e to 6e3b743 Compare January 7, 2022 06:45
@pchintalapudi (Member, Author)

Now that #43057 has been merged, this PR is ready to be reviewed for rebasing onto the master branch.

@pchintalapudi (Member, Author)

Compile time benchmarks:

Master:

Base  ───────────── 23.796904 seconds
ArgTools  ─────────  4.456852 seconds
Artifacts  ────────  0.102586 seconds
Base64  ───────────  0.105645 seconds
CRC32c  ───────────  0.008005 seconds
FileWatching  ─────  0.108983 seconds
Libdl  ────────────  0.001518 seconds
Logging  ──────────  0.036208 seconds
Mmap  ─────────────  0.085615 seconds
NetworkOptions  ───  0.117357 seconds
SHA  ──────────────  0.181987 seconds
Serialization  ────  0.270087 seconds
Sockets  ──────────  0.321430 seconds
Unicode  ──────────  0.077732 seconds
DelimitedFiles  ───  0.104025 seconds
LinearAlgebra  ────  7.762580 seconds
Markdown  ─────────  0.754207 seconds
Printf  ───────────  0.168311 seconds
Random  ───────────  1.242392 seconds
Tar  ──────────────  0.282602 seconds
Dates  ────────────  1.536529 seconds
Distributed  ──────  0.764613 seconds
Future  ───────────  0.005194 seconds
InteractiveUtils  ─  0.500062 seconds
LibGit2  ──────────  1.366151 seconds
Profile  ──────────  0.257939 seconds
SparseArrays  ─────  2.678179 seconds
UUIDs  ────────────  0.015961 seconds
REPL  ─────────────  3.456523 seconds
SharedArrays  ─────  0.500804 seconds
Statistics  ───────  0.188637 seconds
SuiteSparse  ──────  1.582334 seconds
TOML  ─────────────  0.067595 seconds
Test  ─────────────  0.315075 seconds
LibCURL  ──────────  0.438543 seconds
Downloads  ────────  0.335553 seconds
Pkg  ──────────────  4.502768 seconds
LazyArtifacts  ────  0.001633 seconds
Stdlibs total  ──── 34.712985 seconds
Sysimage built. Summary:
Total ───────  58.511190 seconds
Base: ───────  23.796904 seconds 40.6707%
Stdlibs: ────  34.712985 seconds 59.3271%
    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 36/36
Executing precompile statements... 1308/1339
Precompilation complete. Summary:
Total ─────── 109.996850 seconds
Generation ──  82.400907 seconds 74.9121%
Execution ───  27.595943 seconds 25.0879%

PR:

Base  ───────────── 23.928553 seconds
ArgTools  ─────────  4.422442 seconds
Artifacts  ────────  0.103806 seconds
Base64  ───────────  0.106303 seconds
CRC32c  ───────────  0.008324 seconds
FileWatching  ─────  0.110224 seconds
Libdl  ────────────  0.001559 seconds
Logging  ──────────  0.036711 seconds
Mmap  ─────────────  0.086969 seconds
NetworkOptions  ───  0.117836 seconds
SHA  ──────────────  0.184373 seconds
Serialization  ────  0.271228 seconds
Sockets  ──────────  0.322758 seconds
Unicode  ──────────  0.078247 seconds
DelimitedFiles  ───  0.105199 seconds
LinearAlgebra  ────  7.820788 seconds
Markdown  ─────────  0.758388 seconds
Printf  ───────────  0.151605 seconds
Random  ───────────  1.259426 seconds
Tar  ──────────────  0.285620 seconds
Dates  ────────────  1.510449 seconds
Distributed  ──────  0.776262 seconds
Future  ───────────  0.005317 seconds
InteractiveUtils  ─  0.503558 seconds
LibGit2  ──────────  1.377215 seconds
Profile  ──────────  0.258898 seconds
SparseArrays  ─────  2.681444 seconds
UUIDs  ────────────  0.016312 seconds
REPL  ─────────────  3.456168 seconds
SharedArrays  ─────  0.501174 seconds
Statistics  ───────  0.191785 seconds
SuiteSparse  ──────  1.591476 seconds
TOML  ─────────────  0.068235 seconds
Test  ─────────────  0.320033 seconds
LibCURL  ──────────  0.441098 seconds
Downloads  ────────  0.336550 seconds
Pkg  ──────────────  4.571783 seconds
LazyArtifacts  ────  0.001639 seconds
Stdlibs total  ──── 34.851874 seconds
Sysimage built. Summary:
Total ───────  58.781741 seconds
Base: ───────  23.928553 seconds 40.7075%
Stdlibs: ────  34.851874 seconds 59.2903%
    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 36/36
Executing precompile statements... 1308/1339
Precompilation complete. Summary:
Total ─────── 109.750784 seconds
Generation ──  81.986599 seconds 74.7025%
Execution ───  27.764185 seconds 25.2975%

@pchintalapudi (Member, Author)

Other notable examples of optimization improvement:


Example 2: doc/src/manual/performance-tips.md:952-963 loopinc

Master:

Godbolt: https://godbolt.org/z/oT9Wa4784

PR:

Godbolt: https://godbolt.org/z/r1or75rzK

Example 3: doc/src/manual/performance-tips.md:969-984 loopinc_prealloc

Master:

Godbolt: https://godbolt.org/z/7Ed4xd3o9

PR:

Godbolt: https://godbolt.org/z/G6adnc69f


Both examples show complete removal of the bounds check due to length propagation; LLVM then recognizes that all of the loads from the array can be forwarded and eliminates them. When combined with #43548 (which can remove arrays that are never loaded from), and provided xinc/xinc! is not marked noinline, LLVM will either vectorize loopinc (a massive speedup) or outright precompute the sum in loopinc_prealloc (another O(n) -> O(1) transformation).
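For context, a sketch of the loopinc pattern from the performance-tips section (paraphrased from memory, not the manual's exact code): xinc allocates a fresh small array on every call, and once the array's length is known from the allocation, the bounds check on ret[2] can be dropped and the load forwarded.

```julia
function xinc(x)
    return [x, x + 1, x + 2]   # fresh 3-element allocation on every call
end

function loopinc()
    y = 0
    for i in 1:10^7
        ret = xinc(i)          # length(ret) == 3 is known from the allocation,
        y += ret[2]            # so this bounds check can be eliminated
    end
    return y
end
```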

@vchuravy (Member) commented Jan 8, 2022

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vchuravy (Member)

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vchuravy (Member)

CI failed with:

julia: /buildworker/worker/package_linuxaarch64/build/src/llvm-array-opt.cpp:253: bool {anonymous}::Optimizer::optimizeCmpInsts(): Assertion `inst->getOperand(!idx) == inst && "Expected instruction's operands to include dim!"' failed.
signal (6): Aborted
in expression starting at none:0
gsignal at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
Allocations: 173605932 (Pool: 173531052; Big: 74880); GC: 143
Aborted (core dumped)

@vchuravy (Member)

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@pchintalapudi pchintalapudi force-pushed the pc/multi-d-array-propagation branch from 4a18af9 to cc79187 Compare January 14, 2022 01:31
@vchuravy (Member)

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@vchuravy (Member)

@nanosoldier runtests(ALL, vs = ":master")

@vchuravy (Member)

If nanosoldier comes back clean I am going to merge this.

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@nanosoldier (Collaborator)

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vtjnash (Member) commented Jan 14, 2022

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@vtjnash (Member) commented Jan 14, 2022

I am not certain if we want to move forward with tricks like this. The goal for Array was to replace it with something simpler implemented in Julia, which would hopefully allow the compiler to "see" this, without needing to make it a special case again in codegen (we had some code like this many years ago).

@vchuravy (Member)

> The goal for Array was to replace it with something simpler implemented in Julia, which would hopefully allow the compiler to "see" this

I think both Prem and I agree with that notion. But it is also the case that the move from Array to Buffer hasn't happened yet (and we are adding things like ImmutableArrays), so this may be a case of the perfect being the enemy of the good.

@pchintalapudi (Member, Author)

I think there's another point to consider, which is that if Array is implemented in Julia as something like:

struct Array
    ptr::Buffer
    dims::Tuple
end

we cannot stack-allocate the internal buffer of the array even with interprocedural escape analysis. The reasoning behind that is linked to the possible reshape method below:

function reshape(a::Array, dims::Tuple)
    # the result shares a.ptr, so the buffer outlives `a`
    return Array(a.ptr, dims)
end

Here, the lifetime of a.ptr outlives the lifetime of a itself. If reshape is called from another method and not inlined, that method could not stack-allocate a.ptr unless it knew a.ptr wasn't captured in that way. Right now this is not an issue, because whenever we extend the lifetime of the internal data pointer beyond the outer object, we also perform a ccall, which escapes the outer object as well. If a ccall were no longer needed to extend that pointer's lifetime, then to fully stack-allocate arrays we would need to know both whether a method captures the outer object and whether it captures a pointer inside the outer object, which could become infeasible for larger objects or larger numbers of arguments. The only way I see around this would be to make the Array type completely opaque to Julia code; that is, no Julia code could see or modify the internal ptr field. Unfortunately, I suspect this brings us back to Array being special-cased, because with an opaque type the Julia compiler couldn't see into the type to do constant propagation.

@vtjnash (Member) commented Jan 14, 2022

That is not really a new point, but is already a problem: just replace "struct Array" with any existing type, such as SubArray, Ref, Tuple, etc., and you have provided a rough definition for escape analysis already.

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@pchintalapudi (Member, Author)

I think the crux of stack allocating the buffer data relies on the lifetime of the data pointer always being <= the lifetime of the containing object. If we had a way of communicating that sort of relationship within the language to the compiler, we could extend stack allocation to subfields of generic objects as well. That seems fairly risky though, as there isn't really a clear way to warn someone when they're about to violate that assumption, and it's really really easy to violate that assumption.
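A hypothetical illustration of that lifetime relationship (Wrapper and the function names here are invented for this sketch): stack-allocating a contained buffer is only sound when the buffer cannot outlive its container.

```julia
mutable struct Wrapper
    data::Vector{Int}
end

# Sound to stack-allocate w.data here: the buffer never escapes `w`,
# so its lifetime is bounded by the wrapper's.
sum_wrapped(w::Wrapper) = sum(w.data)

# Not sound for that optimization: the returned reference to w.data
# outlives `w` itself, the same capture pattern as the reshape example.
leak(w::Wrapper) = w.data
```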

Of course, this is also a pretty niche case that relies on a lot of different conditions and analyses being present (we have neither stack allocation for arrays nor IPO escape analysis yet, although I believe both are coming soon), so it's hard to say whether such an optimization is even worth the effort. That said, as Valentin says, this may be more worthwhile as an optimization now, while we don't have a full-Julia implementation of Array, especially since basic array stack allocation is 4-5 PRs down the line from this one (#43573) and depends on the array identification machinery proposed here. In the future, if Array is specified entirely in Julia, we may end up purging the arraysize/arraylen codegen functions anyway, and the codegen modifications here can be dropped as those are deleted.

@PallHaraldsson (Contributor)

This has been approved by 1/4 reviewers. Is it OK to merge? Are the "3 failing checks" false alarms? This seems like important work that is blocking the very important stack-allocation PR.

5 participants