
preallocate actually has more (while not by number of bytes) allocations! #22

Open
PallHaraldsson opened this issue Apr 20, 2020 · 13 comments

Comments

@PallHaraldsson

PallHaraldsson commented Apr 20, 2020

FYI: I was looking into this package, and ran the code from the README, and (at least on Julia 1.5) it has two extra allocations.

[From the tagline "Remember what memory we needed last time and use it gain every time after": better to change that to "to gain", and to finish with a period?]

Since preallocation wasn't clearly (much) faster, and isn't that the point (along with lower, or no, variability for real-time use), I thought the timing might not be reliable, so I checked:

julia> @btime f(A, B, C);
  33.333 ms (4 allocations: 15.26 MiB)

julia> @btime preallocated_f(A, B, C);
  27.981 ms (6 allocations: 272 bytes)

julia> @btime preallocated_f(A, B, C);
  38.364 ms (6 allocations: 272 bytes)

julia> @btime preallocated_f(A, B, C);
  21.913 ms (6 allocations: 272 bytes)

julia> @btime f(A, B, C);
  22.533 ms (4 allocations: 15.26 MiB)

I guess this is also useful for GPUs; is some such mechanism needed there?

Is a better example needed to show the usefulness? FYI2, I tried with larger arrays (and I think it mostly shows how fast/good Julia's GC is):

julia> A, B, C = (rand(10000, 10000) for _ in 1:3)
Base.Generator{UnitRange{Int64},var"#14#15"}(var"#14#15"(), 1:3)

[..]

julia> @btime f(A, B, C);
  23.648 s (4 allocations: 1.49 GiB)

julia> @btime preallocated_f(A, B, C);
  22.597 s (6 allocations: 272 bytes)

julia> @time preallocated_f(A, B, C);
 22.931816 seconds (6 allocations: 272 bytes)

julia> @time f(A, B, C);
 22.840137 seconds (4 allocations: 1.490 GiB)
@PallHaraldsson PallHaraldsson changed the title preallocate actually has more allocations! preallocate actually has more (while not by number of bytes) allocations! Apr 20, 2020
@oxinabox
Owner

Correct, preallocation has more allocations, but they are much smaller.
They should not be there, and I think they can be removed; it just requires a fair bit of careful digging around to find and fix them.
I think they are related to type inference failing.
Getting down to zero allocations when using a thing is fiddly.

I believe the performance changes are not visible in small benchmarks.
A better example is needed.
I suspect this would do quite well on Optim (though I'm not sure how dynamic Optim's memory allocation pattern is)
@Roger-luo reports big speedups with Zygote/Flux

@PallHaraldsson
Author

PallHaraldsson commented Apr 20, 2020

small benchmarks

See my addition above. Yes, I figured the actual number of allocations is not a big deal; it's still surprising that the number of bytes doesn't matter much. [The most useful comment here might be the one about the tagline?]

@oxinabox
Owner

small benchmarks
See my addition above. Yes, I figured the actual number of allocations is not a big deal; still surprising that the number of bytes doesn't matter much.

Yeah, it's not just the size of the memory allocation; I think this shows up most when you have a large number of largish allocations.
In particular, you need to actually be triggering GC for it to really matter, I think.

(and I think it mostly shows how fast/good Julia's GC is):

You are probably finding that Julia's GC is not being triggered.
GC only runs when you start to run a bit low on RAM.

But it does show how good Julia's allocator is at creating memory.

@PallHaraldsson
Author

PallHaraldsson commented Apr 20, 2020

The D language has @nogc for functions (and such functions can only call other @nogc functions). I've thought something similar might be useful for Julia (it should be doable with a macro, though it's tricky, since it has to restrict those functions to calling only other such functions).

But it is showing how good Julia's allocator is at creating memory

Right, I was trying to exercise it, and yes, to see GC triggered (it might work on your machine to test). Since I got 128 GB in my desktop, I've probably not seen it since...

Since triggering GC often actually happens on other machines, being able to simulate such an environment would be nice, and good for knowing the slowdown. Maybe I should sometimes time under a VM or with some settings (ulimit), or could BenchmarkTools simulate such an environment? I tried prlimit, which wasn't too helpful:

prlimit -d700000000:700000000 ~/julia-1.5-012b270df6/bin/julia
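One way to simulate such a low-memory environment is to cap the address space with ulimit in a subshell, so the limit doesn't leak into the parent shell. A minimal sketch (the 50 MiB cap and the Python one-liner are just stand-ins for a real Julia benchmark; on Linux ulimit -v is enforced, while macOS may ignore it):

```shell
# Hypothetical demo: cap address space at ~50 MiB (ulimit -v takes KiB),
# then try to allocate 200 MiB; under the cap the allocation should fail.
# The subshell keeps the limit from sticking to the parent shell.
(ulimit -v 51200; python3 -c "x = bytearray(200 * 1024 * 1024)") 2>/dev/null \
  && echo "allocation succeeded" || echo "allocation blocked by ulimit"
```

Replacing the Python one-liner with a julia invocation would cap a whole benchmark session the same way.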

@Roger-luo
Collaborator

I think somehow there is a performance regression on Julia 1.5; although I thought JuliaLang/julia#34126 would fix our last few small allocations due to RefValue, now we are having more allocations even in the README example...

I'm not sure where the allocation actually happens. @oxinabox, any idea how to actually find the allocations? Reading the IR code directly seems quite nasty to me.

@oxinabox
Owner

oxinabox commented May 7, 2020

I do not have a good way.

@cscherrer

This is supposed to do it, but it's not generating .mem files for me. Not sure what I'm doing wrong:
https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis

@cscherrer

cscherrer commented Mar 19, 2021

Ok so this is not a great way to do it, but it's a start...

  1. Make sure the functions you want to inspect are in a file. They might even need to be in a package; I'm not sure about that. But the REPL isn't enough.
  2. Start Julia with julia --track-allocation=all
  3. Call the functions you're interested in
  4. Call Profile.clear_malloc_data()
  5. Call them again, this time for the actual measurement
  6. Exit Julia

Now you can use fd "\.mem" with the fd utility (find works fine too, though a bit slower) to get the files.

To get the actual lines, you can do fd "\.mem" | xargs sort -n
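As a sketch of that post-processing step with synthetic .mem contents (find stands in for fd, which may not be installed; real .mem files start each line with a byte count, and the 987335-style tag is part of the filename):

```shell
# Create two fake .mem files in a scratch directory, then sort all of
# their lines numerically by the leading allocation count, as above.
dir=$(mktemp -d)
printf '       32 println(io, xs...)\n     4000 dest = similar(A, shape)\n' \
  > "$dir/array.jl.987335.mem"
printf '        0 foo()\n      256 y = iterate(itr, st)\n' \
  > "$dir/iterators.jl.987335.mem"
find "$dir" -name '*.mem' | xargs sort -n | tail -n 3
```

The last line printed is the biggest allocator (here the 4000-byte `similar` call), which is what you scan for at the bottom of the sorted output.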

In this case, the ones with actual numbers (allocations) were

       32 println(io::IO, xs...) = print(io, xs..., "\n")
       48     sort!(p, alg, Perm(ordr,v))
       80     b = Vector{T}(undef, length(a))
       96 println(xs...) = println(stdout::IO, xs...)
      128         y = iterate(itr)
      144         show(io, x)
      224         collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)
      256         y = iterate(itr, st)
      432             print(io, x)
      448 indexed_iterate(t::Tuple, i::Int, state=1) = (@_inline_meta; (getfield(t, i), i+1))
      480                 let t = t, val = val; _all(i->val[i] isa fieldtype(t, i), 1:n); end
      528     return :($(esc(ex)) ? $(nothing) : throw(AssertionError($msg)))
     1456     esc(isa(ex, Expr) ? pushmeta!(ex, :noinline) : ex)
     2000     p = similar(Vector{eltype(ax)}, ax)
     4000     dest = similar(A, shape)

Note that if you run it several times, you'll end up with several timestamps. You can do ls -alth to see which ones are the newest. The number before .mem is a sort of tag.

In my case, the newest one is 987335. So

➤ fd "\.mem" | grep 987335
julia/julia-1.6.0-rc1/share/julia/base/Base.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/abstractarray.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/array.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/arrayshow.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/boot.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/channels.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/compiler/abstractinterpretation.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/compiler/stmtinfo.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/coreio.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/deepcopy.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/deprecated.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/dict.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/error.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/errorshow.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/essentials.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/expr.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/float.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/generator.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/gmp.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/io.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/iterators.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/loading.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/meta.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/mpfr.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/multidimensional.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/multimedia.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/operators.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/ordering.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/process.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reduce.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reducedim.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reflection.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/regex.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/set.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/show.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/sort.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/stream.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/strings/io.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/subarray.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/task.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/threads.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/tuple.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/util.jl.987335.mem

and then

➤ fd "\.mem" | grep 987335 | xargs sort -n | tail -n 20
        - "∀ ε > 0, ∃ δ > 0: |x-y| < δ ⇒ |f(x)-f(y)| < ε"
        - "εδxyδfxfyε"
        - "π"
        - "\\\\0"
        - "\\\\x000"
       32 println(io::IO, xs...) = print(io, xs..., "\n")
       48     sort!(p, alg, Perm(ordr,v))
       80     b = Vector{T}(undef, length(a))
       96 println(xs...) = println(stdout::IO, xs...)
      128         y = iterate(itr)
      144         show(io, x)
      224         collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)
      256         y = iterate(itr, st)
      432             print(io, x)
      448 indexed_iterate(t::Tuple, i::Int, state=1) = (@_inline_meta; (getfield(t, i), i+1))
      480                 let t = t, val = val; _all(i->val[i] isa fieldtype(t, i), 1:n); end
      528     return :($(esc(ex)) ? $(nothing) : throw(AssertionError($msg)))
     1456     esc(isa(ex, Expr) ? pushmeta!(ex, :noinline) : ex)
     2000     p = similar(Vector{eltype(ax)}, ax)
     4000     dest = similar(A, shape)

@dpinol

dpinol commented Oct 17, 2022

With Julia 1.8, this sample case generates 1.2k allocations instead of 1:

julia> using AutoPreallocation

julia> function f()
           x = rand(100)
           nothing
       end
f (generic function with 1 method)

julia> x, preallocated_f = preallocate(f);

julia> preallocated_f()


julia> @timev  preallocated_f()
  0.000053 seconds (1.20 k allocations: 24.906 KiB)
elapsed time (ns):  52981
gc time (ns):       0
bytes allocated:    25504
pool allocs:        1196
non-pool GC allocs: 0
minor collections:  0
full collections:   0

julia> @timev  f()
  0.000004 seconds (1 allocation: 896 bytes)
elapsed time (ns):  4020
gc time (ns):       0
bytes allocated:    896
pool allocs:        1
non-pool GC allocs: 0
minor collections:  0
full collections:   0

@oxinabox
Owner

This can't really be fixed here (except in special cases); it needs to be fixed in Cassette, or more likely by rewriting this package on top of Mixtape.jl.

@PallHaraldsson
Author

I don't know much about the implementation of this package, of Cassette.jl which it depends on, or of Mixtape.jl. Should an issue be opened at Cassette.jl? Is that becoming a redundant package?

Should this issue be kept open, or closed as not planned to be fixed? I'm not rushing anyone; I was just exploring and opened the issue, and I'm not (really) a user. Possibly the status should be documented.

@oxinabox
Owner

oxinabox commented Dec 8, 2022

This whole package should be documented as an experiment, and largely as a dead end.
The idea is fine, but the ability to use it in practice is severely limited.

@bvdmitri

The same here:

julia> using Distributions, AutoPreallocation

julia> L = rand(2, 2);

julia> S = 2I + L * L';

julia> d = MvNormal(rand(2), S);

julia> x = rand(2);

julia> logpdf(d, x)
-2.8535848089575633

julia> @btime logpdf($d, $x)
  118.435 ns (3 allocations: 176 bytes)
-2.8535848089575633

julia> _, pf = preallocate(logpdf, d, x)
(-2.8535848089575633, preallocate(logpdf, ::FullNormal, ::Vector{Float64}))

julia> @btime $pf($d, $x)
  266.003 ns (9 allocations: 272 bytes)
-2.8535848089575633
