
preallocate actually has more (while not by number of bytes) allocations! #22

Open
PallHaraldsson opened this issue Apr 20, 2020 · 13 comments

Comments

@PallHaraldsson

PallHaraldsson commented Apr 20, 2020

FYI: I was looking into this package, and ran the code from the README, and (at least on Julia 1.5) it has two extra allocations.

[From the tagline "Remember what memory we needed last time and use it gain every time after": better to change that to "to gain", and to finish with a period?]

Since preallocation wasn't clearly (much) faster, and isn't that the point (along with lower, or no, variability for real-time use), I thought the timing might not be reliable, so I checked:

julia> @btime f(A, B, C);
  33.333 ms (4 allocations: 15.26 MiB)

julia> @btime preallocated_f(A, B, C);
  27.981 ms (6 allocations: 272 bytes)

julia> @btime preallocated_f(A, B, C);
  38.364 ms (6 allocations: 272 bytes)

julia> @btime preallocated_f(A, B, C);
  21.913 ms (6 allocations: 272 bytes)

julia> @btime f(A, B, C);
  22.533 ms (4 allocations: 15.26 MiB)

I guess this is also useful for GPUs; is some such mechanism needed there?

Is a better example needed to show the usefulness? FYI2, I tried with larger arrays (and I think it mostly shows how fast/good Julia's GC is):

julia> A, B, C = (rand(10000, 10000) for _ in 1:3)
Base.Generator{UnitRange{Int64},var"#14#15"}(var"#14#15"(), 1:3)

[..]

julia> @btime f(A, B, C);
  23.648 s (4 allocations: 1.49 GiB)

julia> @btime preallocated_f(A, B, C);
  22.597 s (6 allocations: 272 bytes)

julia> @time preallocated_f(A, B, C);
 22.931816 seconds (6 allocations: 272 bytes)

julia> @time f(A, B, C);
 22.840137 seconds (4 allocations: 1.490 GiB)
@PallHaraldsson PallHaraldsson changed the title preallocate actually has more allocations! preallocate actually has more (while not by number of bytes) allocations! Apr 20, 2020
@oxinabox
Owner

Correct, preallocation has more allocations, but they are much smaller.
They should not be there, and I think they can be removed; it just requires a fair bit of careful digging around to find and fix them.
I think they are related to type inference failing.
Getting down to zero allocations when using a thing is fiddly.

I believe the performance changes are not visible in small benchmarks.
A better example is needed.
I suspect this would do quite well on Optim (though I'm not sure how dynamic Optim's memory allocation pattern is)
@Roger-luo reports big speedups with Zygote/Flux

@PallHaraldsson
Author

PallHaraldsson commented Apr 20, 2020

small benchmarks

See my addition above. Yes, I figured the actual number of allocations is not a big deal; it's still surprising that the number of bytes doesn't matter much. [The most useful comment here might be the one about the tagline?]

@oxinabox
Owner

small benchmarks
See my addition above. Yes, I figured the actual number of allocations is not a big deal; still surprising that the number of bytes doesn't matter much.

Yeah, it's not just the size of the memory allocation; I think this shows up most when you have a large number of largish allocations.
In particular, you need to actually be triggering GC for it to really matter, I think.

(and I think it mostly shows how fast/good Julia's GC is):

You are probably finding that Julia's GC is not being triggered.
GC only runs when you start to run a bit low on RAM.

But it does show how good Julia's allocator is at creating memory.

@PallHaraldsson
Author

PallHaraldsson commented Apr 20, 2020

The D language has @nogc for functions (and such functions can only call other @nogc functions). I've thought something similar might be useful for Julia (it should be doable with a macro, though it's tricky, since it has to restrict those functions to calling only other such functions).

But it is showing how good Julia's allocator is at creating memory

Right, I was trying to exercise it, and yes, to see GC triggered (it might work on your machine to test). Since I got 128 GB in my desktop, I've probably not seen it since...

Since triggering GC often actually happens on other machines, being able to simulate such an environment would be nice, and good for knowing the slowdown. Maybe I should sometimes time under a VM or with some settings (ulimit), or could BenchmarkTools simulate such an environment? I tried prlimit, which wasn't too helpful:

prlimit -d700000000:700000000 ~/julia-1.5-012b270df6/bin/julia
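One way to simulate such a low-memory environment is to cap the address space with ulimit in a subshell, so the limit doesn't leak into the parent shell. A minimal sketch (the 50 MiB cap and the Python one-liner are just stand-ins for a real Julia benchmark; on Linux ulimit -v is enforced, while macOS may ignore it):

```shell
# Hypothetical demo: cap address space at ~50 MiB (ulimit -v takes KiB),
# then try to allocate 200 MiB; under the cap the allocation should fail.
# The subshell keeps the limit from sticking to the parent shell.
(ulimit -v 51200; python3 -c "x = bytearray(200 * 1024 * 1024)") 2>/dev/null \
  && echo "allocation succeeded" || echo "allocation blocked by ulimit"
```

Replacing the Python one-liner with a julia invocation would cap a whole benchmark session the same way.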

@Roger-luo
Collaborator

I think somehow there is a performance regression on Julia 1.5; although I thought JuliaLang/julia#34126 would fix our last few small allocations due to RefValue, now we are having more allocations even in the README example...

I'm not sure where the allocation actually happens. @oxinabox, any idea how to actually find the allocations? Reading the IR code directly seems quite nasty to me.

@oxinabox
Owner

oxinabox commented May 7, 2020

I do not have a good way.

@cscherrer

This is supposed to do it, but it's not generating .mem files for me. Not sure what I'm doing wrong:
https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis

@cscherrer

cscherrer commented Mar 19, 2021

Ok so this is not a great way to do it, but it's a start...

  1. Make sure the functions you want to inspect are in a file. They might even need to be in a package; I'm not sure about that. But the REPL isn't enough.
  2. Start Julia with julia --track-allocation=all
  3. Call the functions you're interested in
  4. Call Profile.clear_malloc_data()
  5. Call them again, this time for the actual measurement
  6. Exit Julia

Now you can use fd "\.mem" with the fd utility (find works fine too, though a bit slower) to get the files.

To get the actual lines, you can do fd "\.mem" | xargs sort -n
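As a sketch of that post-processing step with synthetic .mem contents (find stands in for fd, which may not be installed; real .mem files start each line with a byte count, and the 987335-style tag is part of the filename):

```shell
# Create two fake .mem files in a scratch directory, then sort all of
# their lines numerically by the leading allocation count, as above.
dir=$(mktemp -d)
printf '       32 println(io, xs...)\n     4000 dest = similar(A, shape)\n' \
  > "$dir/array.jl.987335.mem"
printf '        0 foo()\n      256 y = iterate(itr, st)\n' \
  > "$dir/iterators.jl.987335.mem"
find "$dir" -name '*.mem' | xargs sort -n | tail -n 3
```

The last line printed is the biggest allocator (here the 4000-byte `similar` call), which is what you scan for at the bottom of the sorted output.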

In this case, the ones with actual numbers (allocations) were

       32 println(io::IO, xs...) = print(io, xs..., "\n")
       48     sort!(p, alg, Perm(ordr,v))
       80     b = Vector{T}(undef, length(a))
       96 println(xs...) = println(stdout::IO, xs...)
      128         y = iterate(itr)
      144         show(io, x)
      224         collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)
      256         y = iterate(itr, st)
      432             print(io, x)
      448 indexed_iterate(t::Tuple, i::Int, state=1) = (@_inline_meta; (getfield(t, i), i+1))
      480                 let t = t, val = val; _all(i->val[i] isa fieldtype(t, i), 1:n); end
      528     return :($(esc(ex)) ? $(nothing) : throw(AssertionError($msg)))
     1456     esc(isa(ex, Expr) ? pushmeta!(ex, :noinline) : ex)
     2000     p = similar(Vector{eltype(ax)}, ax)
     4000     dest = similar(A, shape)

Note that if you run it several times, you'll end up with several timestamps. You can do ls -alth to see which ones are the newest. The number before .mem is a sort of tag.

In my case, the newest one is 987335. So

➤ fd "\.mem" | grep 987335
julia/julia-1.6.0-rc1/share/julia/base/Base.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/abstractarray.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/array.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/arrayshow.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/boot.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/channels.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/compiler/abstractinterpretation.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/compiler/stmtinfo.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/coreio.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/deepcopy.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/deprecated.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/dict.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/error.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/errorshow.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/essentials.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/expr.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/float.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/generator.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/gmp.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/io.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/iterators.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/loading.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/meta.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/mpfr.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/multidimensional.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/multimedia.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/operators.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/ordering.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/process.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reduce.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reducedim.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reflection.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/regex.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/set.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/show.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/sort.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/stream.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/strings/io.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/subarray.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/task.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/threads.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/tuple.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/util.jl.987335.mem

and then

➤ fd "\.mem" | grep 987335 | xargs sort -n | tail -n 20
        - "∀ ε > 0, ∃ δ > 0: |x-y| < δ ⇒ |f(x)-f(y)| < ε"
        - "εδxyδfxfyε"
        - "π"
        - "\\\\0"
        - "\\\\x000"
       32 println(io::IO, xs...) = print(io, xs..., "\n")
       48     sort!(p, alg, Perm(ordr,v))
       80     b = Vector{T}(undef, length(a))
       96 println(xs...) = println(stdout::IO, xs...)
      128         y = iterate(itr)
      144         show(io, x)
      224         collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)
      256         y = iterate(itr, st)
      432             print(io, x)
      448 indexed_iterate(t::Tuple, i::Int, state=1) = (@_inline_meta; (getfield(t, i), i+1))
      480                 let t = t, val = val; _all(i->val[i] isa fieldtype(t, i), 1:n); end
      528     return :($(esc(ex)) ? $(nothing) : throw(AssertionError($msg)))
     1456     esc(isa(ex, Expr) ? pushmeta!(ex, :noinline) : ex)
     2000     p = similar(Vector{eltype(ax)}, ax)
     4000     dest = similar(A, shape)

@dpinol

dpinol commented Oct 17, 2022

With Julia 1.8, this sample case generates 1.2k allocations instead of 1:

julia> using AutoPreallocation

julia> function f()
           x = rand(100)
           nothing
       end
f (generic function with 1 method)

julia> x, preallocated_f = preallocate(f);

julia> preallocated_f()


julia> @timev  preallocated_f()
  0.000053 seconds (1.20 k allocations: 24.906 KiB)
elapsed time (ns):  52981
gc time (ns):       0
bytes allocated:    25504
pool allocs:        1196
non-pool GC allocs: 0
minor collections:  0
full collections:   0

julia> @timev  f()
  0.000004 seconds (1 allocation: 896 bytes)
elapsed time (ns):  4020
gc time (ns):       0
bytes allocated:    896
pool allocs:        1
non-pool GC allocs: 0
minor collections:  0
full collections:   0

@oxinabox
Owner

This can't really be fixed here (except in special cases); it needs to be fixed in Cassette, or more likely by rewriting this package on top of Mixtape.jl.

@PallHaraldsson
Author

I don't know much about the implementation of this package, of Cassette.jl which it depends on, or of Mixtape.jl. Should an issue be opened at Cassette.jl? Is that becoming a redundant package?

Should this issue be kept open, or closed as not planned to be fixed? I'm not rushing anyone; I was just exploring and opened the issue, and I'm not (really) a user. Possibly the status should be documented.

@oxinabox
Owner

oxinabox commented Dec 8, 2022

This whole package should be documented as an experiment, and largely as a dead end.
The idea is fine, but the ability to use it in practice is severely limited.

@bvdmitri

The same here:

julia> using Distributions, AutoPreallocation

julia> L = rand(2, 2);

julia> S = 2I + L * L';

julia> d = MvNormal(rand(2), S);

julia> x = rand(2);

julia> logpdf(d, x)
-2.8535848089575633

julia> @btime logpdf($d, $x)
  118.435 ns (3 allocations: 176 bytes)
-2.8535848089575633

julia> _, pf = preallocate(logpdf, d, x)
(-2.8535848089575633, preallocate(logpdf, ::FullNormal, ::Vector{Float64}))

julia> @btime $pf($d, $x)
  266.003 ns (9 allocations: 272 bytes)
-2.8535848089575633
