Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocation profiler #42768

Merged
merged 112 commits into from
Jan 19, 2022
Merged

Conversation

vilterp
Copy link
Contributor

@vilterp vilterp commented Oct 22, 2021

Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: #33467
Complementary to garbage profiler PR: #42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from #33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

Usage:

using Profile.Allocs
res = Allocs.@profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Here are two sample screenshots from analyzing all allocations during function compilation and evaluation:

julia> Profile.Allocs.@profile sample_rate=1 begin
           @eval begin f() = 1; f() end
       end
1

julia> PProf.Allocs.to_pprof(Profile.Allocs.fetch(), from_c=false)
"alloc-profile.pb.gz"

Screen Shot 2022-01-07 at 11 55 58 AM

Screen Shot 2022-01-07 at 11 56 21 AM

Output:

Julia structs which can be used by downstream libraries. (See PProf.jl: JuliaPerf/PProf.jl#46)

Performance

Recording the stack traces with LibUnwind for each allocation has non-trivial overhead, so the allocs profiler comes with a sample_rate = 0.0001 that controls the probability of recording a sample for each allocation. With sample_rate = 1, this example benchmark which allocates heavily (for i in 1:100 fi = Symbol("f$i") ; @eval begin $fi() = 3 ; $fi() end) spent around 50% of time recording the stack traces (on my MacBook Pro):

Screen Shot 2022-01-06 at 2 55 05 PM

Since almost all the overhead is in jl_unw_stepn(), there's not much more that can be improved in this PR. This is why we let users control the sample rate, based on their particular workload.

TODO:

Co-Authored-By: @NHDaly

@jpsamaroo
Copy link
Member

Not saying this needs to be done for this PR, but it would be nice if there were an option to output profile results directly as in-memory objects instead of serializing to JSON. I'd like to use this for Dagger.jl to track allocations in real-time, but having to emit and parse JSON (even just within memory) is going to add overhead that feels slightly unnecessary.

@vilterp vilterp mentioned this pull request Oct 23, 2021
4 tasks
@maleadt
Copy link
Member

maleadt commented Oct 23, 2021

This may be reinventing the wheel from #33467, but I'm not sure why that one needs stuff like LLVM passes.

Because most allocations do not go through jl_gc_alloc. Look for uses of the jl_intrinsics::GCAllocBytes; that gets lowered to a call to jl_gc_pool_alloc where the type isn't know. The solution of #33467 was to introduce a new intrinsic recording the type as known during codegen.

@vilterp
Copy link
Contributor Author

vilterp commented Oct 25, 2021

Thank you @maleadt! It did feel like I was getting away with this a bit too easily 😅. I now see that generated code calls jl_gc_pool_alloc directly, and seems to insert the type tag "by hand", i.e. without going through set_typeof, e.g.

  %215 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* nonnull %116, i32 1472, i32 64) #1
  %216 = bitcast {}* %215 to i64*
  %217 = getelementptr inbounds i64, i64* %216, i64 -1
  store atomic i64 4685003024, i64* %217 unordered, align 8

Now I understand why the LLVM modifications in the other PR are necessary! Perhaps rebasing that PR is the right approach after all… 🤔 I tried before, and the amount of conflicts scared me away a bit, haha.

base/gcutils.jl Outdated Show resolved Hide resolved
src/gc-alloc-profiler.cpp Outdated Show resolved Hide resolved
@vilterp
Copy link
Contributor Author

vilterp commented Nov 12, 2021

Still not sure how best to capture the gc_pool_alloc allocs (either conditionally emit an LLVM intrinsic, or plumb the type to it), but in the meantime, it'd be good to make this PR be more like the stdlib Profile module (and #33467), namely:

  1. record the backtraces directly to a buffer, and do the stack trace decoding in Julia (having more logic in Julia is good in general, and hopefully this will fix the segfault we saw)
  2. have an API in Profile, rather than GC; something like
Profile.@profile_allocs foo()
fetch()
# pass to flamegraphs, pprof, etc

which would be defined as something like

macro profile_allocs(ex)
    return quote
        try
            enable_alloc_profiling()
            if status < 0
                error(error_codes[status])
            end
            $(esc(ex))
        finally
            disable_alloc_profiling()
        end
    end
end

Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to @vilterp's comment above, I'd like to take a different approach to this PR than we took in the previous attempts, in order to avoid getting stuck as those PRs have been.

I think an effective approach to this would be to merge the allocations profiling functionality in stages. My impression is that the previous attempts at a memory profiler have gotten stuck in trying to work out the best way to capture the allocations coming from codegen (e.g. gc_pool_alloc). This PR currently does not capture those allocations, but it does still report a meaningfully large proportion of the allocations. From our experiments (measuring against @time) it seems to capture roughly 80% of the allocations in our heavy benchmarks, so these seems still quite useful as-is.

I'd like us to wrap up the existing functionality (profiling all allocations that go through jl_gc_alloc()), and get the PR into a good state, and then review it and merge it, even though it doesn't capture all allocations.

We can file an issue describing the fact that the memory profiler doesn't capture all allocations, with some suggested work plans, and then follow-up with more PRs later. I think that if we add some clear documentation, and maybe even print a warning to the console that "This feature is experimental, and currently does not capture all allocations. The reported allocations are accurate, but maybe be missing a large proportion of the total allocations.", we could merge this and still provide some value, without letting this PR bit rot! :)

Does that sound right to others? @vilterp and I will work this week to wrap up the existing functionality and get things tested and cleaned up. :)

@vilterp
Copy link
Contributor Author

vilterp commented Dec 13, 2021

Sometimes get

ERROR: TypeError: in typeassert, expected Union{Nothing, Core.CodeInfo, Core.MethodInstance}, got a value of type Vector{UInt8}
Stacktrace:
  [1] _reformat_bt(bt::Vector{Ptr{Nothing}}, bt2::Vector{Any})
    @ Base ./error.jl:88
  [2] decode_alloc(cache::Dict{Union{Ptr{Nothing}, Base.InterpreterIP}, Vector{Base.StackTraces.StackFrame}}, raw_alloc::Main.AllocProfile.RawAlloc)
    @ Main.AllocProfile ~/code/julia/stdlib/AllocProfile/src/AllocProfile.jl:94

with this strategy, even though I think I'm doing the same thing as jl_get_backtrace. Error is on this line:

code = bt2[j]::Union{CodeInfo,Core.MethodInstance,Nothing}

Sometimes it's a different type, like Vector{Int64} or Vector{Any}, and sometimes it segfaults. Seems like bt2 is getting corrupted out from under us…

@vilterp
Copy link
Contributor Author

vilterp commented Dec 14, 2021

EDIT: yeah, things are getting GC'd out from under us — if I do GC.enable(false), these type errors go away 🤦‍♂️

@vilterp vilterp force-pushed the pv-alloc-profile-flat-stacks branch from c857d4c to f8a7720 Compare December 14, 2021 01:28
@vilterp
Copy link
Contributor Author

vilterp commented Dec 14, 2021

Weird thing is, it works sometimes, even when full GCs happen during the collection and decode phases:

julia> AllocProfile.start(100)

julia> @time HeapSnapshotParser.parse_snapshot("testdata/empty-repl.heapsnapshot")
GC: pause 68.566574ms. collected 213.528655MB. 229376366 allocs total. incr 
GC: pause 67.191154ms. collected 39.250776MB. 45875220 allocs total. incr 
GC: pause 41.633240ms. collected 44.589300MB. 45875218 allocs total. incr 
GC: pause 52.344458ms. collected 42.013328MB. 45875207 allocs total. full 
GC: pause 105.485016ms. collected 79.103016MB. 91751390 allocs total. full 
GC: pause 86.088875ms. collected 43.294896MB. 45875392 allocs total. incr 
GC: pause 21.218635ms. collected 42.798672MB. 45875768 allocs total. incr 
GC: pause 20.006157ms. collected 44.481792MB. 45876888 allocs total. incr 
GC: pause 22.843812ms. collected 44.484160MB. 45876912 allocs total. incr 
GC: pause 22.722307ms. collected 41.068672MB. 45876632 allocs total. incr 
GC: pause 24.197558ms. collected 37.902328MB. 45875215 allocs total. incr 
GC: pause 20.181365ms. collected 40.804064MB. 45875208 allocs total. incr 
  1.416460 seconds (4.32 M allocations: 372.909 MiB, 22.79% gc time, 13.22% compilation time)
HeapSnapshot

julia> results = @time AllocProfile.stop()
GC: pause 29.494241ms. collected 38.331040MB. 45876416 allocs total. incr 
GC: pause 10.257597ms. collected 30.266950MB. 45875204 allocs total. incr 
GC: pause 9.489497ms. collected 21.904032MB. 46197488 allocs total. incr 
GC: pause 24.214812ms. collected 25.502704MB. 53997360 allocs total. full 
GC: pause 85.942493ms. collected 55.266176MB. 54010624 allocs total. full 
GC: pause 82.195946ms. collected 26.697176MB. 54003344 allocs total. full 
GC: pause 91.935260ms. collected 67.363824MB. 134991008 allocs total. full 
GC: pause 115.030000ms. collected 156.418160MB. 337481136 allocs total. full 
GC: pause 177.543380ms. collected 340.834528MB. 843697408 allocs total. full 
  3.283125 seconds (15.71 M allocations: 1.746 GiB, 18.17% gc time, 31.47% compilation time)
AllocResults

@timholy
Copy link
Member

timholy commented Dec 14, 2021

I haven't looked at the code, but it seems like you'd want to record the data at the moment of allocation, in which case

things are getting GC'd out from under us

doesn't seem like it would be a concern. But again, I haven't looked at this due to time constraints..

@vilterp
Copy link
Contributor Author

vilterp commented Dec 14, 2021

Thanks Tim. We are recording backtraces at the point of allocation. When the backtraces are solely composed of instruction pointers, everything seems to work fine. But when they also include pointers to objects on the heap, it seems those objects are sometimes garbage collected before we decode the stack traces.

We think these are objects associated with the interpretation (as opposed to JIT) mode. Stack frames for interpreted functions leave pointers in the stack trace to objects like CodeInfo. I believe the problem occurs when these objects are GC'd.

@timholy
Copy link
Member

timholy commented Dec 14, 2021

👍 You should be able to serialize those in some more permanent form at the time of collection, possibly under a GC.@preserve or something.

@vilterp
Copy link
Contributor Author

vilterp commented Dec 14, 2021

Yeah, it does seem like we either need to either:

  1. keep all objects pointed to by stack traces rooted til decoding time, by pointing at the buffer from a thread or something, or
  2. decode/serialize more during the recording phase, so we don't have to worry about them.

2 seems nice, but we were hoping to get away with keeping most of the decoding logic in Julia at the end, so we can use functions in stacktraces.jl and error.jl like _reformat_bt, lookup, etc. Maybe doing a bit of decoding in C++ isn't too bad though 🤔 I'll take a look tomorrow.

@NHDaly
Copy link
Member

NHDaly commented Dec 14, 2021

Ah, bummer! Makes sense though! Thanks @timholy for chiming in. That does make sense now that you explain it, @vilterp and @timholy. thanks!

Yeah, i'd be probably fine with either approach.

For 1., we were hoping to avoid allocating any GC-tracked objects during profiling, since this is itself a GC allocations profiler, which is why we didn't want to create any julia-managed objects to point at them...

You could maybe address that by pre-allocating a global Julia vector, @vilterp, and then stuffing the things you need to track into those? That way there's still no allocations? The only weirdness is that Base.gc_live_bytes() would technically still go up, since the vector would may end up growing capacity, but i doubt that anyone will be tracking things that precisely? 🤔

Doing option 2. could work too, yeah. Very interesting (and annoying) dilemma. 👍 Thanks!

@vilterp
Copy link
Contributor Author

vilterp commented Dec 16, 2021

Update, since calling _reformat_bt(bt, bt2) has caused so much weirdness (illegal instruction errors, nonsensical type errors, etc): We're going to mimic the CPU profiler and just skip _reformat_bt altogether, instead just calling lookup(::Ptr{CVoid}), which appears to return a special "unknown" stack frame for the extended entries containing interpreter frames:

isempty(infos) && return [StackFrame(empty_sym, empty_sym, -1, nothing, true, false, pointer)] # this is equal to UNKNOWN

Here's where the CPU profiler calls it:

st = lookup(convert(Ptr{Cvoid}, ip))

If it's good enough for the CPU profiler, it's good enough for us! Also, since we'll no longer need to keep the heap-allocated interpreter frame objects rooted to avoid them being GC'd before we decode the stack trace buffer, we can move the buffer back from Julia arrays into C++, which allows the allocation profiler to not allocate anymore. This seems like a sensible thing.

We'll still decode the buffers in a Julia stdlib module and return Julia structs, which can then be used by downstream libraries like PProf, or e.g. a simple JSON serializer.

Filed #43796 for a more comprehensive fix for this.

@vilterp vilterp force-pushed the pv-alloc-profile-flat-stacks branch from 439f06b to 1474097 Compare December 18, 2021 04:16
@vilterp
Copy link
Contributor Author

vilterp commented Dec 21, 2021

Split off GC logging in to #43511

NHDaly added a commit to JuliaPerf/PProf.jl that referenced this pull request Jan 4, 2022
Support for visualizing the results from the allocations profiler in
draft PR:
JuliaLang/julia#42768.

This was basically copy/pasted from
https://github.com/vilterp/AllocProfileParser.jl.
NHDaly added a commit to JuliaPerf/PProf.jl that referenced this pull request Jan 4, 2022
Support for visualizing the results from the allocations profiler in
draft PR:
JuliaLang/julia#42768.

This was basically copy/pasted from
https://github.com/vilterp/AllocProfileParser.jl.
NHDaly added a commit to JuliaPerf/PProf.jl that referenced this pull request Jan 4, 2022
Support for visualizing the results from the allocations profiler in
draft PR:
JuliaLang/julia#42768.

This was basically copy/pasted from
https://github.com/vilterp/AllocProfileParser.jl.
@vchuravy vchuravy deleted the pv-alloc-profile-flat-stacks branch January 20, 2022 02:57
@vilterp
Copy link
Contributor Author

vilterp commented Jan 20, 2022

Error:

/cache/build/default-amdci5-3/julialang/julia-master/src/gc-alloc-profiler.cpp:123:56: error: Calling potential safepoint from function annotated JL_NOTSAFEPOINT [julia.GCChecker]
    auto& profile = global_profile.per_thread_profiles[jl_threadid()];
                                                       ^~~~~~~~~~~~~

@DilumAluthge
Copy link
Member

If it's going to take some time to fix analyzegc, we can revert this PR, and reland it later once you have the fix. That way, you don't have to rush to put the fix together.

DilumAluthge added a commit that referenced this pull request Jan 20, 2022
@jpsamaroo
Copy link
Member

I'm not sure how accessing the current task's TID is considered a safepoint? I think it might need a JL_NOTSAFEPOINT annotation?

@DilumAluthge
Copy link
Member

If it's going to take some time to fix analyzegc, we can revert this PR, and reland it later once you have the fix. That way, you don't have to rush to put the fix together.

#43872

@Keno
Copy link
Member

Keno commented Jan 20, 2022

I'm not sure how accessing the current task's TID is considered a safepoint? I think it might need a JL_NOTSAFEPOINT annotation?

We follow a "guilty until declared innocent" approach here :).

@vtjnash
Copy link
Member

vtjnash commented Jan 20, 2022

You generally want jl_current_task->tid anyways

@vilterp
Copy link
Contributor Author

vilterp commented Jan 20, 2022

Thanks @vtjnash, trying that here: #43875

@NHDaly
Copy link
Member

NHDaly commented Jan 20, 2022

Ah, thanks, it was the extra failure in the buildkite job. Sorry I didn't do a good enough job comparing that with the existing failures. Thanks for catching this, @Keno! 🙏

@vilterp
Copy link
Contributor Author

vilterp commented Jan 20, 2022

k, the fix has passed analyzegc; waiting for the rest of CI…

NHDaly added a commit to JuliaPerf/PProf.jl that referenced this pull request Jan 25, 2022
…er (#46)

* Add support for the Allocs profiles produced by Julia's Allocs Profiler

Support for visualizing the results from the allocations profiler in
draft PR:
JuliaLang/julia#42768.

This was basically copy/pasted from
https://github.com/vilterp/AllocProfileParser.jl.

* Rename to PProf.Allocs.pprof()

* Fix minimum julia version based on merge-date

* Add tests for PProf.Allocs.pprof(); fix typo

Co-authored-by: Pete Vilter <[email protected]>
Co-authored-by: Pete Vilter <[email protected]>
Co-authored-by: Valentin Churavy <[email protected]>

jl_raw_backtrace_t get_raw_backtrace() {
// A single large buffer to record backtraces onto
static jl_bt_element_t static_bt_data[JL_MAX_BT_SIZE];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not thread safe. You should be (re)using the existing buffer in the ptls to be safe here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. Yeah, we need a buffer per thread…

Did you mean to (re)use this buffer? https://github.com/vilterp/julia/blob/c12aca890a8ae387d11db0b54351f8b61305c00b/src/julia_threads.h#L247

It seems like we should add our own (here or as a global) to avoid colliding with that… We've shied away from adding stuff to ptls to avoid breaking stuff or adding overhead, but maybe it's better than globals?

Funny our multithreaded test didn't catch this; maybe the stack traces were just garbage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as you avoid re-using this buffer in the interval between an error filling this buffer, and then allocating the exception stack, I think you should be okay here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes me a little nervous to reuse the buffer for two things, but it is space-efficient! I guess it should be fine like you say…

filed #44099 to track this; we'll write up a PR soon

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backtrace would be essentially the same, just a bit longer perhaps

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vtjnash for pointing this out! 😅 haha it feels pretty silly that we missed that. Thanks for the help!!

The suggestion to use a buffer in ptls makes sense. I agree with @vilterp that sharing that buffer makes me slightly nervous, but reading through your description that sounds totally fine 👍 cool, thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the CPU profiler use that buffer at all? If someone is running CPU profiling and Allocs profiling at the same time, is there a chance it'll interrupt our thread while we were in the middle of using the buffer and we'll get any problems?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Profiling using a different buffer entirely

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect, thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question: is it okay to write over this buffer, since it seems like it's used to scan for roots?:

julia/src/task.c

Lines 344 to 345 in 7ccc83e

// storing bt_size in ptls ensures roots in bt_data will be found
ptls->bt_size = rec_backtrace(ptls->bt_data, JL_MAX_BT_SIZE, skip + 1);

julia/src/task.c

Lines 618 to 623 in 7ccc83e

// The temporary ptls->bt_data is rooted by special purpose code in the
// GC. This exists only for the purpose of preserving bt_data until we
// set ptls->bt_size=0 below.
jl_push_excstack(&ct->excstack, exception,
ptls->bt_data, ptls->bt_size);
ptls->bt_size = 0;

I think that it's fine, from what i can read, but just want to double check one more time.

Also, i've opened a PR for this, here: #44114

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: JuliaLang#33467
Complementary to garbage profiler PR: JuliaLang#42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from JuliaLang#33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@Profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <[email protected]>
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
## Overview

Record the type and stack of every allocation (or only at a given sample interval), and return as Julia objects.

Alternate approach to existing alloc profiler PR: JuliaLang#33467
Complementary to garbage profiler PR: JuliaLang#42658 (maybe there's some nice way to meld them)

This may be reinventing the wheel from JuliaLang#33467, but I'm not sure why that one needs stuff like LLVM passes. I mimicked some stuff from it, but this was my attempt to get something up and running. Could easily be missing stuff.

## Usage:

```julia
using Profile.Allocs
res = Allocs.@Profile sample_rate=0.001 my_func()
prof = Allocs.fetch()
# do something with `prof`
```

See also: JuliaPerf/PProf.jl#46 for support for visualizing these.

Co-authored-by: Nathan Daly <[email protected]>
vtjnash added a commit that referenced this pull request Nov 1, 2023
…ile.print

Defines a converter for Allocs to the types implemented by Profile for
printing, allowing flamegraph report generation of allocations with just
the stdlib tooling.

Refs: #42768
vtjnash added a commit that referenced this pull request Nov 1, 2023
…ile.print

Defines a converter for Allocs to the types implemented by Profile for
printing, allowing flamegraph report generation of allocations with just
the stdlib tooling.

Make these methods public as well, so that users can access the
documentation without getting warnings.

Refs: #42768
vtjnash added a commit that referenced this pull request Nov 7, 2023
…ile.print (#51981)

Defines a converter for Allocs to the types implemented by Profile for
printing, allowing flamegraph report generation of allocations with just
the stdlib tooling.

Refs: #42768

For old versions of Julia, you could previously get download a copy of
this code that could be included as a polyfill to add this:
https://gist.github.com/vtjnash/c6aa4db9dafccb0fd28a65f87d6b1adb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GC Garbage collector profiler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants