allocation profiler: get stacks for pool allocs via wrapper function #43868
Conversation
Discussion carried over from vilterp#17:
Upon discussion, we don't think this is doable now, until we switch to OrcJIT v2 and have swappable function stub support available.
alloc profiler: capture `managed_malloc` and `managed_realloc`
Make an UnknownType for the GC Profiler to handle these unknown types.
Thanks for the info @vtjnash. I think I see how OrcJIT v2 would make things more elegant. To make sure I understand:
What's the expected timeline for the OrcJIT v2 migration? Is there an issue we can follow? We'd like to merge something in the meantime, if possible… Perhaps we could get rid of the wrapper function either by
Does either of these approaches seem viable? Thanks.
Yeah, that would be the goal. It might even not be too hard to create a prototype for that, and it would be needed for an alternative backend to the probe points in #43453.
Yes, this is in the performance-sensitive part of the allocator (and Jameson mentioned in offline discussion that we would like to inline the
Might work, but I would like to see a performance study of that; i.e., we would like to avoid using an additional register xD
That could actually work, but it is probably a lot less useful.
Cool, thanks for the explanation! Yeah, those concerns make sense. 👍 Definitely happy to work through the perf concerns. 👍 👍
Note from discussion w/ @JeffBezanson and @NHDaly: We're going to try the last approach I proposed:
In the meantime, before OrcJIT v2, we'd rather have 100% of stacks, even without types, than miss 40% of allocations altogether. While types give a good overview of the workload, stacks give the essential info for drilling down and optimizing. PR coming soon. Edit: we don't want to give up on types just yet, so we're trying an approach (diagrammed above) that instruments where we still have types when possible, using (inlined) wrapper functions.
Working on an allocations benchmark to try to capture the perf overhead of this (and other approaches we might try). Initial results haven't shown a perf impact, although it may not be a sufficiently good benchmark: JuliaCI/BaseBenchmarks.jl#291 (comment)
* add a wrapper and inline into it
* review suggestions

Co-authored-by: Nathan Daly <[email protected]>
Btw, this appears to be the spot in the Go memory allocator where they sample the stack trace if allocation profiling is enabled: https://github.com/golang/go/blob/c27a3592aec5f46ae18f7fd3d9ba18e69e2f1dcb/src/runtime/malloc.go#L1166-L1173 Nothing fancy (no compile-time flag), just an if statement (that gets branch-predicted), like we're doing here.
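For context, the Go check linked above boils down to counting allocated bytes down toward the next sample. Below is a hypothetical, simplified sketch of that idea (the names `sample_rate`, `next_sample`, and `profile_sample` are illustrative, not Go's or Julia's actual identifiers); most allocations take the cheap, well-predicted "not yet" path.

```c
#include <stddef.h>

/* Hypothetical sketch of a byte-countdown sampling check, in the spirit
 * of Go's runtime.mallocgc. Not real Go or Julia source. */

static long sample_rate = 0;    /* 0 means profiling is disabled */
static long next_sample = 0;    /* bytes remaining until the next sampled alloc */
static long sampled_count = 0;  /* how many allocations we have sampled */

/* Stand-in for capturing a stack trace for this allocation. */
static void profile_sample(size_t size) {
    (void)size;
    sampled_count++;
}

/* Called on every allocation; the hot path is one load and one branch. */
void on_alloc(size_t size) {
    if (sample_rate > 0) {                  /* cheap, branch-predicted check */
        next_sample -= (long)size;
        if (next_sample <= 0) {             /* cold path: time to sample */
            profile_sample(size);
            next_sample = sample_rate;      /* Go draws a randomized interval */
        }
    }
}
```

With `sample_rate == 0` the whole body is one predicted-false branch, which is why this costs essentially nothing when profiling is off.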
@nanosoldier
I'm not sure if I have the permissions to kick off nanosoldier either... I haven't had great luck with it in the past. But I'll give it a shot:
@nanosoldier
Okay, I should have permissions now (thanks @vchuravy and @oscardssmith). @nanosoldier
Your benchmark job has completed - no performance regressions were detected. A full report can be found here.
Okay! :) Nanosoldier has run and showed no regressions! In fact, it showed that the allocations benchmark was slightly faster with this PR, which is of course nonsense, but it means this change is in the noise, as we predicted. Thanks for pushing for the benchmarking; it makes me feel much better knowing we've measured the impact carefully! 😊 @vtjnash and @vchuravy, can you give a final review now? Thanks very much!
LGTM. Thanks @vilterp for pushing this forward. This is a nice, clean implementation. 👍
(I co-authored though, of course, so we still need another review.)
Thanks!
@vtjnash: We talked offline on Slack last week and you said you weren't sure if you'd be able to get to reviewing this. I'm going to merge it now, since we've had reviews from Valentin and talked it through with you in person! Please still feel free to leave any review feedback you have, and we can always get to it in follow-up PRs. ❤️ Thanks again for your help! 😊
Instrument jl_gc_big_alloc() when called from generated code for Allocations Profiler

Add _maybe_record_alloc_to_profile() call into jl_gc_big_alloc() to allow us to instrument it in the Allocations Profiler.

Followup to #43868. Finishes fixing #43688 (gets the stacks, though not the types, of big allocs).

-----

- Rename jl_gc_big_alloc to jl_gc_big_alloc_inner and make it inlined
- Add jl_gc_big_alloc_noinline, which calls jl_gc_big_alloc_inner
- Replace all internal calls to jl_gc_big_alloc with calls to it instead
- Add a new jl_gc_big_alloc that calls jl_gc_big_alloc_inner
  - This one is instrumented and is meant to be called by generated code.
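The three-function split described in the commit message can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the actual Julia GC source: the names are shortened, `malloc` stands in for the real big-object allocator, and a byte counter stands in for `_maybe_record_alloc_to_profile()`. The point is the shape: one inlined fast path, a noinline shim for internal callers (so the large inlined body is not duplicated at every internal call site), and an instrumented entry point for generated code.

```c
#include <stdlib.h>

/* Hypothetical sketch of the inner/noinline/instrumented split.
 * Simplified names; not the actual Julia runtime source. */

static int alloc_profile_enabled = 0;  /* toggled when profiling starts/stops */
static size_t profiled_bytes = 0;      /* stand-in for recording stack + size */

/* The allocation fast path, inlined into both entry points below. */
static inline void *big_alloc_inner(size_t sz) {
    return malloc(sz);  /* stand-in for the real big-object allocator */
}

/* For internal runtime callers: uninstrumented, and kept out-of-line so
 * the inlined body is not duplicated at every internal call site. */
__attribute__((noinline)) void *big_alloc_noinline(size_t sz) {
    return big_alloc_inner(sz);
}

/* For generated code: the same allocation path plus one cold branch. */
void *big_alloc_instrumented(size_t sz) {
    void *v = big_alloc_inner(sz);
    if (alloc_profile_enabled)   /* single, normally-not-taken branch */
        profiled_bytes += sz;    /* stand-in for _maybe_record_alloc_to_profile */
    return v;
}
```

(`__attribute__((noinline))` is the GCC/Clang spelling; the real code would use the runtime's own inlining macros.)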
Followup to #42768
Fixes #43688 (gets the stacks, though not the types, of pool allocs)
Add a wrapper around jl_gc_pool_alloc to allow us to instrument it.
Before
*: instrumented; red: currently missed
(graphviz diagram)
After
*: instrumented; red: currently missed; green: newly introduced function
(graphviz diagram)
Note that from the callers' perspective, the code is unchanged. The only difference is that one call path now has a single additional (cold) branch, which can be enabled for profiling:
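A minimal sketch of that caller-visible contract (hypothetical, simplified names; `malloc` stands in for the real pool allocator and a counter stands in for recording the stack trace):

```c
#include <stdlib.h>

/* Hypothetical sketch of the pool-alloc wrapper. The wrapper allocates
 * exactly as before; the only addition is one cold branch that is
 * toggled at runtime when allocation profiling starts. */

static int g_alloc_profile_enabled = 0;  /* flipped by start/stop profiling */
static long g_recorded_allocs = 0;       /* stand-in for captured stack traces */

/* The original allocation fast path, inlined into the wrapper. */
static inline void *pool_alloc_inner(size_t sz) {
    return malloc(sz);  /* stand-in for the real pool allocator */
}

/* What callers (including generated code) now invoke. */
void *pool_alloc(size_t sz) {
    void *v = pool_alloc_inner(sz);
    if (g_alloc_profile_enabled)  /* the single additional cold branch */
        g_recorded_allocs++;      /* stand-in for recording the stack trace */
    return v;
}
```

With profiling off, callers pay only a load and a predicted-not-taken branch; flipping `g_alloc_profile_enabled` (a plain int here for simplicity) turns recording on without recompiling anything.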