-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profile.Allocs: Add task and timestamp #44055
Conversation
One thing I'm not sure about is whether embedding the task as |
Yeah I think this is incorrect, don't recall the precise code but I think the allocs buffer is not GC integrated and can thus not contain mutable structs. @vilterp correct me if I am wrong |
Ok latest push switches to |
Yep @vchuravy is right; nothing retains those task objects. |
size | ||
size, | ||
jl_current_task, | ||
cycleclock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need to use jl_hrtime
aka Base.time_ns
? Since the task pointer does not uniquely identify the task (as GC can recycle it), you'd need to correlate the allocation event with the window defined by the creation and finish times of the task to recover the accurate task identity. To join this data with the event data collected on the user-side Julia program (e.g., Dagger), it sounds more useful to use the clock available in Julia. Also, https://en.wikipedia.org/wiki/Time_Stamp_Counter suggests rdtsc
used for cycleclock
on x86_64 does not seem to work as a global clock with different CPUs.
A bit aside, but I wonder if it's better to have a "true" task ID that is guaranteed be unique within a single process (e.g., prepare const TASKIDS = [1:nthreads();]
and then issue TASKIDS[threadid()] += nthreads()
in the task constructor in C). Although it's a bit annoying to have 64 more bits if you need it only for profiling...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pointers do uniquely identify an object. That is the definition of a pointer
rdtsc is synced by the GC, but we could use the rdtscp version too if we must
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vtjnash pointers don't uniquely identify objects that have been freed over a session's entire life, which is something that I'll probably need to deal with eventually.
rdtsc is synced by the GC, but we could use the rdtscp version too if we must
For my understanding, can you point out where this happens in the GC code?
For jl_hrtime
, from a quick @btime time_ns()
, it seems to take about 1 microsecond, which is not huge, but also not tiny.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I put cycleclock()
in there due to the runtime profiler using it as well. Maybe that's OK for that profiler because we get the thread ID that the stack was recorded on? Although if rtdsc
is per-CPU core, it can be unreliable as a thread gets scheduled on a different core (which I can account for with BPF probes, but it'd be nice to not have to do that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the case decades ago for rdtsc, but the hardware has long since been corrected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting. Does it mean that, with
ref = Threads.Atomic{UInt64}(0)
@spawn ref[] = cycleclock()
@spawn ok = ref[] <= cycleclock()
it is guaranteed that ok === true
always if appropriate fences are inserted?1 Also, does it provide a global clock even across CPU sockets?
Anyway, if cycleclock
defines a global clock, I agree we can use it for this purpose. But I think we also need to have it at the Julia level so that Julia packages like Dagger can interpret the clock with the event information they collected.
For
jl_hrtime
, from a quick@btime time_ns()
, it seems to take about 1 microsecond, which is not huge, but also not tiny.
Is microsecond a typo? Or is it very system-dependent? It's about 30 ns for me (on amdci2).
Footnotes
-
https://www.felixcloutier.com/x86/rdtscp and https://www.felixcloutier.com/x86/rdtsc mention when LFENCE and MFENCE are required. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's a musl or an AMD thing, but on my system:
6/7> @benchmark time_ns()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.145 μs … 8.388 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.467 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.481 μs ± 206.673 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇▆█▆
▂▁▂▂▂▂▂▂▃▃▃▇▆████▇▇▃▄▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.15 μs Histogram: frequency by time 2.37 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, that's rather slow. So maybe we need jl_cycleclock
? It's nice to have a fast "logical" clock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make a final decision on this? I'd like for this to make it into 1.8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, maybe we can merge this as-is and consider changing the clock source later?
This broke analyzegc.
|
This reverts commit bfc9431.
|
…liaLang#44158) This reverts commit bfc9431.
…liaLang#44158) This reverts commit bfc9431.
…liaLang#44158) This reverts commit bfc9431.
This makes it possible to track when and on what task an allocation occurred, similar to what we have for the time-based profiler.
EDIT: The motivation for this change is to support consumers like Dagger.jl, which will want to pair this data with its own collected events (which record nanosecond-resolution timestamps as well as task pointers) and task runtime data from #43453 to create a cohesive view of allocations within a distributed, multithreaded application.