Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profile.Allocs: Add task and timestamp #44055

Merged
merged 3 commits into from
Feb 12, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion src/gc-alloc-profiler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@ struct jl_raw_alloc_t {
jl_datatype_t *type_address;
jl_raw_backtrace_t backtrace;
size_t size;
jl_task_t *task;
uint64_t timestamp;
};

// == These structs define the global singleton profile buffer that will be used by
Expand Down Expand Up @@ -133,7 +135,9 @@ void _maybe_record_alloc_to_profile(jl_value_t *val, size_t size) JL_NOTSAFEPOIN
profile.allocs.emplace_back(jl_raw_alloc_t{
type,
get_raw_backtrace(),
size
size,
jl_current_task,
cycleclock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to use jl_hrtime aka Base.time_ns? Since the task pointer does not uniquely identify the task (as GC can recycle it), you'd need to correlate the allocation event with the window defined by the creation and finish times of the task to recover the accurate task identity. To join this data with the event data collected on the user-side Julia program (e.g., Dagger), it sounds more useful to use the clock available in Julia. Also, https://en.wikipedia.org/wiki/Time_Stamp_Counter suggests rdtsc used for cycleclock on x86_64 does not seem to work as a global clock with different CPUs.

A bit aside, but I wonder if it's better to have a "true" task ID that is guaranteed be unique within a single process (e.g., prepare const TASKIDS = [1:nthreads();] and then issue TASKIDS[threadid()] += nthreads() in the task constructor in C). Although it's a bit annoying to have 64 more bits if you need it only for profiling...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointers do uniquely identify an object. That is the definition of a pointer

rdtsc is synced by the GC, but we could use the rdtscp version too if we must

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtjnash pointers don't uniquely identify objects that have been freed over a session's entire life, which is something that I'll probably need to deal with eventually.

rdtsc is synced by the GC, but we could use the rdtscp version too if we must

For my understanding, can you point out where this happens in the GC code?

For jl_hrtime, from a quick @btime time_ns(), it seems to take about 1 microsecond, which is not huge, but also not tiny.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I put cycleclock() in there due to the runtime profiler using it as well. Maybe that's OK for that profiler because we get the thread ID that the stack was recorded on? Although if rtdsc is per-CPU core, it can be unreliable as a thread gets scheduled on a different core (which I can account for with BPF probes, but it'd be nice to not have to do that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the case decades ago for rdtsc, but the hardware has long since been corrected

Copy link
Member

@tkf tkf Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting. Does it mean that, with

ref = Threads.Atomic{UInt64}(0)
@spawn ref[] = cycleclock()
@spawn ok = ref[] <= cycleclock()

it is guaranteed that ok === true always if appropriate fences are inserted?1 Also, does it provide a global clock even across CPU sockets?

Anyway, if cycleclock defines a global clock, I agree we can use it for this purpose. But I think we also need to have it at the Julia level so that Julia packages like Dagger can interpret the clock with the event information they collected.

For jl_hrtime, from a quick @btime time_ns(), it seems to take about 1 microsecond, which is not huge, but also not tiny.

Is microsecond a typo? Or is it very system-dependent? It's about 30 ns for me (on amdci2).

Footnotes

  1. https://www.felixcloutier.com/x86/rdtscp and https://www.felixcloutier.com/x86/rdtsc mention when LFENCE and MFENCE are required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's a musl or an AMD thing, but on my system:

6/7> @benchmark time_ns()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.145 μs …   8.388 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.467 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.481 μs ± 206.673 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▇▆█▆
  ▂▁▂▂▂▂▂▂▃▃▃▇▆████▇▇▃▄▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  1.15 μs         Histogram: frequency by time        2.37 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, that's rather slow. So maybe we need jl_cycleclock? It's nice to have a fast "logical" clock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make a final decision on this? I'd like for this to make it into 1.8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, maybe we can merge this as-is and consider changing the clock source later?

});
}

Expand Down
8 changes: 7 additions & 1 deletion stdlib/Profile/src/Allocs.jl
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ struct RawAlloc
type::Ptr{Type}
backtrace::RawBacktrace
size::Csize_t
task::Ptr{Cvoid}
timestamp::UInt64
end

# matches jl_profile_allocs_raw_results_t on the C side
Expand Down Expand Up @@ -147,6 +149,8 @@ struct Alloc
type::Any
stacktrace::StackTrace
size::Int
task::Ptr{Cvoid}
timestamp::UInt64
end

struct AllocResults
Expand Down Expand Up @@ -180,7 +184,9 @@ function decode_alloc(cache::BacktraceCache, raw_alloc::RawAlloc)::Alloc
Alloc(
load_type(raw_alloc.type),
stacktrace_memoized(cache, load_backtrace(raw_alloc.backtrace)),
UInt(raw_alloc.size)
UInt(raw_alloc.size),
raw_alloc.task,
raw_alloc.timestamp
)
end

Expand Down