make GC counters thread-local #32217
Conversation
This seems slightly problematic, because the contention on these variables is likely to be high. Can we do something like keep thread-local counters and every n MB of allocation atomically update the global counters?
Agreed. Making GC counters thread-local has been on my list. Some of the counters also seem slightly unnecessary, e.g. the number of
Yes, the counter has to be thread-local. Also, I believe that for any reasonable allocation-measurement experience, the global one needs to be updated when someone queries gc_num.
Yes, of course. I'm thinking we can return a copy of
I can't tell if this has already been addressed in the above comments, so apologies for the noise if it has. There seems to be some sort of race condition in this patch. If I run

```julia
@time Threads.@threads for _ in 1:Threads.nthreads()
    zeros(10^8)
end
```

with

In case this is useful, here's a snapshot of where the threads are according to gdb (with
Thanks for that summary of the backtraces. Very convenient presentation of the info!
(force-pushed from 5923179 to 38eb516)
OK, I pushed the thread-local version. One interesting question is what the exact intervals should be between atomic updates of the global counter. I tried to use a different interval on each thread, but perhaps it's better just to pick a constant.
I think the best way is to make a guess of the next interval per thread during GC, and once that's triggered we can do a sync and decide what to do.
Random?
```c
static inline int maybe_collect(jl_ptls_t ptls)
{
    if (should_collect() || gc_debug_check_other()) {
        int should_collect = 0;
        if (ptls->gc_num.allocd >= 0) {
```
This function needs to use an atomic relaxed load.
Why?
Hmm, the line number shows up really weird... The comment was for `combine_thread_gc_counts`.

(Actually, to avoid UB, all accesses of these outside of the collection phase should be atomic.)
src/gc.c
```c
int64_t intvl = per_thread_counter_interval(ptls->tid, gc_num.interval);
size_t localbytes = ptls->gc_num.allocd + intvl;
ptls->gc_num.allocd = -intvl;
jl_atomic_fetch_add(&gc_num.allocd, localbytes);
```
`jl_atomic_fetch_add` should return the resulting value, which is the right thing to use for the comparison on the next line.
I think that's `add_fetch`? But yes, we should use it.
Does that basically mean: keep a per-thread allocation count and a per-thread interval, and use the amount allocated by a thread since the last collection as its next interval value?
Note: this needs #32238 first.
Yes. I think the logic in the allocation path should be as simple as possible. I feel like the worst case for this is when the allocation pattern changes a lot between threads, but even in that case I think it shouldn't be too bad. We'll certainly put some limit on it. Another thing is that I think we can limit the sync to page changes, since that's the slower path anyway.
(force-pushed from 38eb516 to 62a6e4e)

(force-pushed from 62a6e4e to a02f1b3)
In a fascinating turn of events, this PR is the first one to have something caught by the

Is this a false alarm, or an actual issue?
This should fix #31923. I see much less memory growth in alloc-heavy threaded loops.
Also #27173