
make GC counters thread-local #32217

Merged: 1 commit, Jun 11, 2019
Conversation

JeffBezanson (Member) commented Jun 1, 2019

This should fix #31923. I see much less memory growth in alloc-heavy threaded loops.

Also #27173

JeffBezanson added labels on Jun 1, 2019: multithreading (Base.Threads and related functionality), GC (Garbage collector), bugfix (This change fixes an existing bug)
Keno (Member) commented Jun 1, 2019

This seems slightly problematic, because the contention on these variables is likely to be high. Can we do something like keep thread-local counters and every n MB of allocation atomically update the global counters?
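The scheme Keno suggests can be sketched as follows: each thread accumulates allocations in a private counter and only touches the shared global counter once a flush threshold is crossed, so the contended atomic operation happens once per several MB instead of once per allocation. All names here (`FLUSH_THRESHOLD`, `thread_allocd`, `global_allocd`, `note_alloc`) are hypothetical illustrations, not the actual Julia runtime API.

```c
#include <stdatomic.h>
#include <stddef.h>

#define FLUSH_THRESHOLD (4 * 1024 * 1024)     /* flush every ~4 MB (illustrative) */

static _Atomic long global_allocd = 0;        /* shared, touched rarely */
static _Thread_local long thread_allocd = 0;  /* private, uncontended */

void note_alloc(size_t sz)
{
    thread_allocd += (long)sz;                /* cheap thread-local add */
    if (thread_allocd >= FLUSH_THRESHOLD) {
        /* one atomic RMW per ~4 MB instead of one per allocation */
        atomic_fetch_add(&global_allocd, thread_allocd);
        thread_allocd = 0;
    }
}
```

The trade-off is that the global counter lags behind the true total by up to `nthreads * FLUSH_THRESHOLD` bytes, which is why the query path later in this thread has to fold the per-thread remainders back in.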

JeffBezanson (Member, Author) commented:

Agreed. Making gc counters thread-local has been on my list. Some of the counters also seem slightly unnecessary, e.g. the number of malloc calls.

yuyichao (Contributor) left a comment:

Yes, the counter has to be thread-local. Also, I believe that for any reasonable allocation-measurement experience, the global counter needs to be updated whenever someone queries gc_num.

JeffBezanson (Member, Author) commented:

> for any reasonable allocation measurement experience, the global one needs to be updated when someone query gc_num

Yes, of course. I'm thinking we can return a copy of gc_num with all the per-thread counters added in, just with whatever snapshot the calling thread happens to see. Do you think that's good enough?
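The "snapshot" query Jeff describes might look like the sketch below: copy the flushed global totals, then fold in whatever value of each thread's private counter the calling thread happens to observe. The struct and names (`gc_num_t`, `NTHREADS`, `per_thread_allocd`, `snapshot_gc_num`) are illustrative, not the real src/gc.c definitions.

```c
#include <stdatomic.h>

#define NTHREADS 4  /* fixed here for illustration */

typedef struct { long allocd; } gc_num_t;

static gc_num_t global_num;                       /* flushed totals */
static _Atomic long per_thread_allocd[NTHREADS];  /* unflushed remainders */

gc_num_t snapshot_gc_num(void)
{
    gc_num_t copy = global_num;
    for (int t = 0; t < NTHREADS; t++) {
        /* relaxed load: any recent value is fine, no ordering needed */
        copy.allocd += atomic_load_explicit(&per_thread_allocd[t],
                                            memory_order_relaxed);
    }
    return copy;  /* whatever snapshot the calling thread happens to see */
}
```

The result is approximate while other threads keep allocating, but it is never stale by more than the unflushed per-thread remainders, which matches the "good enough" question being asked here.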

@0 (Contributor) commented Jun 3, 2019

I can't tell if this has already been addressed in the above comments, so apologies for the noise if it has. There seems to be some sort of race condition in this patch. If I run

@time Threads.@threads for _ in 1:Threads.nthreads()
    zeros(10^8)
end

with JULIA_NUM_THREADS=16 on commit 5923179, it usually finishes in about 5 seconds. Occasionally, though, it gets stuck (possibly indefinitely, but at least for several hours) with an arbitrary fraction of the threads at 100% CPU utilization. I can't reproduce this on commit 123ff48 (the parent of 5923179).

In case this is useful, here's a snapshot of where the threads are according to gdb (with OPENBLAS_NUM_THREADS=1 so there are fewer irrelevant threads):

Top of stack            Topmost Julia function      Count
__GI___sigtimedwait     signal_listener             1
__GI_epoll_pwait        jl_task_get_next            1
jl_safepoint_wait_gc    jl_safepoint_wait_gc        10
_mm_pause               jl_safepoint_wait_gc        4
_mm_pause               jl_gc_wait_for_the_world    1

JeffBezanson (Member, Author) commented:

Thanks for that summary of the backtraces. Very convenient presentation of the info!

@JeffBezanson JeffBezanson changed the title make GC counters atomic make GC counters thread-local Jun 4, 2019
JeffBezanson (Member, Author) commented:

OK, I pushed the thread-local version. One interesting question is what the exact intervals should be between atomic updates of the global counter. I tried to use a different interval on each thread, but perhaps it's better just to pick a constant.

yuyichao (Contributor) commented Jun 4, 2019

I think the best way is to make a guess at the next interval per thread during GC, and once that's triggered we can do a sync and decide what to do.

StefanKarpinski (Member) commented:

Random?

static inline int maybe_collect(jl_ptls_t ptls)
{
    if (should_collect() || gc_debug_check_other()) {
        int should_collect = 0;
        if (ptls->gc_num.allocd >= 0) {
yuyichao (Contributor) commented:

This function needs to use an atomic relaxed load.

JeffBezanson (Member, Author) commented:

Why?

yuyichao (Contributor) commented:

Hmm, the line numbers show up really weirdly... The comment was meant for combine_thread_gc_counts.

(Actually, to avoid UB, all accesses to these outside of the collection phase should be atomic.)
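The UB point can be made concrete in C11 terms: a plain read of a counter that another thread writes concurrently is a data race, which is undefined behavior even if the hardware load happens to be atomic. Declaring the variable `_Atomic` and using relaxed operations makes the access well-defined without paying for any memory ordering. The names `bump` and `peek` below are illustrative, not runtime functions.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic int64_t allocd;   /* written by the owner thread, read by others */

/* owner thread bumps its counter */
void bump(int64_t n)
{
    /* relaxed RMW: only atomicity is needed, no ordering */
    atomic_fetch_add_explicit(&allocd, n, memory_order_relaxed);
}

/* any thread may observe the counter without a data race */
int64_t peek(void)
{
    return atomic_load_explicit(&allocd, memory_order_relaxed);
}
```

On mainstream hardware a relaxed load of an aligned 64-bit word compiles to an ordinary load, so the cost of being UB-free here is essentially zero; the point is correctness at the language level.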

src/gc.c Outdated
int64_t intvl = per_thread_counter_interval(ptls->tid, gc_num.interval);
size_t localbytes = ptls->gc_num.allocd + intvl;
ptls->gc_num.allocd = -intvl;
jl_atomic_fetch_add(&gc_num.allocd, localbytes);
Keno (Member) commented:

jl_atomic_fetch_add should return the resulting value which is the right thing to use for the comparison on the next line.

JeffBezanson (Member, Author) commented:

I think that's add_fetch? But yes, we should use it.
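The distinction being discussed, in C11 terms: `atomic_fetch_add` returns the value *before* the addition, whereas what "add_fetch" names (as in GCC's `__atomic_add_fetch` builtin) is the value *after* it. Given only fetch-add, the post-add value is the returned old value plus the addend, sketched here with an illustrative `add_then_fetch` helper:

```c
#include <stdatomic.h>

static _Atomic long counter = 0;

long add_then_fetch(long n)
{
    /* atomic_fetch_add yields the pre-add value... */
    long old = atomic_fetch_add(&counter, n);
    /* ...so adding n back gives the post-add value, which is
     * what a threshold comparison on the updated total needs */
    return old + n;
}
```

Using the returned value rather than re-reading the counter is what makes the subsequent comparison race-free: another thread may have bumped the counter again between the RMW and a separate load.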

JeffBezanson (Member, Author) commented:

> I think the best way is to make a guess of the next interval per thread during gc and once that's triggered we can do a sync and decide what to do

Does that basically mean: keep a per-thread allocation count and a per-thread interval, and use the amount allocated by a thread since the last collection as its next interval value?

JeffBezanson (Member, Author) commented:

Note: this needs #32238 first.

yuyichao (Contributor) commented Jun 7, 2019

> Does that basically mean: keep a per-thread allocation count and a per-thread interval, and use the amount allocated by a thread since the last collection as its next interval value?

Yes. I think the logic in the allocation path should be as simple as possible. I feel like the worst case for this is when the allocation pattern changes a lot between threads, but even in that case I think it shouldn't be too bad. We'll certainly put some limit on it.

Another thing is that I think we can limit the sync to page changes, since that's the slower path anyway.
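The heuristic agreed on above might be sketched like this: at each collection, guess the next per-thread interval from how much that thread allocated since the last collection, clamped to some bounds (the "limit" yuyichao mentions). The function name and both constants are hypothetical, not values from src/gc.c.

```c
#include <stdint.h>

#define MIN_INTERVAL ((int64_t)1  * 1024 * 1024)   /* illustrative 1 MB floor */
#define MAX_INTERVAL ((int64_t)64 * 1024 * 1024)   /* illustrative 64 MB ceiling */

int64_t next_interval(int64_t allocd_since_last_gc)
{
    /* assume the recent past predicts the near future */
    int64_t intvl = allocd_since_last_gc;
    if (intvl < MIN_INTERVAL) intvl = MIN_INTERVAL;  /* avoid constant syncs */
    if (intvl > MAX_INTERVAL) intvl = MAX_INTERVAL;  /* avoid drifting too far */
    return intvl;
}
```

A thread that allocates heavily gets a large interval and syncs rarely; a mostly idle thread gets the floor, so a sudden burst of allocation on it still triggers a sync reasonably soon.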

@JeffBezanson JeffBezanson merged commit 5335a94 into master Jun 11, 2019
@JeffBezanson JeffBezanson deleted the jb/atomicgccount branch June 11, 2019 02:33
staticfloat (Member) commented:

In a fascinating turn of events, this PR is the first one to have something caught by the analyzegc run! Unfortunately, this PR was too old to have the new LLVM version that would actually error out on finding an issue, but hopefully this is simple enough:

/buildworker/worker/analyzegc_linux64/build/src/gc.c:1023:5: note: Calling potential safepoint from function annotated JL_NOTSAFEPOINT
    combine_thread_gc_counts(&gc_num);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.

Is this a false alarm, or an actual issue?

Successfully merging this pull request may close these issues.

Large memory leak when using threads