-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rfc] parallel marking #45639
[rfc] parallel marking #45639
Conversation
can we get 32 core benchmarks as well? It would be very unfortunate if this was a regression for cpus with many cores. (it would be fine to limit GC to use 4 if thats all it can use, but negative scaling is very unfortunate) |
(EDIT: see scaling plots below) |
Were these timings made on a version of Julia that includes the changes #45714 that make the GC time report correctly? |
No. Will merge and update that. |
afac580
to
5414eb5
Compare
Something went wrong when running your job:
Logs and partial data can be found here |
@nanosoldier |
@nanosoldier |
This comment was marked as off-topic.
This comment was marked as off-topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I ran the workspace on several examples and it meet my criteria of getting better performance with 2 parallel threads than with our current single threaded solution.
The changes look reasonable to me.
This is a huge disruptive change and I think we need to think long and hard about whether we want parallel marking to be our only GC algorithm or whether we want users to be able to choose at run time.
Were there any regressions? Can you post the before/after? |
fb4da4a
to
04e3187
Compare
@nanosoldier |
## Previous work Since #21590, the GC mark-loop was implemented by keeping two manually managed stacks: one of which contained iterator states used to keep track of the object currently being marked. As an example, to mark arrays, we would pop the corresponding iterator state from the stack, iterate over the array until we found an unmarked reference, and if so, we would update the iterator state (to reflect the index we left off), "repush" it into the stack and proceed with marking the reference we just found. ## This PR This PR eliminates the need of keeping the iterator states by modifying the object graph traversal code. We keep a single stack of `jl_value_t *` currently being processed. To mark an object, we first pop it from the stack, push all unmarked references into the stack and proceed with marking. I believe this doesn't break any invariant from the generational GC. Indeed, the age bits are set after marking (if the object survived one GC cycle it's moved to the old generation), so this new traversal scheme wouldn't change the fact of whether an object had references to old objects or not. Furthermore, we must not update GC metadata for objects in the `remset`, and we ensure this by calling `gc_mark_outrefs` in `gc_queue_remset` with `meta_updated` set to 1. ## Additional advantages 1. There are no recursive function calls in the GC mark-loop code (one of the reasons why #21590 was implemented). 2. Keeping a single GC queue will **greatly** simplify work-stealing in the multi-threaded GC we are working on (c.f. #45639). 3. Arrays of references, for example, are now marked on a regular stride fashion, which could help with hardware prefetching. 4. We can easily modify the traversal mode (to breadth first, for example) by only changing the `jl_gc_markqueue_t`(from LIFO to FIFO, for example) methods without touching the mark-loop itself, which could enable further exploration on the GC in the future. Since this PR changes the mark-loop graph traversal code, there are some changes in the heap-snapshot, though I'm not familiar with that PR. Some benchmark results are here: https://hackmd.io/@Idnmfpb3SxK98-OsBtRD5A/H1V6QSzvs.
What is the status of this now that #21590 was merged? |
I think it's not ready to merge/review yet. It's still causing regressions on some of the GCBenchmarks and I'm diagnosing that (will post some scaling plots in a sec). I think @kpamnany also saw some segfaults on RAI tests. |
Superseded by #48600. |
This PR extends the refactoring from #45608 by parallelizing the GC mark-loop.
TODO
CC: @vchuravy, @chflood, @kpamnany