
[rfc] parallel marking #45639

Closed
d-netto wants to merge 84 commits from dcn/pmark2

Conversation

@d-netto (Member) commented Jun 10, 2022

This PR extends the refactoring from #45608 by parallelizing the GC mark-loop.

TODO

  • Move the work-stealing queue implementation to a separate file (a rough sketch of the general technique is below).
  • Run more benchmarks (and test scalability for a large number of threads).
  • Fix GC debugging infrastructure & annotations.
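For context on the first TODO item, the local queue used by each GC thread in a parallel marker is typically a work-stealing deque: the owning thread pushes and pops at one end, while idle threads steal from the other end. The sketch below is only a generic, fixed-capacity Chase-Lev-style deque in C11; the `ws_*` names are hypothetical and this is not the implementation in this PR.

```c
/* Generic fixed-capacity, Chase-Lev-style work-stealing deque (illustrative
 * only; zero-initialize before use, overflow/growth handling omitted). */
#include <stdatomic.h>
#include <stdbool.h>

#define WS_QUEUE_CAP 1024  /* must be a power of two */

typedef struct {
    atomic_long top;              /* stealers take from this end */
    atomic_long bottom;           /* the owner pushes/pops at this end */
    void *buffer[WS_QUEUE_CAP];
} ws_queue_t;

/* Owner thread only: push one item. Returns false when the deque is full. */
static bool ws_push(ws_queue_t *q, void *item)
{
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t >= WS_QUEUE_CAP)
        return false;
    q->buffer[b & (WS_QUEUE_CAP - 1)] = item;
    atomic_store_explicit(&q->bottom, b + 1, memory_order_release);
    return true;
}

/* Owner thread only: pop the most recently pushed item (LIFO end). */
static void *ws_pop(ws_queue_t *q)
{
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_seq_cst);
    long t = atomic_load_explicit(&q->top, memory_order_seq_cst);
    if (t > b) {                                  /* deque was empty */
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        return NULL;
    }
    void *item = q->buffer[b & (WS_QUEUE_CAP - 1)];
    if (t == b) {                                 /* last item: race with stealers */
        if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
            item = NULL;                          /* a stealer won the race */
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    }
    return item;
}

/* Any other thread: steal the oldest item (FIFO end). NULL means empty or lost race. */
static void *ws_steal(ws_queue_t *q)
{
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    long b = atomic_load_explicit(&q->bottom, memory_order_seq_cst);
    if (t >= b)
        return NULL;
    void *item = q->buffer[t & (WS_QUEUE_CAP - 1)];
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return NULL;
    return item;
}
```

In a parallel mark-loop, each thread would drain its own deque with `ws_pop` and fall back to `ws_steal` on the other threads' deques when it runs dry.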

CC: @vchuravy, @chflood, @kpamnany

@oscardssmith (Member) commented Jun 10, 2022

Can we get 32-core benchmarks as well? It would be very unfortunate if this were a regression for CPUs with many cores. (It would be fine to limit the GC to 4 threads if that's all it can use, but negative scaling would be very unfortunate.)

@jpsamaroo jpsamaroo added the GC Garbage collector label Jun 11, 2022
@d-netto d-netto mentioned this pull request Jun 12, 2022
@d-netto (Member, Author) commented Jun 17, 2022

(EDIT: see scaling plots below)

@oscardssmith (Member) commented:

Were these timings made on a version of Julia that includes the changes from #45714 that make the GC time be reported correctly?

@d-netto (Member, Author) commented Jun 17, 2022

No. Will merge and update that.

@d-netto d-netto force-pushed the dcn/pmark2 branch 7 times, most recently from afac580 to 5414eb5 Compare June 26, 2022 21:40
@nanosoldier (Collaborator) commented:

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo -n /nanosoldier/cset/bin/cset shield -e -- sudo -n -u nanosoldier-worker -- /nanosoldier/workdir/jl_Ru6OqQ/benchscript.sh`, ProcessSignaled(6)) [0]

Logs and partial data can be found here

@vchuravy (Member) commented Aug 3, 2022

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@d-netto (Member, Author) commented Aug 5, 2022

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@vtjnash: this comment was marked as off-topic.

@d-netto d-netto mentioned this pull request Aug 13, 2022
@chflood (Member) left a comment:

So I ran the workspace on several examples and it met my criteria of getting better performance with 2 parallel threads than with our current single-threaded solution.

The changes look reasonable to me.

This is a huge, disruptive change, and I think we need to think long and hard about whether we want parallel marking to be our only GC algorithm or whether we want users to be able to choose at run time.

@oscardssmith (Member) commented:

Were there any regressions? Can you post the before/after?

@d-netto d-netto force-pushed the dcn/pmark2 branch 3 times, most recently from fb4da4a to 04e3187 Compare September 8, 2022 00:38
@vchuravy (Member) commented Sep 9, 2022

@nanosoldier runtests(ALL, vs = ":master", buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"], vs_buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"])

@d-netto d-netto mentioned this pull request Nov 28, 2022
@d-netto d-netto closed this Jan 23, 2023
@d-netto d-netto reopened this Jan 24, 2023
vtjnash pushed a commit that referenced this pull request Jan 24, 2023
## Previous work

Since #21590, the GC mark-loop has been implemented by keeping two manually managed stacks, one of which contained iterator states used to keep track of progress within the object currently being marked. As an example, to mark an array we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; if we found one, we would update the iterator state (to record the index where we left off), push it back onto the stack, and proceed with marking the reference we had just found.
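To make the old scheme concrete, here is a toy sketch (hypothetical names and types, not the actual code in `src/gc.c`): marking a container object keeps an explicit iterator state that records where scanning left off, and that state is pushed back onto the stack whenever we descend into an unmarked child.

```c
/* Toy model of the pre-existing scheme: a stack of iterator states. */
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    bool marked;
    size_t nchildren;
    struct node **children;
} node_t;

typedef struct {
    node_t *obj;   /* container currently being scanned */
    size_t idx;    /* index where scanning left off */
} iter_state_t;

static iter_state_t state_stack[1024];   /* overflow handling omitted */
static size_t state_top;

/* Resume scanning st.obj at st.idx; when an unmarked child is found, update
 * the iterator state, "re-push" it, and descend into the child first. */
static void scan_with_iter_state(iter_state_t st)
{
    for (; st.idx < st.obj->nchildren; st.idx++) {
        node_t *child = st.obj->children[st.idx];
        if (child && !child->marked) {
            child->marked = true;
            st.idx++;                                  /* record the resume point */
            state_stack[state_top++] = st;             /* re-push the parent's state */
            state_stack[state_top++] = (iter_state_t){ child, 0 };  /* scan the child next */
            return;
        }
    }
}

static void mark_loop_old_style(node_t *root)
{
    root->marked = true;
    state_stack[state_top++] = (iter_state_t){ root, 0 };
    while (state_top > 0)
        scan_with_iter_state(state_stack[--state_top]);
}
```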

## This PR

This PR eliminates the need to keep iterator states by modifying the object-graph traversal code. We keep a single stack of `jl_value_t *` pointers to the objects currently being processed. To mark an object, we pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
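A toy sketch of the new scheme (again with hypothetical names and types, reusing the same toy `node_t` as above): a single stack of object pointers suffices, because an object's children are all pushed before we move on, so no per-object iterator state has to be saved.

```c
/* Toy model of the scheme in this PR: one stack of object pointers, no
 * iterator states. */
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    bool marked;
    size_t nchildren;
    struct node **children;
} node_t;

static node_t *mark_stack[1024];   /* overflow handling omitted */
static size_t mark_top;

static void mark_loop_new_style(node_t *root)
{
    if (!root->marked) {
        root->marked = true;
        mark_stack[mark_top++] = root;
    }
    while (mark_top > 0) {
        node_t *obj = mark_stack[--mark_top];       /* pop the next object */
        for (size_t i = 0; i < obj->nchildren; i++) {
            node_t *child = obj->children[i];
            if (child && !child->marked) {
                child->marked = true;               /* mark on push to avoid duplicates */
                mark_stack[mark_top++] = child;     /* queue it for scanning */
            }
        }
    }
}
```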

I believe this doesn't break any invariant of the generational GC. Indeed, the age bits are set after marking (if an object survived one GC cycle it's moved to the old generation), so this new traversal scheme doesn't change whether an object is recorded as having references to old objects. Furthermore, we must not update GC metadata for objects in the `remset`, and we ensure this by calling `gc_mark_outrefs` in `gc_queue_remset` with `meta_updated` set to 1.
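As a rough illustration of that last point (hypothetical names only, not the actual `gc_mark_outrefs`/`gc_queue_remset` signatures): the marking entry point can take a flag saying whether generational metadata may be updated, and the remset path always passes 1 so that remset objects are traversed without their metadata being touched.

```c
/* Hypothetical sketch; not the actual gc_mark_outrefs/gc_queue_remset code. */
#include <stddef.h>

typedef struct toy_obj {
    int marked;
    int age;   /* generational metadata: promoted to old once old enough */
    /* ... outgoing references would live here ... */
} toy_obj_t;

/* Scan an object's outgoing references; generational metadata is updated
 * only when meta_updated == 0. */
static void toy_mark_outrefs(toy_obj_t *obj, int meta_updated)
{
    if (!meta_updated)
        obj->age++;   /* survived a cycle: update age/promotion metadata */
    /* ... push unmarked references onto the mark queue ... */
}

/* Remset objects are already old: traverse them, but leave metadata alone. */
static void toy_queue_remset(toy_obj_t **remset, size_t len)
{
    for (size_t i = 0; i < len; i++)
        toy_mark_outrefs(remset[i], /* meta_updated = */ 1);
}
```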

## Additional advantages

1. There are no recursive function calls in the GC mark-loop code (one of the reasons why #21590 was implemented).
2. Keeping a single GC queue will **greatly** simplify work-stealing in the multi-threaded GC we are working on (cf. #45639).
3. Arrays of references, for example, are now marked in a regular-stride fashion, which could help with hardware prefetching.
4. We can easily change the traversal mode (to breadth-first, for example) by changing only the `jl_gc_markqueue_t` methods (e.g. from LIFO to FIFO) without touching the mark-loop itself, which could enable further exploration of the GC in the future (see the sketch after this list).
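A toy sketch of point 4 (hypothetical names; the real `jl_gc_markqueue_t` is more involved): the mark-loop only ever calls one "take" function on the queue, so switching between depth-first and breadth-first traversal is a change to the queue methods alone.

```c
/* Toy mark queue (the real jl_gc_markqueue_t is more involved). */
#include <stddef.h>

typedef struct {
    void *items[1024];   /* bounds/wrap-around handling omitted */
    size_t head;         /* FIFO dequeue position */
    size_t tail;         /* push / LIFO pop position */
} toy_markqueue_t;

static void mq_push(toy_markqueue_t *mq, void *obj)
{
    mq->items[mq->tail++] = obj;
}

/* LIFO policy: take the newest item -> depth-first traversal. */
static void *mq_pop_lifo(toy_markqueue_t *mq)
{
    return (mq->tail > mq->head) ? mq->items[--mq->tail] : NULL;
}

/* FIFO policy: take the oldest item -> breadth-first traversal. */
static void *mq_dequeue_fifo(toy_markqueue_t *mq)
{
    return (mq->head < mq->tail) ? mq->items[mq->head++] : NULL;
}

/* The mark-loop calls a single "take" function, so switching traversal order
 * is a change in the queue methods, not in the loop itself. */
```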

Since this PR changes the mark-loop graph-traversal code, there are also some changes to the heap snapshot, though I'm not familiar with that PR.

Some benchmark results are here: https://hackmd.io/@Idnmfpb3SxK98-OsBtRD5A/H1V6QSzvs.
@vtjnash (Member) commented Feb 2, 2023

What is the status of this now that #21590 was merged?

@d-netto (Member, Author) commented Feb 2, 2023

I think it's not ready to merge/review yet. It's still causing regressions on some of the GCBenchmarks and I'm diagnosing that (will post some scaling plots in a sec).

I think @kpamnany also saw some segfaults on RAI tests.

@d-netto (Member, Author) commented Feb 2, 2023

Speedups/slowdowns in GC time relative to the master commit this branch was created from, vs. nthreads (1, 2, 4, 8, 16):

[Scaling plots: positive-gc-scaling, neutral-gc-scaling, negative-gc-scaling]

@d-netto (Member, Author) commented Feb 19, 2023

Superseded by #48600.

@d-netto d-netto closed this Feb 19, 2023
Labels: GC Garbage collector
8 participants