RFC: runtime.DedicateOSThread() #2180
Conversation
- New goroutines created by goroutines with locked Ps are enqueued on the
  global run queue rather than the invoking P's local run queue.

While gVisor's use case does not strictly require that the association is
Thinking out loud: what if threads could be unbound prior to blocking in task.Block? Presumably the state of those goroutines does not require a dedicated system thread. When unblocked, they could rebind to the system thread. The main problem with this is churning P's (e.g. the P will go from "bound" to "unbound" and back to "bound").
More specifically, I think the main problem would be that, since a bound P necessarily doesn't count against GOMAXPROCS, binding a P may force the wakeup of a new M, and unbinding a P may force the M to block until it can obtain an unbound P. (I've now noted this in the "alternatives considered" section.) So this is something we could try, but I don't think it'd be a clear win.
This would be a general concern for sure, but in our case I think that we would have plenty of space for a P to swap with an "unbound" P if that was a fast operation. (Since most Ps will not be active for us, presumably.)
I think that in general there are two valid models that we could move towards: a) giving application threads something much closer to system threads and eliminating scheduler churn, or b) having a fixed number of system threads that drive application threads and taming some of the runtime interactions in a different way. Either way, we'd need to fix the findrunnable and indirect wakeup churn.
This proposal is very much (a). There are advantages to (a), but also disadvantages. For example, we'd want to more carefully consider thread bombs, pid limits, etc. (We have this problem today, but it's probably something we want to think about for a proper fix.) I'd love to also think about types of (b) more explicitly and perhaps variants of (a)?
This would be a general concern for sure, but in our case I think that we would have plenty of space for a P to swap with an "unbound" P if that was a fast operation. (Since most Ps will not be active for us, presumably.)
I'm not sure that's the case; while the majority of goroutines will be task goroutines, there are still plenty of other goroutines that may be running (timer goroutines, the kernel CPU clock ticker, networking stuff, etc.)
having a fixed number of system threads that are driving application threads and tame some of the runtime interactions in a different way
This would be relatively easy to implement (on some platforms), but compared to kernel thread per application thread, I think it's more sensitive to priority inversions due to kernel thread preemption; we'd have to very carefully measure this (latency distribution with CPU antagonists running on the same system).
I'd also be concerned about churn, specifically because we definitely have issues with round-trip latency in applications with very short-lived futex (or other) sleeps. Making Task.Block more expensive is the opposite of what we want.
That said, it would hopefully be easy to experiment with.
I'm not sure that's the case; while the majority of goroutines will be task goroutines, there are still plenty of other goroutines that may be running (timer goroutines, the kernel CPU clock ticker, networking stuff, etc.)
We have lots of additional goroutines, but most are active only briefly (and could presumably be supported by a relatively small number of P's). Note that re: networking, all the per-endpoint goroutines are going away at some point.
This would be relatively easy to implement (on some platforms), but compared to kernel thread per application thread, I think it's more sensitive to priority inversions due to kernel thread preemption; we'd have to very carefully measure this (latency distribution with CPU antagonists running on the same system).
One interesting consequence of the Sentry's design is that I would expect priority inversion to still exist, but presumably the consequences would be less severe. Given that everything boils down to futexes at some point, the scheduler should be able to do the right thing with some minor penalty, and there would be no extreme edge cases (e.g. a thread being preempted while holding a spinlock). I'd be curious to hear if you had worst-case examples in mind.
I'd also be concerned about churn, specifically because we definitely have issues with round-trip latency in applications with very short-lived futex (or other) sleeps. Making Task.Block more expensive is the opposite of what we want.
Yes, it couldn't be much more expensive. I think there are parts of the design space that would still keep task.Block relatively cheap (esp. if coupled with core runtime improvements to e.g. reduce the findrunnable cost).
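To make the alternative discussed in this thread concrete, here is a minimal sketch of the "unbind before blocking" idea. The undedicateOSThread/rededicateOSThread names are hypothetical (nothing like them is part of this proposal) and are stubbed so the sketch compiles:

```go
package sketch

// undedicateOSThread and rededicateOSThread are hypothetical stand-ins for a
// runtime API that does not exist; they are stubbed here only so the sketch
// compiles. Under the idea above, the first would release the goroutine's
// dedicated P and the second would reacquire one.
func undedicateOSThread() {}
func rededicateOSThread() {}

// blockOn sketches Task.Block under this alternative: give up the dedicated P
// while the task goroutine sleeps, and rebind once it is runnable again. The
// cost is exactly the churn discussed above: every block/unblock cycle rebinds
// a P, and rebinding may force another M to give one up.
func blockOn(ch <-chan struct{}) {
	undedicateOSThread() // return the bound P to the scheduler
	<-ch                 // sleep (futex wait, channel receive, etc.)
	rededicateOSThread() // re-dedicate a P before resuming the task
}
```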
I haven't dived into the runtime specifics, but I like this idea overall. I think we may want a prototype and benchmarks to go along with the proposal (something I can take on), since this seems likely to face more API complexity concerns than implementation complexity concerns.
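For what it's worth, a prototype's use of the API would presumably be quite small. The sketch below is only illustrative: `dedicateOSThread` stands in for the proposed `runtime.DedicateOSThread` (which does not exist yet), and the task-loop shape is assumed rather than taken from the sentry:

```go
package sketch

import "runtime"

// dedicateOSThread stands in for the proposed runtime.DedicateOSThread.
// LockOSThread is used as a placeholder so the sketch compiles; under the
// proposal, the call would additionally bind the locked M to its own P.
func dedicateOSThread() {
	runtime.LockOSThread()
}

// taskGoroutine sketches how a sentry task goroutine might use the API:
// dedicate an OS thread (and P) once at startup, then run the task loop on
// that thread for the lifetime of the application thread it backs.
func taskGoroutine(switchToApp func() (exited bool), handleEvent func()) {
	dedicateOSThread()
	for !switchToApp() { // run the application until a syscall/fault/exit
		handleEvent() // handle the syscall or fault in the sentry
	}
}
```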
Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
invoking G to a M), but additionally locks the invoking M to a P. Ps locked by
`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
Can we assume that a P's id is less than GOMAXPROCS? If that assumption must hold, then we may need to count the locked Ps in GOMAXPROCS, since sync/pool.go currently relies on it.
I think we'd have to change sync.Pool to not make this assumption.
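For context, the assumption in question is the usual per-P sharding pattern: per-P state lives in a slice sized by `GOMAXPROCS` and indexed by the current P's id. The sketch below is an illustration of that pattern, not the actual sync/pool.go code:

```go
package sketch

import "runtime"

// perPCounters illustrates the pattern sync.Pool relies on internally: one
// slot per P, allocated with length GOMAXPROCS and indexed by the P's id.
// (Real code would also pin the goroutine to its P to keep the id stable.)
type perPCounters struct {
	slots []uint64
}

func newPerPCounters() *perPCounters {
	return &perPCounters{slots: make([]uint64, runtime.GOMAXPROCS(0))}
}

// inc assumes pid < GOMAXPROCS. If a P bound by DedicateOSThread were given an
// id at or above GOMAXPROCS (since bound Ps are not counted against it), this
// index would be out of range; hence the need to change sync.Pool.
func (c *perPCounters) inc(pid int) {
	c.slots[pid]++
}
```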
invoking G to a M), but additionally locks the invoking M to a P. Ps locked by
`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).
When there are many Ps, it looks like it could be very expensive for GC to run the safePointFn and stop the world. Is this expected? Or is there any optimization that can be done here?
I have implemented a quick PoC [1][2] for this proposal. Hope it will help. In this PoC, the GC background mark workers are not created for the locked Ps. We may need to think about how to do the garbage collection efficiently for the locked Ps.

[1] https://github.com/btw616/go/tree/DedicateOSThread/PoC
RFC: runtime.DedicateOSThread()
Updates #2184