RFC: runtime.DedicateOSThread() #2180

Status: Closed · wants to merge 1 commit

Conversation

@copybara-service copybara-service bot commented Mar 18, 2020

RFC: runtime.DedicateOSThread()

Updates #2184

> - New goroutines created by goroutines with locked Ps are enqueued on the
>   global run queue rather than the invoking P's local run queue.
>
> While gVisor's use case does not strictly require that the association is

Contributor:
Thinking out loud: what if threads could be unbound prior to blocking in task.Block? Presumably the state of those goroutines does not require a dedicated system thread while blocked. When unblocked, they could rebind to the system thread. The main problem with this is churning Ps (e.g., a P would go from "bound" to "unbound" and back to "bound").

Member:

More specifically, I think the main problem would be that, since a bound P necessarily doesn't count against GOMAXPROCS, binding a P may force the wakeup of a new M, and unbinding a P may force the M to block until it can obtain an unbound P. (I've now noted this in the "alternatives considered" section.) So this is something we could try, but I don't think it'd be a clear win.

Contributor:

This would be a general concern for sure, but in our case I think that we would have plenty of space for a P to swap with an "unbound" P if that was a fast operation. (Since most Ps will not be active for us, presumably.)

I think that in general there are two valid models we could move towards: a) giving application threads something much closer to system threads and eliminating scheduler churn, or b) having a fixed number of system threads that drive application threads and tame some of the runtime interactions in a different way. Either way, we'd need to fix the findrunnable and indirect wakeup churn.

This proposal is very much (a). There are advantages to (a), but also disadvantages. For example, we'd want to more carefully consider thread bombs, pid limits, etc. (We have this problem today, but it's probably something we want to think about for a proper fix.) I'd love to also think about types of (b) more explicitly and perhaps variants of (a)?

Member:

> This would be a general concern for sure, but in our case I think that we would have plenty of space for a P to swap with an "unbound" P if that was a fast operation. (Since most Ps will not be active for us, presumably.)

I'm not sure that's the case; while the majority of goroutines will be task goroutines, there are still plenty of other goroutines that may be running (timer goroutines, the kernel CPU clock ticker, networking stuff, etc.)

> having a fixed number of system threads that drive application threads and tame some of the runtime interactions in a different way

This would be relatively easy to implement (on some platforms), but compared to kernel thread per application thread, I think it's more sensitive to priority inversions due to kernel thread preemption; we'd have to very carefully measure this (latency distribution with CPU antagonists running on the same system).

Member:

I'd also be concerned about churn, specifically because we definitely have issues with round-trip latency in applications with very short-lived futex (or other) sleeps. Making Task.Block more expensive is the opposite of what we want.

That said, it would hopefully be easy to experiment with.

Contributor:

> I'm not sure that's the case; while the majority of goroutines will be task goroutines, there are still plenty of other goroutines that may be running (timer goroutines, the kernel CPU clock ticker, networking stuff, etc.)

We have lots of additional goroutines, but most are active only briefly (and could presumably be supported by a relatively small number of P's). Note that re: networking, all the per-endpoint goroutines are going away at some point.

> This would be relatively easy to implement (on some platforms), but compared to kernel thread per application thread, I think it's more sensitive to priority inversions due to kernel thread preemption; we'd have to very carefully measure this (latency distribution with CPU antagonists running on the same system).

One interesting consequence of the Sentry's design is that I would expect priority inversion to exist, but presumably the consequences would be less severe. Given that everything boils down to futexes at some point, the scheduler should be able to do the right thing with some minor penalty, and there would be no extreme edge cases (e.g. a thread acquiring a spinlock and then being preempted). I'd be curious to hear if you had worst-case examples in mind.

> I'd also be concerned about churn, specifically because we definitely have issues with round-trip latency in applications with very short-lived futex (or other) sleeps. Making Task.Block more expensive is the opposite of what we want.

Yes, it couldn't be much more expensive. I think there are parts of the design space that would still keep task.Block relatively cheap (esp. if coupled with core runtime improvements to e.g. reduce the findrunnable cost).
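
As an illustrative aside, the "unbind before blocking" pattern debated in this thread would look roughly like the sketch below. None of these hooks exist in the Go runtime today: undedicateOSThread and rededicateOSThread are hypothetical stand-ins for whatever unbind/rebind primitives would be needed, and the channel receive stands in for the actual wait inside Task.Block.

```go
package sketch

// Hypothetical hooks only; nothing like them exists in the Go runtime today.

// undedicateOSThread would release this M's bound P back to the scheduler
// before the goroutine goes to sleep.
func undedicateOSThread() {}

// rededicateOSThread would reacquire a bound P once the goroutine is
// runnable again.
func rededicateOSThread() {}

// blockUnbound shows the shape of the idea around a Task.Block-style wait:
// give up the dedicated P while blocked, take one back on wakeup. The churn
// concern raised above is exactly the cost of these two calls on every
// short-lived sleep.
func blockUnbound(wake <-chan struct{}) {
	undedicateOSThread()
	defer rededicateOSThread()
	<-wake
}
```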

@prattmic (Member) left a comment:

I haven't dived into the runtime specifics, but I like this idea overall. I think we may want a prototype and benchmarks to go along with a proposal (something I can take on), since this seems likely to face more API complexity concerns than implementation complexity concerns.
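
For a sense of the intended calling pattern, here is a minimal usage sketch. runtime.DedicateOSThread does not exist in upstream Go; the code below uses the existing runtime.LockOSThread and only notes in a comment where the proposed call would slot in.

```go
package sketch

import "runtime"

// runTask drives one application (task) goroutine. Under this RFC, the task
// goroutine would call the proposed runtime.DedicateOSThread instead of
// runtime.LockOSThread, pinning itself to an M *and* pinning that M to a P
// that is not counted against GOMAXPROCS.
func runTask(task func()) {
	go func() {
		runtime.LockOSThread() // proposed: runtime.DedicateOSThread()
		defer runtime.UnlockOSThread()
		task()
	}()
}
```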


> Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
> invoking G to a M), but additionally locks the invoking M to a P. Ps locked by
> `DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual

Contributor:

Can we assume that a P's id is less than GOMAXPROCS? If so, then we may need to count the locked Ps in GOMAXPROCS. Currently, sync/pool.go assumes this:

https://github.com/golang/go/blob/7c0ee1127bf41bf274b08170de3e42b171a903c0/src/sync/pool.go#L225-L230

Member:

I think we'd have to change sync.Pool to not make this assumption.
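
For readers who don't follow the link: in simplified form, sync.Pool sizes its per-P slots from GOMAXPROCS(0) and indexes them by the pinned P's id, so a locked P with an id at or above GOMAXPROCS would index past the end of the slice. The sketch below only mirrors that shape; it is not the actual sync/pool.go code.

```go
package sketch

import "runtime"

// perP stands in for sync.Pool's per-P poolLocal slot.
type perP struct{ private interface{} }

type poolSketch struct {
	local []perP // allocated with len == runtime.GOMAXPROCS(0)
}

// slotFor mirrors the shape of sync.Pool's fast path: index the per-P slice
// directly by the pinned P's id. If a dedicated P had an id >= GOMAXPROCS,
// this index would be out of range, which is why sync.Pool would need to
// change (or locked Ps would need ids below GOMAXPROCS).
func (p *poolSketch) slotFor(pid int) *perP {
	if p.local == nil {
		p.local = make([]perP, runtime.GOMAXPROCS(0))
	}
	return &p.local[pid]
}
```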

> invoking G to a M), but additionally locks the invoking M to a P. Ps locked by
> `DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
> number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
> of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).

Contributor:

When there are many Ps, it looks like it could be very expensive for the GC to run the safePointFn and stop the world. Is this expected? Or is there any optimization that can be done here?
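
To make the concern concrete: the GC's safe-point machinery (runtime.forEachP) and stop-the-world both have to visit every P in runtime.allp, so their cost grows with GOMAXPROCS plus the number of bound Ps. The loop below is only a schematic of that shape, not the real runtime code.

```go
package sketch

// pSketch stands in for a runtime P.
type pSketch struct{ id int }

// forEachPSketch is a schematic of why the cost scales with len(allp): the
// runtime must bring every P (including every bound, mostly idle P) to a
// safe point before fn is considered to have run on it, and stop-the-world
// similarly waits on every P.
func forEachPSketch(allp []*pSketch, fn func(*pSketch)) {
	for _, pp := range allp {
		fn(pp) // the real code preempts or handshakes with pp first
	}
}
```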

@btw616 (Contributor) commented Apr 28, 2020:

I have implemented a quick PoC [1][2] for this proposal; I hope it helps. In this PoC, the GC background mark workers are not created for the locked Ps. We may need to think about how to do garbage collection efficiently for the locked Ps.

[1] https://github.com/btw616/go/tree/DedicateOSThread/PoC
[2] https://github.com/btw616/gvisor/tree/golang/DedicateOSThread/PoC

@github-actions bot commented:

This pull request is stale because it has been open 90 days with no activity. Remove the stale label or comment or this will be closed in 30 days.

@github-actions bot added the stale label on Jul 28, 2020 and closed this pull request on Aug 28, 2020.