runtime: select on a shared channel is slow with many Ps #20351
Comments
The root problem is lock contention in |
An MCS lock can scale a spinlock, but that's not what this is. Though I suppose maybe it could be; we never hold channels locked for long. One thing that might speed up this test case would be to scan all the channels without locking them to see if they are ready, then lock only the ones that are ready and check them again. But that will only help the real code if there is usually some ready channel. If the select usually has to wait, it won't help, because we need to queue the goroutine on each channel. Even if we use a spinlock, that queuing step is going to be a point of contention. |
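A rough, self-contained sketch of that unlocked-scan idea, using toy types rather than the runtime's real hchan/scase structures (assumed shape, not actual runtime code):

package toyselect

import (
	"sync"
	"sync/atomic"
)

// toyChan is an illustrative stand-in for a channel; "ready" models
// "an operation on this channel could complete right now".
type toyChan struct {
	mu    sync.Mutex
	ready atomic.Bool
}

// trySelectFast peeks at every case without taking its lock, then locks
// and re-checks only the channels that looked ready. If nothing is ready
// it returns -1 and the caller must fall back to the slow path, which
// locks all channels and enqueues the goroutine on each of them.
func trySelectFast(cases []*toyChan) int {
	for i, c := range cases {
		if !c.ready.Load() { // unlocked peek; may be stale
			continue
		}
		c.mu.Lock()
		ok := c.ready.Load() // re-check under the lock
		if ok {
			c.ready.Store(false) // "complete" the operation
		}
		c.mu.Unlock()
		if ok {
			return i
		}
	}
	return -1
}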
I think the usage pattern is a bit more accurately described with this benchmark:
package selbench

import (
	"fmt"
	"math/rand"
	"testing"
	"time"
)

func BenchmarkSelectContended(b *testing.B) {
	for workers := 1; workers <= 100000; workers *= 10 {
		b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
			done := make(chan struct{})
			// Spin up workers, each with their own work channel.
			wc := make([]chan struct{}, workers)
			for i := range wc {
				c := make(chan struct{})
				go func() {
					for {
						select {
						case <-done:
							return
						case <-c:
						}
					}
				}()
				wc[i] = c
			}
			b.ResetTimer()
			// Have the workers do b.N units of work.
			b.RunParallel(func(pb *testing.PB) {
				src := rand.New(rand.NewSource(time.Now().UnixNano()))
				for pb.Next() {
					wc[src.Intn(len(wc))] <- struct{}{}
				}
			})
			b.StopTimer()
			close(done)
		})
	}
}
For this, I see a total lack of scaling at small numbers of workers and a performance cliff as the number of workers grows large.
|
@josharian, that |
@bradfitz sheesh. Thanks. Updated in situ above. Results are unchanged, though.
|
Cc @dvyukov |
Running a benchmark with fast iterations and heavy setup for only 1 iteration does not make sense. The numbers at the bottom are setup cost, not the cost of the chan operation. Run them for more iterations. |
@dvyukov, see also the original benchmark at top. But ignoring specific benchmarks (flawed or not), do you have any thoughts? |
@josharian @dvyukov the addition of a waitgroup fixes the 100k worker case:
This gives me the following on a 4 core i7 (w/ hyper threading):
|
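A guess at what that waitgroup change looks like (not the exact diff, which isn't shown here): have every worker signal that it is running before the timer is reset, so the startup cost of 100k goroutines isn't attributed to the channel operations. This needs "sync" added to the benchmark's imports.

var started sync.WaitGroup
started.Add(workers)
for i := range wc {
	c := make(chan struct{})
	go func() {
		started.Done() // worker is scheduled and running
		for {
			select {
			case <-done:
				return
			case <-c:
			}
		}
	}()
	wc[i] = c
}
started.Wait() // all workers are up before timing starts
b.ResetTimer()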
Lock-free channels help to some degree (#8899):
tip:
3c3be62 (base for lock-free chans):
3c3be62 with lock-free chans: |
@bradfitz What is that shared chan in http? Shouldn't Server.Shutdown do closeDoneChanLocked before closeListenersLocked? Otherwise the error handling logic in Server.Serve is broken. I am not sure if there will be a temp error on listener close or not, but it's bad both ways. |
The shared chan is gracefulShutdownCh in x/net/http2. |
How permanent are these connections? FWIW we could create a per-conn done chan, add it to a global list of done chans once, and then select on it. Shutdown will then need to go over all registered chans and close all of them. |
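A minimal sketch of that per-connection registration scheme (illustrative names only; the actual change, referenced in the next comment, lives in the x/net/http2 CLs):

package shutdownreg

import "sync"

// registry hands each connection its own done channel, so no two
// connections ever select on the same channel.
type registry struct {
	mu     sync.Mutex
	closed bool
	chans  map[chan struct{}]struct{}
}

func newRegistry() *registry {
	return &registry{chans: make(map[chan struct{}]struct{})}
}

// register returns a fresh done channel for one connection and records it.
// If shutdown already happened, the channel is returned already closed.
func (r *registry) register() chan struct{} {
	c := make(chan struct{})
	r.mu.Lock()
	if r.closed {
		close(c)
	} else {
		r.chans[c] = struct{}{}
	}
	r.mu.Unlock()
	return c
}

// unregister drops a connection's channel when the connection goes away.
func (r *registry) unregister(c chan struct{}) {
	r.mu.Lock()
	delete(r.chans, c)
	r.mu.Unlock()
}

// shutdown walks the registered channels and closes each one, waking
// every connection without any shared-channel lock contention in select.
func (r *registry) shutdown() {
	r.mu.Lock()
	r.closed = true
	for c := range r.chans {
		close(c)
	}
	r.chans = nil
	r.mu.Unlock()
}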
We already did that :) See the CLs in #20302. This bug is about finding a more general solution. |
What is that |
Starting a goroutine for |
We're getting off topic. The general problem this bug is trying to address is demonstrated by the benchmark in Brad's original comment. Edit: fixed CL link. By "that last CL", I meant https://golang.org/cl/43230. |
Yes, let's use the right bugs for the right discussion. This bug is about the general problem in the Go runtime. #20302 is about working around the problem in http/http2. kubernetes/kubernetes#45216 is the original Kubernetes speed regression/rollback bug. |
I think this case is inherently hard. Multiple threads are hammering the same complex mutex-protected object. When memory is write-shared, a simple memory load can take up to 100ns. And the mutex makes it all worse. The numbers need to be interpreted carefully. We divide work across multiple threads. So this:
rescaled to ops per core is actually (assuming that the machine does have 64 cores):
so in some cases we see up to a 100x slowdown. There is no easy way to fix this entirely. That would require some monstrous distributed construct that consumes considerably more memory, plus logic to detect when a chan actually falls into this case and dynamically switch the algorithm to a different scheme, and then switch it back if we see that we guessed wrong (e.g. sends appear on what we thought was a close-notification chan). I don't think it's worth it. At least we have lots of lower-hanging fruit. Now, what I think we should do to improve things to some degree is:
And a more realistic benchmark for this case would be to add a bunch of local non-ready chans to the select, because that's what http2 does, and these additional chans significantly prolong the critical section in select. |
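Applied to the worker loop of the benchmark earlier in this thread, that suggestion might look like the following (an adaptation, not code posted in this issue): each worker adds a few private, never-ready channels to its select, so every select has to lock and enqueue on more channels, lengthening the critical section the way the http2 connection loop does.

go func() {
	// Never sent to and never closed: purely to make the select wider.
	idle1 := make(chan struct{})
	idle2 := make(chan struct{})
	idle3 := make(chan struct{})
	for {
		select {
		case <-done:
			return
		case <-c:
		case <-idle1:
		case <-idle2:
		case <-idle3:
		}
	}
}()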
For reference: fine-grained locking for select is #8896. |
I am experimenting with generalizing the single-channel select optimizations done by the compiler to multiple channels. A quick prototype implemented in the runtime looks promising, but before spending too much time on it (there's one regression that needs work) I wanted to ask for help in understanding if the approach is, in general, ok from the perspective of the language spec. I tried to ask on gophers slack, but we could not find a good argument as to why this shouldn't be allowed. The gist of the idea is to avoid locking all channels if we detect that any of the channels is ready. Basically, are the following transformations allowed by the spec?
// non-blocking case
select {
case <-f1():
case <-f2():
default:
	// body default
}
to
c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
	select {
	case <-c2:
	default:
		// body default
	}
}
and
// blocking case
select {
case <-f1():
case <-f2():
}
to
c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
	select {
	case <-c2:
	default:
		select {
		case <-c1:
		case <-c2:
		}
	}
}
The one I'm least confident about is the non-blocking case, as I'm specifically not sure whether picking the default case in that way is correct. This could possibly be a safer, but slower, approach:
c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
	select {
	case <-c2:
	default:
		select {
		case <-c1:
		case <-c2:
		default:
			// body default
		}
	}
} |
@CAFxX This issue is about a specific problem that I don't think is addressed by your proposal. So this issue is not the best place to discuss your idea. The best place to discuss it would be the golang-dev mailing list. Thanks. |
Sure, will do, although it is definitely addressing the problem of lock contention (as it reduces the number of sellock/selunlock calls, and therefore the total number of lock2/unlock2 calls, except for the blocking case in which no channel is ready), and it should help with both bradfitz's and josharian's benchmarks (although I haven't run them yet). FWIW, you actually proposed almost the same idea in your first comment: #20351 (comment). Update: asked on https://groups.google.com/forum/#!topic/golang-dev/jX4oQEls3uk |
Honestly I don't see my suggestion as the same as yours, since I was suggesting looking at all the channels without taking any locks, which is not what your code does. I may well be missing something. |
@tombergan and I have been debugging a severe performance regression that Kubernetes observed when switching from Go 1.7 to Go 1.8. The problem ended up being the addition of net/http.Server.Shutdown, which is currently implemented by closing a channel that all the open connections select on. (Details in kubernetes/kubernetes#45216 and #20302.)
But the short summary is:
Note that the idle channels below are never closed and are never selectable, but the other two are, and stay busy.
But when the channel is private to the goroutine (uncontended), things are fast. When it's a shared channel, things are slow.
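The benchmark from the original report isn't reproduced in this thread; as a rough illustration of the contrast described above (an assumed shape, not the original code), compare selecting on a goroutine-private done channel with selecting on one done channel shared by all goroutines:

package selbench

import "testing"

func BenchmarkSelectDone(b *testing.B) {
	for _, shared := range []bool{false, true} {
		name := "privateDone"
		if shared {
			name = "sharedDone"
		}
		b.Run(name, func(b *testing.B) {
			sharedDone := make(chan struct{}) // one done channel for every goroutine
			b.RunParallel(func(pb *testing.PB) {
				done := sharedDone
				if !shared {
					done = make(chan struct{}) // private: no cross-P contention
				}
				// Idle channels are never closed and never selectable.
				idle1 := make(chan struct{})
				idle2 := make(chan struct{})
				work := make(chan struct{}, 1)
				for pb.Next() {
					work <- struct{}{}
					select {
					case <-done:
					case <-idle1:
					case <-idle2:
					case <-work:
					}
				}
			})
		})
	}
}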
Are there any optimizations to be done here?
We'll work around it in the meantime and generally keep this in mind as an anti-pattern for performance.
/cc @aclements @randall77 @ianlancetaylor @rsc @josharian @mdempsky