Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: select on a shared channel is slow with many Ps #20351

Open
bradfitz opened this issue May 13, 2017 · 24 comments
Open

runtime: select on a shared channel is slow with many Ps #20351

bradfitz opened this issue May 13, 2017 · 24 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance Scalability Issues related to runtime/application scalability
Milestone

Comments

@bradfitz
Copy link
Contributor

@tombergan and I have been debugging a severe performance regression that Kubernetes observed when switching from Go 1.7 to Go 1.8. The problem ended up being the addition of net/http.Server.Shutdown that's currently implemented by closing a channel that all the open connections select on.

(Details in kubernetes/kubernetes#45216 and #20302)

But the short summary is:

package selbench
        
import "testing"
        
func BenchmarkSelectShared(b *testing.B) {
        idleShared := make(chan struct{})
        b.RunParallel(func(pb *testing.PB) {
                ch := make(chan int, 1)
                for pb.Next() {
                        select {
                        case ch <- 1:
                        case <-ch:
                        case <-idleShared:
                        }
                }
        })      
}       

func BenchmarkSelectPrivate(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
                ch := make(chan int, 1)
                idlePrivate := make(chan struct{})
                for pb.Next() {
                        select {
                        case ch <- 1:
                        case <-ch:
                        case <-idlePrivate:
                        }
                }
        })
} 

Note that the idle channels below are never closed and are never selectable, but the other two are, and stay busy.

But when the channel is private to the goroutine (uncontended), things are fast. When it's a shared channel, things are slow.

$ go test -v -bench=Select -cpu=1,2,4,8,16,32,64
BenchmarkSelectShared           10000000               194 ns/op
BenchmarkSelectShared-2         10000000               147 ns/op
BenchmarkSelectShared-4          5000000               395 ns/op
BenchmarkSelectShared-8          3000000               449 ns/op
BenchmarkSelectShared-16         5000000               354 ns/op
BenchmarkSelectShared-32         5000000               320 ns/op
BenchmarkSelectShared-64         5000000               296 ns/op
BenchmarkSelectPrivate          10000000               192 ns/op
BenchmarkSelectPrivate-2        20000000                98.0 ns/op
BenchmarkSelectPrivate-4        30000000                49.3 ns/op
BenchmarkSelectPrivate-8        50000000                25.5 ns/op
BenchmarkSelectPrivate-16       100000000               13.8 ns/op
BenchmarkSelectPrivate-32       200000000                7.07 ns/op
BenchmarkSelectPrivate-64       200000000                6.31 ns/op

Are there any optimizations to be done here?

We'll work around it in the meantime and generally keep this in mind as an anti-pattern for performance.

/cc @aclements @randall77 @ianlancetaylor @rsc @josharian @mdempsky

@tombergan
Copy link
Contributor

The root problem is lock contention in runtime.sellock and runtime.selunlock. One idea is to replace runtime.mutex with a scalable lock, such as an MCS lock.

@ianlancetaylor
Copy link
Member

An MCS lock can scale a spinlock, but that's not what this is. Though I suppose maybe it could be, it's not long we ever hold channels locked.

One thing that might speed up this test case would be to scan all the channels without locking them to see if they are ready. Then lock only the ones that are ready, and check them again. But that will only help the real code if there is usually some ready channel. If the select usually has to wait, it won't help, because we need to queue the goroutine on each channel. Even if we use a spinlock that queuing step is going to be a point of contention.

@josharian
Copy link
Contributor

josharian commented May 13, 2017

I think the usage pattern is a bit more accurately described with this benchmark:

package selbench

import (
	"fmt"
	"math/rand"
	"testing"
	"time"
)

func BenchmarkSelectContended(b *testing.B) {
	for workers := 1; workers <= 100000; workers *= 10 {
		b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
			done := make(chan struct{})
			// Spin up workers, each with their own work channel.
			wc := make([]chan struct{}, workers)
			for i := range wc {
				c := make(chan struct{})
				go func() {
					for {
						select {
						case <-done:
							return
						case <-c:
						}
					}
				}()
				wc[i] = c
			}
			b.ResetTimer()
			// Have the workers do b.N units of work.
			b.RunParallel(func(pb *testing.PB) {
				src := rand.New(rand.NewSource(time.Now().UnixNano()))
				for pb.Next() {
					wc[src.Intn(len(wc))] <- struct{}{}
				}
			})
			b.StopTimer()
			close(done)
		})
	}
}

For this, I see a total lack of scaling at small numbers of workers and a performance cliff as the number of workers grows large.

$ go test -bench=. -cpu=1,2,4,8,16,32,64 -benchtime=100ms x_test.go 
goos: darwin
goarch: amd64
BenchmarkSelectContended/workers=1            	  500000	       350 ns/op
BenchmarkSelectContended/workers=1-2          	  300000	       417 ns/op
BenchmarkSelectContended/workers=1-4          	  300000	       487 ns/op
BenchmarkSelectContended/workers=1-8          	  300000	       561 ns/op
BenchmarkSelectContended/workers=1-16         	  200000	       623 ns/op
BenchmarkSelectContended/workers=1-32         	  200000	       628 ns/op
BenchmarkSelectContended/workers=1-64         	  200000	       594 ns/op
BenchmarkSelectContended/workers=10           	  500000	       376 ns/op
BenchmarkSelectContended/workers=10-2         	  300000	       513 ns/op
BenchmarkSelectContended/workers=10-4         	  300000	       620 ns/op
BenchmarkSelectContended/workers=10-8         	  300000	       575 ns/op
BenchmarkSelectContended/workers=10-16        	  300000	       533 ns/op
BenchmarkSelectContended/workers=10-32        	  300000	       508 ns/op
BenchmarkSelectContended/workers=10-64        	  300000	       461 ns/op
BenchmarkSelectContended/workers=100          	  300000	       385 ns/op
BenchmarkSelectContended/workers=100-2        	  300000	       443 ns/op
BenchmarkSelectContended/workers=100-4        	  300000	       439 ns/op
BenchmarkSelectContended/workers=100-8        	  300000	       539 ns/op
BenchmarkSelectContended/workers=100-16       	  200000	       594 ns/op
BenchmarkSelectContended/workers=100-32       	  200000	       601 ns/op
BenchmarkSelectContended/workers=100-64       	  200000	       696 ns/op
BenchmarkSelectContended/workers=1000         	  300000	       452 ns/op
BenchmarkSelectContended/workers=1000-2       	  300000	       426 ns/op
BenchmarkSelectContended/workers=1000-4       	  300000	       412 ns/op
BenchmarkSelectContended/workers=1000-8       	  200000	       563 ns/op
BenchmarkSelectContended/workers=1000-16      	  200000	       551 ns/op
BenchmarkSelectContended/workers=1000-32      	  200000	       606 ns/op
BenchmarkSelectContended/workers=1000-64      	  300000	       601 ns/op
BenchmarkSelectContended/workers=10000        	  200000	       894 ns/op
BenchmarkSelectContended/workers=10000-2      	  200000	       691 ns/op
BenchmarkSelectContended/workers=10000-4      	  200000	       597 ns/op
BenchmarkSelectContended/workers=10000-8      	  200000	       788 ns/op
BenchmarkSelectContended/workers=10000-16     	  100000	      1174 ns/op
BenchmarkSelectContended/workers=10000-32     	  200000	       792 ns/op
BenchmarkSelectContended/workers=10000-64     	  100000	      1070 ns/op
BenchmarkSelectContended/workers=100000       	    1000	    157938 ns/op
BenchmarkSelectContended/workers=100000-2     	     200	   1044491 ns/op
BenchmarkSelectContended/workers=100000-4     	      50	   2425451 ns/op
BenchmarkSelectContended/workers=100000-8     	     300	    848236 ns/op
BenchmarkSelectContended/workers=100000-16    	     500	    201122 ns/op
BenchmarkSelectContended/workers=100000-32    	      20	   9354451 ns/op
BenchmarkSelectContended/workers=100000-64    	       1	 187918798 ns/op
PASS
ok  	command-line-arguments	18.974s

@bradfitz
Copy link
Contributor Author

@josharian, that rand.Intn call in there has a mutex (var globalRand = New(&lockedSource{src: NewSource(1).(Source64)})). I'd remove that and use a goroutine-local math/rand.Rand instead.

@josharian
Copy link
Contributor

@bradfitz sheesh. Thanks. Updated in situ above. Results are unchanged, though.

$ go test -bench=. -cpu=1,2,4,8,16,32,64 -benchtime=100ms x_test.go 
BenchmarkSelectContended/workers=1            	  300000	       358 ns/op
BenchmarkSelectContended/workers=1-2          	  300000	       526 ns/op
BenchmarkSelectContended/workers=1-4          	  300000	       552 ns/op
BenchmarkSelectContended/workers=1-8          	  300000	       600 ns/op
BenchmarkSelectContended/workers=1-16         	  200000	       713 ns/op
BenchmarkSelectContended/workers=1-32         	  200000	       674 ns/op
BenchmarkSelectContended/workers=1-64         	  200000	       701 ns/op
BenchmarkSelectContended/workers=10           	  500000	       399 ns/op
BenchmarkSelectContended/workers=10-2         	  300000	       551 ns/op
BenchmarkSelectContended/workers=10-4         	  300000	       585 ns/op
BenchmarkSelectContended/workers=10-8         	  200000	       660 ns/op
BenchmarkSelectContended/workers=10-16        	  300000	       574 ns/op
BenchmarkSelectContended/workers=10-32        	  300000	       551 ns/op
BenchmarkSelectContended/workers=10-64        	  300000	       498 ns/op
BenchmarkSelectContended/workers=100          	  500000	       439 ns/op
BenchmarkSelectContended/workers=100-2        	  300000	       481 ns/op
BenchmarkSelectContended/workers=100-4        	  300000	       494 ns/op
BenchmarkSelectContended/workers=100-8        	  300000	       569 ns/op
BenchmarkSelectContended/workers=100-16       	  200000	       733 ns/op
BenchmarkSelectContended/workers=100-32       	  200000	       650 ns/op
BenchmarkSelectContended/workers=100-64       	  200000	       645 ns/op
BenchmarkSelectContended/workers=1000         	  200000	       583 ns/op
BenchmarkSelectContended/workers=1000-2       	  300000	       467 ns/op
BenchmarkSelectContended/workers=1000-4       	  300000	       492 ns/op
BenchmarkSelectContended/workers=1000-8       	  200000	       633 ns/op
BenchmarkSelectContended/workers=1000-16      	  200000	       651 ns/op
BenchmarkSelectContended/workers=1000-32      	  200000	       645 ns/op
BenchmarkSelectContended/workers=1000-64      	  200000	       708 ns/op
BenchmarkSelectContended/workers=10000        	  100000	      1056 ns/op
BenchmarkSelectContended/workers=10000-2      	  200000	       756 ns/op
BenchmarkSelectContended/workers=10000-4      	  200000	       691 ns/op
BenchmarkSelectContended/workers=10000-8      	  200000	       795 ns/op
BenchmarkSelectContended/workers=10000-16     	  200000	       810 ns/op
BenchmarkSelectContended/workers=10000-32     	  200000	       804 ns/op
BenchmarkSelectContended/workers=10000-64     	  200000	       889 ns/op
BenchmarkSelectContended/workers=100000       	    1000	    134482 ns/op
BenchmarkSelectContended/workers=100000-2     	     500	    208302 ns/op
BenchmarkSelectContended/workers=100000-4     	     100	   1339142 ns/op
BenchmarkSelectContended/workers=100000-8     	       1	 170459942 ns/op
BenchmarkSelectContended/workers=100000-16    	     100	   1204770 ns/op
BenchmarkSelectContended/workers=100000-32    	     300	    573343 ns/op
BenchmarkSelectContended/workers=100000-64    	       1	 188693577 ns/op
PASS
ok  	command-line-arguments	19.280s

@valyala
Copy link
Contributor

valyala commented May 15, 2017

Cc @dvyukov

@dvyukov
Copy link
Member

dvyukov commented May 15, 2017

Running a benchmark with fast iterations and heavy setup for 1 iteration does not make sense. The numbers at the bottom is setup cost, not the cost of chan operation. Run them for more iterations.

@bradfitz
Copy link
Contributor Author

@dvyukov, see also the original benchmark at top. But ignoring specific benchmarks (flawed or not), do you have any thoughts?

@huguesb
Copy link
Contributor

huguesb commented May 15, 2017

@josharian @dvyukov the addition of a waitgroup fixes the 100k worker case:

package selbench

import (
	"fmt"
	"math/rand"
	"sync"
	"testing"
	"time"
)

func BenchmarkSelectContended(b *testing.B) {
	for workers := 1; workers <= 100000; workers *= 10 {
		b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
			done := make(chan struct{})
			// Spin up workers, each with their own work channel.
			wc := make([]chan struct{}, workers)
			var wg sync.WaitGroup
			wg.Add(workers)
			for i := range wc {
				c := make(chan struct{})
				go func() {
					wg.Done()
					for {
						select {
						case <-done:
							return
						case <-c:
						}
					}
				}()
				wc[i] = c
			}
			wg.Wait()
			b.ResetTimer()
			// Have the workers do b.N units of work.
			b.RunParallel(func(pb *testing.PB) {
				src := rand.New(rand.NewSource(time.Now().UnixNano()))
				for pb.Next() {
					wc[src.Intn(len(wc))] <- struct{}{}
				}
			})
			b.StopTimer()
			close(done)
		})
	}
}

This gives me the following on a 4 core i7 (w/ hyper threading):

name                              time/op
SelectContended/workers=1          408ns ± 1%
SelectContended/workers=1-2        483ns ± 0%
SelectContended/workers=1-4        528ns ± 2%
SelectContended/workers=1-8        595ns ± 4%
SelectContended/workers=10         454ns ± 0%
SelectContended/workers=10-2       536ns ± 1%
SelectContended/workers=10-4       634ns ± 3%
SelectContended/workers=10-8       880ns ±10%
SelectContended/workers=100        490ns ± 2%
SelectContended/workers=100-2      522ns ± 2%
SelectContended/workers=100-4      596ns ± 3%
SelectContended/workers=100-8      855ns ± 6%
SelectContended/workers=1000       582ns ± 2%
SelectContended/workers=1000-2     510ns ± 5%
SelectContended/workers=1000-4     563ns ± 3%
SelectContended/workers=1000-8     838ns ± 5%
SelectContended/workers=10000      913ns ± 5%
SelectContended/workers=10000-2    542ns ± 2%
SelectContended/workers=10000-4    533ns ± 4%
SelectContended/workers=10000-8    911ns ± 4%
SelectContended/workers=100000    1.70µs ± 7%
SelectContended/workers=100000-2   915ns ± 7%
SelectContended/workers=100000-4   685ns ±12%
SelectContended/workers=100000-8  1.15µs ± 3%

@dvyukov
Copy link
Member

dvyukov commented May 15, 2017

Lock-free channels help to some degree (#8899):

tip:
BenchmarkSelectContended10000-64 2000000 1985 ns/op
BenchmarkSelectContended10000-64 2000000 1929 ns/op
BenchmarkSelectContended10000-64 2000000 2031 ns/op

3c3be62 (base for lock-free chans):
BenchmarkSelectContended10000-64 50000 110105 ns/op
BenchmarkSelectContended10000-64 50000 110315 ns/op
BenchmarkSelectContended10000-64 50000 113208 ns/op

3c3be62 with lock-free chans:
BenchmarkSelectContended10000-64 5000000 1141 ns/op
BenchmarkSelectContended10000-64 5000000 1150 ns/op
BenchmarkSelectContended10000-64 5000000 1208 ns/op

@dvyukov
Copy link
Member

dvyukov commented May 15, 2017

@bradfitz What is that shared chan in http? srv.closeDoneChanLocked? I don't see where it is checked during connection serving.
The select is slow, but it does not seem to be too slow. Also numbers does not seem to depend too much on number of goroutines (I guess it is that you just mostly need enough of them to create contention from multiple CPUs).

Shouldn't Served.Shutdown do closeDoneChanLocked before closeListenersLocked? Otherwise error handling logic in Server.Serve is broken. I am not sure if there will a temp error on listener close or not, but bad both ways.

@tombergan
Copy link
Contributor

The shared chan is gracefulShutdownCh in x/net/http2.

@dvyukov
Copy link
Member

dvyukov commented May 16, 2017

How permanent there connections? FWIW we could create a per-conn done chan, add it to a global list of done chans once and then select on it. Shutdown will then need to go over all registered chans and close all of them.

@tombergan
Copy link
Contributor

We already did that :) See the CLs in #20302. This bug is about finding a more general solution.

@dvyukov
Copy link
Member

dvyukov commented May 16, 2017

What is that http2/server.go file? Does not seen to be in std checkout.

@dvyukov
Copy link
Member

dvyukov commented May 16, 2017

Starting a goroutine for awaitGracefulShutdown looks somewhat wasteful. But up to you.

@tombergan
Copy link
Contributor

tombergan commented May 16, 2017

http2/server.go is in golang.org/x/net/http2. The goroutine was a quick fix for a possible go 1.8.2 release. See https://golang.org/cl/43455 and https://golang.org/cl/43230 and feel free to comment on that last CL if you have other ideas.

We're getting off topic. The general problem this bug is trying to address is demonstrated by the benchmark in Brad's original comment.

Edit: fixed CL link. By "that last CL", I meant https://golang.org/cl/43230.

@bradfitz
Copy link
Contributor Author

We're getting off topic. The general problem this bug is trying to address is demonstrated by the benchmark in Brad's original comment.

Yes, let's use the right bugs for the right discussion.

This bug is about the general problem in the Go runtime.

#20302 is about working around the problem in http/http2.

kubernetes/kubernetes#45216 is the original Kubernetes speed regression/rollback bug.

@dvyukov
Copy link
Member

dvyukov commented May 16, 2017

@dvyukov, see also the original benchmark at top. But ignoring specific benchmarks (flawed or not), do you have any thoughts?

I think this case is inherently hard. Multiple threads are hammering the same complex mutex-protected object. When memory is write-shared a simple memory load can take up to 100ns. And the mutex makes it all worse.

The numbers are need to be interpreted carefully. We divide work across multiple threads. So this:

BenchmarkSelectContended/workers=1000         	  200000	       583 ns/op
BenchmarkSelectContended/workers=1000-64      	  200000	       708 ns/op

rescaled to op/per-core is actually (assuming that the machine does have 64 cores):

BenchmarkSelectContended/workers=1000         	  200000	       583 ns/op
BenchmarkSelectContended/workers=1000-64      	  200000	     45312 ns/op

so in some cases we see up to 100x slowdown.

There is no easy way to fix this entirely. That would require some monstrous distributed construct that consumes considerably more memory and then some logic to detect when a chan actually falls into this case and dynamically alter the algorithm to a different scheme, and then alter it back if we see that we misguessed (e.g. sends appear of what we though is a close-notification chan). I don't think it's worth it. At least we have lots of lower hanging fruits.

Now, what I think we should do to improve things to some degree is:

  1. Take the finer-grained locking in select logic from https://codereview.appspot.com/12544043 and apply it. It's orthogonal to the lock-free stuff. It will considerably reduce duration of critical section.
  2. Optimize runtime mutex more. Off the top of my head: apply the so called "thundering herd" optimization; don't call futex wake when there is nobody to wake; tune spinning logic; maybe don't wake waiters when there are spinning threads.
  3. We may try the so called combining locks (https://software.intel.com/en-us/blogs/2013/02/22/combineraggregator-synchronization-primitive). Sometimes they are useful to somewhat improve behavior of highly contented data structures.
  4. Do some spot optimizations in the select, esp inside critical sections.

And a more realistic benchmark for this case would be to add a bunch of local non-ready chans to the select, because that's what http2 does, and these additional chans significantly prolong critical section in select.

@navytux
Copy link
Contributor

navytux commented May 24, 2018

  1. Take the finer-grained locking in select logic from https://codereview.appspot.com/12544043 and apply it. It's orthogonal to the lock-free stuff. It will considerably reduce duration of critical section.

For the reference: fine-grained locking for select is #8896.

iwasaki-kenta referenced this issue in perlin-network/noise May 9, 2019
@CAFxX
Copy link
Contributor

CAFxX commented May 28, 2020

I am experimenting with generalizing the single-channel select optimizations done by the compiler to multiple channels. A quick prototype implemented in the runtime looks promising, but before spending too much time on it (there's one regression that needs work) I wanted to ask for help in understanding if the approach is, in general, ok from the perspective of the language spec. I tried to ask on gophers slack, but we could not find a good argument as to why this shouldn't be allowed.

The gist of the idea is to try to avoid locking all channels, if we detect that any of the channels is ready.

Basically, are the following transformations allowed by the spec?

// non-blocking case
select {
case <-f1():
case <-f2():
default:
  // body default
}

to

c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
  select {
  case <-c2:
  default:
    // body default
  }
}

and

// blocking case
select {
case <-f1():
case <-f2():
}

to

c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
  select {
  case <-c2:
  default:
    select {
    case <-c1:
    case <-c2:
    }
  }
}

The one I'm least confident about is the non-blocking case, as I'm specifically not sure whether picking the default case in that way is correct. This could possibly be a safer, but slower, approach:

c1, c2 := f1(), f2()
if randbool() {
	c1, c2 = c2, c1
}
select {
case <-c1:
default:
  select {
  case <-c2:
  default:
    select {
    case <-c1:
    case <-c2:
    default:
      // body default
    }
  }
}

@ianlancetaylor
Copy link
Member

@CAFxX This issue is about a specific problem that I don't think is addressed by your proposal. So this issue is not the best place to discuss your idea. The best place to discuss it would be the golang-dev mailing list. Thanks.

@CAFxX
Copy link
Contributor

CAFxX commented May 28, 2020

Sure, will do, although it is definitely addressing the problem of lock contention (as it reduces the number of sellock/selunlock calls, and therefore the total number of lock2/unlock2 calls, except for the blocking case in which no channel is ready) and it should help with both bradfitz's and josharian's benchmarks (although I haven't run them yet).

FWIW, you actually proposed almost the same idea in your first comment: #20351 (comment)


update: asked on https://groups.google.com/forum/#!topic/golang-dev/jX4oQEls3uk

@ianlancetaylor
Copy link
Member

Honestly I don't see my suggestion as the same as yours, since I was suggesting looking at all the channels without taking any locks, which is not what your code does. I may well be missing something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance Scalability Issues related to runtime/application scalability
Projects
None yet
Development

No branches or pull requests