
goroutine build up stuck in dialing through quic #57

Closed
vyzo opened this issue Apr 10, 2019 · 19 comments

@vyzo
Contributor

vyzo commented Apr 10, 2019

We are seeing a slow goroutine build-up in our relays, with the goroutines stuck dialing through QUIC:

    165407   runtime.gopark
             runtime.goparkunlock
             sync.runtime_notifyListWait
             sync.(*Cond).Wait
             github.com/lucas-clemente/quic-go.(*outgoingBidiStreamsMap).OpenStreamSync
             github.com/lucas-clemente/quic-go.(*streamsMap).OpenStreamSync
             github.com/lucas-clemente/quic-go.(*session).OpenStreamSync
             github.com/libp2p/go-libp2p-quic-transport.(*conn).OpenStream
             github.com/libp2p/go-libp2p-swarm.(*Conn).NewStream
             github.com/libp2p/go-libp2p-swarm.(*Swarm).NewStream
             github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStream
             github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).NewStream
             github.com/libp2p/go-libp2p-circuit.(*Relay).DialPeer
             github.com/libp2p/go-libp2p-circuit.(*Relay).Dial
             github.com/libp2p/go-libp2p-circuit.(*RelayTransport).Dial
             github.com/libp2p/go-libp2p-swarm.(*Swarm).dialAddr
             github.com/libp2p/go-libp2p-swarm.(*dialLimiter).executeDial

The reason for dialing these peers in the first place is as yet unknown, but the build-up is more alarming and indicative of a bug.
The daemon is running with the latest go-libp2p-quic-transport release, v0.0.3.

@marten-seemann
Collaborator

I'm not sure I understand the issue here. You're trying to open a stream, and at some point that blocks because you're only allowed to open a limited number of streams.

@Stebalien
Member

How many streams do we allow by default? We should probably fail when we hit such a limit instead of blocking (like "too many file descriptors").

@marten-seemann
Collaborator

marten-seemann commented Apr 11, 2019

We allow 1000 streams:

var quicConfig = &quic.Config{
	MaxIncomingStreams: 1000,
}

Returning an error is easy; the quic-go API exposes two different functions for that: OpenStreamSync, which blocks, and OpenStream, which errors when the limit is reached.

Applications need to be very careful when handling this error, though. An error returned by OpenStreamSync is always fatal (the connection has gone away, or similar).
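
A minimal sketch of the difference, assuming the quic-go Session API visible in the traces above (the helper name and error handling are illustrative, not quic-go's exact semantics):

// openStream contrasts the two quic-go calls described above.
func openStream(sess quic.Session) (quic.Stream, error) {
	// OpenStream fails fast once the peer's stream limit is reached;
	// that error is retryable, not fatal.
	if str, err := sess.OpenStream(); err == nil {
		return str, nil
	}
	// OpenStreamSync blocks until a stream slot frees up; any error
	// it returns means the connection itself has gone away.
	return sess.OpenStreamSync(context.Background())
}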

@Stebalien
Member

Got it. Let's switch to OpenStream for now unless you can think of a reason not to. We also need some kind of resource management mechanism for the relay transport...

@marten-seemann
Collaborator

The reason to use OpenStreamSync is that you usually expect to use fewer than 1000 streams, and you only exceed that limit temporarily. So it makes sense to wait until one of those 1000 streams is closed.
Really, the only thing you can do when receiving an error from OpenStream is retry.

@Stebalien
Member

We at least need to obey the context from network.NewStream. Unfortunately, it doesn't look like we feed it all the way through to smux.OpenStream.

It looks like we'll have to fix https://github.com/libp2p/go-stream-muxer/issues/27 first.

The reason to use OpenStreamSync is that you usually expect to use fewer than 1000 streams, and you only exceed that limit temporarily. So it makes sense to wait until one of those 1000 streams is closed.

If we were using streams as they're designed for, e.g., HTTP requests, that would make sense. However, we usually have a bunch of long-running streams.

Really, the only thing you can do when receiving an error from OpenStream is retry.

In a lot of cases, we'll just give up.

@acud

acud commented Jun 28, 2020

Hi there,

Just wanted to report we're seeing this too. Opening a stream just hangs indefinitely.

goroutine 1306535 [select, 1333 minutes]:
github.com/lucas-clemente/quic-go.(*outgoingBidiStreamsMap).OpenStreamSync(0xc0032a76c0, 0x128be20, 0xc00012c010, 0x0, 0x0, 0x0, 0x0)
	github.com/lucas-clemente/[email protected]/streams_map_outgoing_bidi.go:91 +0x228
github.com/lucas-clemente/quic-go.(*streamsMap).OpenStreamSync(0xc001c3f9c0, 0x128be20, 0xc00012c010, 0xaf00000000000005, 0xfd98d1, 0x6, 0xc000767180)
	github.com/lucas-clemente/[email protected]/streams_map.go:113 +0x47
github.com/lucas-clemente/quic-go.(*session).OpenStreamSync(0xc00072e780, 0x128be20, 0xc00012c010, 0xc3134c, 0xc0001339b0, 0xc000d58000, 0xc001e32990)
	github.com/lucas-clemente/[email protected]/session.go:1354 +0x48
github.com/libp2p/go-libp2p-quic-transport.(*conn).OpenStream(0xc002c4ce00, 0xc00028d950, 0xc001dd7da0, 0x0, 0xc00398c830)
	github.com/libp2p/[email protected]/conn.go:41 +0x4f
github.com/libp2p/go-libp2p-swarm.(*Conn).NewStream(0xc001dd7da0, 0xc001e32990, 0x27, 0xc001dd7da0, 0x2)
	github.com/libp2p/[email protected]/swarm_conn.go:171 +0x37
github.com/libp2p/go-libp2p-swarm.(*Swarm).NewStream(0xc000133980, 0x128be60, 0xc00169c900, 0xc001e32990, 0x27, 0x1, 0x1, 0xc0044637e0, 0x1)
	github.com/libp2p/[email protected]/swarm.go:325 +0x163
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStream(0xc0001a0420, 0x128be60, 0xc00169c900, 0xc001e32990, 0x27, 0xc00190ca40, 0x20, 0x20, 0x0, 0x0, ...)
	github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:486 +0x6e
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).NewStream(0xc0001a0420, 0x128be60, 0xc00169c900, 0xc001e32990, 0x27, 0xc0044637c0, 0x1, 0x1, 0xc000130060, 0x41331a, ...)
	github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:438 +0x4a9

In conn.go, the new stream's context is taken from context.Background():

// OpenStream creates a new stream.
func (c *conn) OpenStream() (mux.MuxedStream, error) {
	qstr, err := c.sess.OpenStreamSync(context.Background())
	return &stream{Stream: qstr}, err
}

So it is not entirely clear to me why at least a timeout is not present. The context parameter on libp2p's NewStream method suggests that cancelling the context will produce the desired behavior, which in this case is misleading. It would be good (and responsible) to document this behavior in the libp2p documentation, or at least under a known-issues section in this repo's README.md.
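
For illustration, a minimal sketch of a context-aware variant, reusing the types from the snippet above (OpenStreamContext is a hypothetical name, not part of the transport's API):

// OpenStreamContext is a hypothetical variant of OpenStream that
// honors a caller-supplied context, so cancellation or a deadline
// unblocks the call.
func (c *conn) OpenStreamContext(ctx context.Context) (mux.MuxedStream, error) {
	qstr, err := c.sess.OpenStreamSync(ctx)
	if err != nil {
		return nil, err
	}
	return &stream{Stream: qstr}, nil
}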

@marten-seemann
Collaborator

The right fix for that is probably to propagate the context from NewStream all the way down to the transport, see libp2p/go-libp2p-core#86.
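
In interface terms, the proposal amounts to something like the following (a sketch of the direction, not the final go-libp2p-core API):

// Sketch: the stream-muxer interface gains a context parameter, so a
// blocked stream open can be cancelled or timed out by the caller.
type MuxedConn interface {
	// OpenStream creates a new stream and honors ctx cancellation.
	OpenStream(ctx context.Context) (MuxedStream, error)
	// ... remaining methods unchanged ...
}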

@acud

acud commented Jun 29, 2020

Yep, I've seen the issue. Are you accepting contributions for this? Which dependent packages/repos are involved if I pick this up? If the change is done and pending, can I expect it to be merged? Or will this be delayed by the upcoming mainnet milestone for Filecoin?

@Stebalien
Member

Stebalien commented Jun 29, 2020

Unfortunately, the real fix involves propagating changes to everything that uses libp2p streams. For now, the best we can do is set a very short stream open timeout internally (or just not wait at all).

@marten-seemann can you tackle this?

edit: by "this" I mean setting a reasonable timeout?
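
A minimal sketch of that interim fix, reusing the conn.OpenStream code quoted above (the 10-second value is illustrative, not a decided default):

// OpenStream creates a new stream, but gives up after a fixed timeout
// instead of blocking indefinitely at the stream limit.
func (c *conn) OpenStream() (mux.MuxedStream, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	qstr, err := c.sess.OpenStreamSync(ctx)
	if err != nil {
		return nil, err
	}
	return &stream{Stream: qstr}, nil
}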

@acud

acud commented Jun 29, 2020

@Stebalien Yeah, I understood that this change involves more than just changing the interface method signature (and breaking all the other packages that implement the interface).
That's why I asked:

  • Which dependent packages/repos are involved if I pick this up (i.e. which repos that implement the interface would need to be changed to propagate the context object)?
  • Would this contribution be delayed due to Filecoin milestones? (I know they are about to go to mainnet soon, so would this sort of change be subject to a code freeze?)

@Stebalien
Member

It will also need to be propagated to everything that uses go-libp2p and will be a significant breaking change. We'll likely want to bundle it with libp2p/go-libp2p-core#9.

Basically, this change will need to be made by someone with write access to go-ipfs & go-libp2p repos, and contacts with most users of go-libp2p. It's a refactor that needs to be done (or at least shepherded) by a member of the core libp2p team.

It has been a low priority because you should only hit it if you manage to open 1000 streams to a single peer and never close them. If you're hitting this, you definitely have a bug somewhere else in your code (probably related to libp2p/go-libp2p-core#9).

@acud

acud commented Jun 29, 2020

If you're hitting this, you definitely have a bug somewhere else in your code

This actually seems to have been hit because of an infrastructure configuration problem, not an actual bug in the code. It is still not entirely clear why nodes were connecting over QUIC successfully but NewStream would hang. We're still exploring this, but have had to disable QUIC by default for now. TCP works fine.

It seems that mplex and yamux need this context object propagated. mplex overrides it with an arbitrary 1-minute timeout in NewNamedStream, and yamux could add a select case on context cancellation in OpenStream:

	// Block if we have too many inflight SYNs
	select {
	case s.synCh <- struct{}{}:
	case <-s.shutdownCh:
		return nil, s.shutdownErr
	}
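
For example, the same block with a context case added (a sketch, assuming OpenStream gains a ctx parameter):

	// Block if we have too many inflight SYNs, but also honor the
	// caller's context so a stuck open can be cancelled.
	select {
	case s.synCh <- struct{}{}:
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-s.shutdownCh:
		return nil, s.shutdownErr
	}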

I'll try to have a look at the satellite repos in the upcoming days to see if I could help with this.

@Stebalien
Member

It is still not entirely clear why nodes were connecting over QUIC successfully but NewStream would hang

That is very strange. Could it be a keepalive bug in QUIC?

@acud

acud commented Jun 29, 2020

Not entirely sure, as we are using custom discovery that we've built on top of libp2p. I'll need to dig into why QUIC was selected as the observed or preferred underlay transport in the first place. But I'll definitely keep you informed if we find this to be a bug in QUIC or libp2p 👍

@Stebalien
Member

Stebalien commented Jun 29, 2020 via email

@acud

acud commented Jun 29, 2020

We're using libp2p + QUIC directly, although we don't have any QUIC-aware implementations; we're just using it as a pluggable transport configured alongside libp2p.

@Stebalien
Member

Got it. Libp2p will prefer the QUIC transport, in general, as it doesn't require a file descriptor per connection and the handshake is faster. This means:

  1. TCP connections can get stuck in a TCP dial queue (we limit the number of parallel TCP dials to avoid using too many file descriptors).
  2. QUIC connection handshakes finish faster. This means that, even if a QUIC and TCP connection start at the same time, the QUIC connection usually finishes first (and we'll cancel the TCP dial).

But that shouldn't be relevant. Try calling conn.GetStreams() on a connection that should have no streams. If you see a bunch of streams, you're probably running into the issue I linked to above.
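
A minimal sketch of that check, assuming a go-libp2p host h (the names here are illustrative):

// Log any connection that still holds open streams when we expect
// none; leftover streams point at the stream-leak issue linked above.
for _, c := range h.Network().Conns() {
	if n := len(c.GetStreams()); n > 0 {
		log.Printf("connection to %s holds %d open streams", c.RemotePeer(), n)
	}
}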

@Stebalien
Member

We now feed the context down, so this should be "fixed".
