goroutine build-up stuck in dialing through quic #57
I'm not sure I understand the issue here. You're trying to open streams, and at some point that blocks because you're only allowed to open a limited number of streams.
How many streams do we allow by default? We should probably fail when we hit such a limit instead of blocking (like "too many file descriptors").
We allow 1000 streams: see go-libp2p-quic-transport/transport.go, lines 21 to 22 at 2c4db47.
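(For readers without the repo handy: a minimal sketch of the kind of configuration those lines contain, using quic-go's `Config` — illustrative, not a copy of the referenced code.)

```go
package transport

import quic "github.com/lucas-clemente/quic-go"

// Illustrative sketch of the stream cap: quic-go limits concurrent
// streams per connection via its Config. Once a peer has this many
// streams open, further opens either block or error (see below).
var quicConfig = &quic.Config{
	MaxIncomingStreams: 1000, // streams the remote peer may have open at once
}
```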
Returning an error is easy; the quic-go API exposes two different functions for that: `OpenStreamSync`, which blocks, and `OpenStream`, which errors when reaching the limit.
Applications need to be very careful with catching this error, though. An error returned by …
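For context, here is a minimal sketch of the two calls being contrasted (assuming quic-go's `Session` API of that era; `openWithFallback` and its fallback strategy are my own illustration, not the transport's code):

```go
package main

import (
	"context"
	"log"

	quic "github.com/lucas-clemente/quic-go"
)

// openWithFallback illustrates the behavioural difference: OpenStream
// fails fast once the stream limit is reached, while OpenStreamSync
// blocks until a slot frees up or ctx is cancelled.
func openWithFallback(ctx context.Context, sess quic.Session) (quic.Stream, error) {
	str, err := sess.OpenStream()
	if err == nil {
		return str, nil
	}
	log.Printf("stream limit reached (%v), waiting for a free slot", err)
	return sess.OpenStreamSync(ctx) // respects ctx cancellation/deadline
}
```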
Got it. Let's switch to `OpenStream`.
The reason to use …
We at least need to obey the context from … It looks like we'll have to fix https://github.com/libp2p/go-stream-muxer/issues/27 first.
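That issue is about adding a context parameter to the muxer interface; a hypothetical sketch of the shape of that change (type and method names assumed, not the final API):

```go
package mux

import "context"

// MuxedStream stands in for the real stream interface, which also
// carries Read/Write/Reset; it is stubbed here to keep the sketch short.
type MuxedStream interface {
	Close() error
}

// MuxedConn sketches a context-aware muxer: OpenStream blocks only as
// long as the caller's ctx allows, instead of potentially forever.
type MuxedConn interface {
	OpenStream(ctx context.Context) (MuxedStream, error)
	Close() error
}
```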
If we were using streams as designed, e.g. for HTTP, that would make sense. However, we usually have a bunch of long-running streams.
In a lot of cases, we'll just give up.
Hi there, just wanted to report that we're seeing this too. Opening a stream just hangs indefinitely.
In the transport's `conn` implementation:

```go
// OpenStream creates a new stream.
func (c *conn) OpenStream() (mux.MuxedStream, error) {
	qstr, err := c.sess.OpenStreamSync(context.Background())
	return &stream{Stream: qstr}, err
}
```

So it is not entirely clear to me why at least a timeout is not present. The context parameter in `OpenStreamSync` …
The right fix for that is probably to propagate the context from …
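Concretely, the caller-facing half of that already exists: `host.NewStream` takes a context, so once it is threaded down to `OpenStreamSync`, a sketch like this would stop hanging (protocol ID and timeout here are illustrative):

```go
package main

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
)

// openWithTimeout shows what context propagation buys the caller: once
// NewStream's ctx reaches OpenStreamSync, this deadline unblocks a dial
// that is stuck on the per-connection stream limit.
func openWithTimeout(h host.Host, p peer.ID) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	s, err := h.NewStream(ctx, p, "/echo/1.0.0")
	if err != nil {
		return err
	}
	return s.Close()
}
```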
Yep, I've seen the issue. Are you accepting contributions for this? Which dependent packages/repos are involved if I pick this up? If the change is done and pending, can I expect it to be merged, or will it be delayed due to the upcoming mainnet milestone for Filecoin?
Unfortunately, the real fix involves propagating changes to everything that uses libp2p streams. For now, the best we can do is set a very short stream-open timeout internally (or just not wait at all). @marten-seemann, can you tackle this? (edit: by "this" I mean setting a reasonable timeout)
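A minimal sketch of that interim workaround (my assumption of its shape, not the shipped change): wrap the blocking call in a short internal deadline so a full stream window surfaces as an error.

```go
package main

import (
	"context"
	"time"

	quic "github.com/lucas-clemente/quic-go"
)

// openStreamWithTimeout bounds OpenStreamSync internally: if no stream
// slot frees up within d, the dial fails instead of hanging forever.
func openStreamWithTimeout(sess quic.Session, d time.Duration) (quic.Stream, error) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return sess.OpenStreamSync(ctx)
}
```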
@Stebalien Yeah, I understood that this change involves more than just changing the interface method signature (and breaking all the other packages that implement the interface).
It will also need to be propagated to everything that uses go-libp2p and will be a significant breaking change. We'll likely want to bundle it with libp2p/go-libp2p-core#9.

Basically, this change will need to be made by someone with write access to the go-ipfs & go-libp2p repos, and contacts with most users of go-libp2p. It's a refactor that needs to be done (or at least shepherded) by a member of the core libp2p team.

It has been a low priority because you should only hit it if you manage to open 1000 streams to a single peer and never close them. If you're hitting this, you definitely have a bug somewhere else in your code (probably related to libp2p/go-libp2p-core#9).
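As a practical note on the "never close them" point, the usual pattern that keeps a node well under the cap looks like this (protocol ID and payload are illustrative):

```go
package main

import (
	"context"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
)

// doRequest sketches the stream lifecycle that avoids the 1000-stream
// cap: every stream that is opened is also closed when the exchange ends.
func doRequest(ctx context.Context, h host.Host, p peer.ID) error {
	s, err := h.NewStream(ctx, p, "/myproto/1.0.0")
	if err != nil {
		return err
	}
	defer s.Close() // without this, streams leak until dials block
	_, err = s.Write([]byte("ping"))
	return err
}
```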
This actually seems to have been hit because of some infrastructure configuration problem, not an actual bug in the code. It is still not entirely clear why nodes were connecting over QUIC successfully but … It seems that it blocks here:

```go
// Block if we have too many inflight SYNs
select {
case s.synCh <- struct{}{}:
case <-s.shutdownCh:
	return nil, s.shutdownErr
}
```

I'll try to have a look at the satellite repos in the upcoming days to see if I can help with this.
That is very strange. Could it be a keepalive bug in QUIC?
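(For reference, keepalives are opt-in in quic-go; a sketch of enabling them, assuming the boolean `KeepAlive` field quic-go exposed around this time — newer releases use a `KeepAlivePeriod` duration instead.)

```go
package main

import quic "github.com/lucas-clemente/quic-go"

// With KeepAlive set, quic-go sends periodic PING frames so an otherwise
// idle connection is not torn down by the idle timeout; without it, idle
// connections can silently die, which would look like the behaviour above.
var quicConfig = &quic.Config{
	KeepAlive: true,
}
```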
Not entirely sure, as we are using custom discovery that we've built on top of …
Are you using go-ipfs, or QUIC + libp2p directly?
We're using …
Got it. Libp2p will prefer the QUIC transport, in general, as it doesn't require a file descriptor per connection and the handshake is faster. This means: …

But that shouldn't be relevant. Try calling …
We now feed the context down, so this should be "fixed".
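For completeness, a sketch of what "feeding down context" amounts to in the snippet quoted earlier (reusing those types; this approximates the fix rather than quoting the committed diff):

```go
// OpenStream now accepts the caller's context and hands it straight to
// quic-go, so a deadline or cancellation unblocks a stuck dial.
func (c *conn) OpenStream(ctx context.Context) (mux.MuxedStream, error) {
	qstr, err := c.sess.OpenStreamSync(ctx)
	return &stream{Stream: qstr}, err
}
```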
We are seeing a slow goroutine build-up in our relays, stuck in dialing through QUIC: …

The reason for dialing these peers in the first place is yet unknown, but the build-up is more alarming and indicative of a bug.
The daemon is running with the latest go-libp2p-quic-transport release, v0.0.3.
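For anyone debugging a similar build-up, a standard way to confirm where the goroutines are stuck is Go's pprof endpoint (a generic sketch, not part of the original report):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Fetch http://localhost:6060/debug/pprof/goroutine?debug=2 for full
	// stacks; dials stuck in OpenStreamSync show up there directly.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```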