bug(transport): sendmsg: invalid argument when using QUIC on latest version #2591
Do you have …
Yes, it is on Linux with x86_64 arch. Setting QUIC_GO_DISABLE_GSO seems to fix this, but I need to fully reproduce the scenario again, or get more uptime, to confirm.
You don't have to try or change it. I just want to know what the state of this environment variable was when the bug happened.
It was not set, so false.
@distractedm1nd Which operating system are you using? Apparently there are some problems with FreeBSD; see quic-go/quic-go#4105 and quic-go/quic-go#4106.
@marten-seemann We are using Alpine Linux for these nodes. Thanks for linking the issues, I'll look into them.
Can you try setting … This way, we can find out whether GSO or ECN is the problem.
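One way to run the GSO-versus-ECN experiment is to start the node once per switch, each time with exactly one feature disabled. A minimal sketch, assuming the switches are QUIC_GO_DISABLE_GSO (mentioned earlier in this thread) and QUIC_GO_DISABLE_ECN (an assumption, since the exact flags are not preserved here); experimentEnvs and withoutSwitches are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Assumed switch names: QUIC_GO_DISABLE_GSO is mentioned in this thread;
// QUIC_GO_DISABLE_ECN is an assumption about quic-go's ECN switch.
var quicSwitches = []string{"QUIC_GO_DISABLE_GSO", "QUIC_GO_DISABLE_ECN"}

// withoutSwitches copies base, dropping any existing setting of the switches
// so that a stale value cannot disable both features in the same run.
func withoutSwitches(base []string) []string {
	out := make([]string, 0, len(base))
	for _, kv := range base {
		keep := true
		for _, name := range quicSwitches {
			if strings.HasPrefix(kv, name+"=") {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, kv)
		}
	}
	return out
}

// experimentEnvs returns one environment per run, each disabling exactly one
// feature, so an error (or its absence) can be attributed to GSO or ECN.
func experimentEnvs(base []string) [][]string {
	clean := withoutSwitches(base)
	runs := make([][]string, 0, len(quicSwitches))
	for _, name := range quicSwitches {
		// Fresh copy per run so the runs never share a backing array.
		env := append(append([]string(nil), clean...), name+"=true")
		runs = append(runs, env)
	}
	return runs
}

func main() {
	for i, env := range experimentEnvs(os.Environ()) {
		fmt.Printf("run %d: %s\n", i+1, env[len(env)-1])
	}
}
```

Each returned slice can be assigned to the Env field of an exec.Cmd that launches the node, so the two runs differ only in which feature is disabled.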
Yes, I will try to recreate the scenario today. Once I can reproduce the errors consistently again, I will experiment with the flags and post the results here.
Would the next step be enabling QUIC debug logs and finding out what is happening right before this occurs? |
@distractedm1nd Can you share details about the environment you're running this on? We've had reports of GSO failures on some platforms (quic-go/quic-go#3911), although far fewer since the v0.38.0 release. Alternatively, could you add some logging in https://github.com/quic-go/quic-go/blob/master/sys_conn_oob.go#L240, so we can see what exactly the arguments to …
@marten-seemann, is there a way to qlog this, or do you want us to print out the inputs there?
@Wondertan There's no way to qlog this at the moment. We could change the code to pass a qlogger into that function if we need to for debugging, but for now I suggest we do the easy thing and just print the inputs. This assumes the bug can be triggered reliably, so that we don't spam the log with entries from unrelated connections.
I added logs to WritePacket right before the WriteMsgUDP call, and they are not being logged. However, I am seeing the following spammed in the logs:
for both server and client. Am I doing something wrong here, or do I need to add logs to a different method? I am logging with …
That will only work if you run with …
Yes, we ran with …
Disables QUIC by default, while allowing it to be enabled with an environment variable. This is temporary, until libp2p/go-libp2p#2591 is investigated and fixed. The current solution disables QUIC programmatically, while the config still lists QUIC listen addresses by default. This means users won't need to reinitialize their configs once we turn QUIC back on by default. This works fine on mocha.
@distractedm1nd Do you have any update on this?
Sorry Marten, I've been out this week for my wedding and will get back to it as soon as possible next week. Unfortunately, my last attempts didn't reproduce the issue.
We should check whether quic-go/quic-go#4127 is related.
Please do. There's a quic-go v0.39.2 patch release that contains fixes for both FreeBSD and Linux.
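To confirm that a build actually picked up the patch release, a binary can inspect its own module graph with runtime/debug.ReadBuildInfo. A sketch under simplified semver rules (pre-release tags are only compared as "before the release", not ordered among themselves); atLeast and parseVersion are hypothetical helpers, and v0.39.2 is the release named in the comment above.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"strconv"
	"strings"
)

const quicGoModule = "github.com/quic-go/quic-go"

// parseVersion parses "vMAJOR.MINOR.PATCH[-pre][+build]". pre reports
// whether a pre-release suffix (including Go pseudo-versions) was present.
func parseVersion(v string) (nums [3]int, pre bool, ok bool) {
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexByte(v, '+'); i >= 0 {
		v = v[:i] // drop build metadata such as "+incompatible"
	}
	if i := strings.IndexByte(v, '-'); i >= 0 {
		v, pre = v[:i], true
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return nums, false, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nums, false, false
		}
		nums[i] = n
	}
	return nums, pre, true
}

// atLeast reports whether version v is >= min (simplified semver).
func atLeast(v, min string) bool {
	a, aPre, ok := parseVersion(v)
	if !ok {
		return false
	}
	b, bPre, ok := parseVersion(min)
	if !ok {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return a[i] > b[i]
		}
	}
	// Same MAJOR.MINOR.PATCH: a pre-release sorts before the release.
	return !aPre || bPre
}

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info available")
		return
	}
	for _, dep := range info.Deps {
		if dep.Path != quicGoModule {
			continue
		}
		version := dep.Version
		if dep.Replace != nil { // a local-path replace has no version and reports false
			version = dep.Replace.Version
		}
		fmt.Printf("%s %s includes v0.39.2 fixes: %v\n", dep.Path, version, atLeast(version, "v0.39.2"))
		return
	}
	fmt.Println(quicGoModule, "is not linked into this binary")
}
```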
Oops, it seems we need more information for this issue. Please comment with more details, or this issue will be closed in 7 days.
Closing since this has (probably) been fixed by the latest quic-go release. Please feel free to reopen if it still happens after updating your dependencies.
Reverts the change that disabled QUIC. We haven't been able to reproduce [the issue](libp2p/go-libp2p#2591) since we first observed it, and quic-go has since released changes to address it. Even if the issue is still there, I think this is reasonably safe, as we always have TCP as a fallback.
Since upgrading to v0.31.0, we have encountered a new error:
sendmsg: invalid argument
I am not yet sure how the entire network (15 nodes) got into this state, but all nodes were logging similar errors, and this persisted until the network degraded to the point where the protocol was no longer being used. After restarting the nodes, the network recovered and has been stable since. My theory is that it happened when a single node, which was acting as the original source of the information being shared throughout the network, became overloaded. I will gladly try to recreate the scenario with whatever debugging flags are needed to gather more information.
I saw this issue in quic-go that seems to be resolved but may be related: quic-go/quic-go#3911. In our network, both the clients and the servers run on x86_64.
Clients started logging:
Servers of this protocol logged the same error along with a stateless reset for concurrent streams:
go.mod: https://github.com/celestiaorg/celestia-node/blob/main/go.mod