Hang during Connecting.await for incoming connections #650
This is expected behavior if the receiver's
This sounds like a bug. If you add
I reworked the test case to better isolate the hang. Now it's clear that no packets are being sent or received once it hangs. (The packets I was observing earlier came from connection attempts by other connections.) Here is the tail of the trace; the full trace is here.
This makes sense, thanks for clarifying.
I've just re-run it with some of quinn's anti-amplification logic (which shows up in the above trace) disabled. Unfortunately that did not resolve it, but here's the differing trace in case it's useful.
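For context on the anti-amplification logic mentioned above: RFC 9000 (section 8.1) requires that, until a server has validated the client's address, it send at most three times as many bytes as it has received from that address. The sketch below is a minimal, hypothetical model of that budget. It is not quinn's implementation, and the names `AmplificationBudget`, `on_receive`, `mark_validated` and `try_send` are made up for illustration.

```rust
/// Hypothetical model of the anti-amplification budget for one client address.
struct AmplificationBudget {
    received: u64,
    sent: u64,
    validated: bool,
}

/// RFC 9000: at most 3x the bytes received before address validation.
const AMPLIFICATION_FACTOR: u64 = 3;

impl AmplificationBudget {
    fn new() -> Self {
        Self { received: 0, sent: 0, validated: false }
    }

    /// Record a datagram received from the client's address.
    fn on_receive(&mut self, bytes: u64) {
        self.received += bytes;
    }

    /// Address validated (e.g. after processing a Handshake packet from the
    /// client); from now on the limit no longer applies.
    fn mark_validated(&mut self) {
        self.validated = true;
    }

    /// Bytes that may still be sent right now; `None` means unlimited.
    fn remaining(&self) -> Option<u64> {
        if self.validated {
            None
        } else {
            Some((self.received * AMPLIFICATION_FACTOR).saturating_sub(self.sent))
        }
    }

    /// Try to send a datagram; returns false if it would exceed the budget.
    fn try_send(&mut self, bytes: u64) -> bool {
        match self.remaining() {
            Some(left) if bytes > left => false,
            _ => {
                self.sent += bytes;
                true
            }
        }
    }
}

fn main() {
    let mut budget = AmplificationBudget::new();
    budget.on_receive(1200); // one client Initial datagram
    assert_eq!(budget.remaining(), Some(3600));
    assert!(budget.try_send(1200));
    assert!(budget.try_send(1200));
    assert!(budget.try_send(1200));
    // A fourth datagram would exceed 3x what was received, so the server
    // has to wait for more client data (or validation) before sending.
    assert!(!budget.try_send(1200));
    budget.mark_validated();
    assert!(budget.try_send(1200));
    println!("ok");
}
```

In this model, a server that has used up its budget stays silent until more bytes arrive from the client or the address is validated.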
I suspect this might be due to the somewhat dubious handshake state machine in draft 24. I'm going to try to get us updated to draft 27 and then revisit.
The draft 27 update has been merged.
I'm still seeing the same hang on current master. From running the test case a few times I believe it is less frequent, though I could be wrong; I'll test more tomorrow. I'm also seeing #670.
Thanks for checking. I haven't been able to reproduce either case using the current reduced-hang branch, having run up to 100k iterations; updated traces could be helpful.
Reviewing the above traces, it looks like the server is functioning normally at the -proto level (i.e. it deems the connection established and processes the client's stream frames), so there must be something wrong with the pathway that wakes up
In the second-to-last trace above, I believe
indicates that control has reached a point which must pass through I'm having trouble seeing how that chain could malfunction. Do I recall correctly that you mentioned suspecting a tokio issue previously?
@alecmocatta it would be cool if you could investigate some more with current master (which includes some fixes). Please also make sure to run the latest tokio, and then maybe add some debugging based on @Ralith's pointers above. Let me know if you have more questions.
The issue I suspected was related to tokio only showed up when spawning onto a multi-threaded runtime, and it involved two threads busy-looping. I don't think it's likely to be triggering this as well, but it's not impossible. Under my setup (a 1-core Linux 4.18 VM running under VirtualBox), after having run the above
Ah, but isn't that repository still using the sync-defragmented branch? You should try with master.
@djc Check out the
Thanks for verifying! I forgot about the netem stuff; I'll investigate further today. Given the above investigation, I'm confident I can at least find an assumption being invalidated somewhere, assuming I was looking at the right ID.
With netem I'm able to reproduce both issues very consistently, often in the first iteration, in fact. Thanks!
I'm pretty sure I see what's happening with the hang:
At this point the server cannot take any further action on the connection initiated by the duplicated packet. Because you've disabled the idle timeout, the connection is permanently hung. This is working as intended. In summary, the idle timeout must not be disabled in environments where a client might disappear unexpectedly or packets may be duplicated and no other mechanism exists to clean up zombie connections. I'll prepare a PR to update the documentation to clarify this.
Thanks for that comprehensive explanation @Ralith!
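A toy model of why the idle timeout matters for zombie connections is sketched below. It is not quinn code; `ConnectionTable`, `on_packet` and `reap` are hypothetical names. Each connection records the logical time of its last activity, and `reap` drops any connection that has been silent for longer than the idle timeout. Connection 2 stands in for the connection created by a duplicated packet: nothing ever arrives for it again, so only an enabled idle timeout removes it.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical server-side connection table. `idle_timeout: None` models a
/// configuration with the idle timeout disabled.
struct ConnectionTable {
    idle_timeout: Option<Duration>,
    /// connection id -> logical time of the last packet seen for it
    last_activity: HashMap<u64, Duration>,
}

impl ConnectionTable {
    fn new(idle_timeout: Option<Duration>) -> Self {
        Self { idle_timeout, last_activity: HashMap::new() }
    }

    /// A packet for connection `id` arrived at logical time `now`.
    fn on_packet(&mut self, id: u64, now: Duration) {
        self.last_activity.insert(id, now);
    }

    /// Remove and return (sorted) every connection silent for longer than
    /// the idle timeout.
    fn reap(&mut self, now: Duration) -> Vec<u64> {
        let timeout = match self.idle_timeout {
            Some(t) => t,
            None => return Vec::new(), // timeout disabled: nothing ever expires
        };
        let mut expired: Vec<u64> = self
            .last_activity
            .iter()
            .filter(|&(_, &last)| now.saturating_sub(last) > timeout)
            .map(|(&id, _)| id)
            .collect();
        expired.sort();
        for id in &expired {
            self.last_activity.remove(id);
        }
        expired
    }
}

fn main() {
    let timeout = Duration::from_secs(10);
    for idle in [Some(timeout), None] {
        let mut table = ConnectionTable::new(idle);
        // Connection 1 is healthy and keeps exchanging packets.
        table.on_packet(1, Duration::from_secs(0));
        // Connection 2 was created by a duplicated packet and never hears
        // from the peer again.
        table.on_packet(2, Duration::from_secs(0));
        table.on_packet(1, Duration::from_secs(25));
        let reaped = table.reap(Duration::from_secs(30));
        println!(
            "idle_timeout={:?} reaped={:?} live={}",
            idle,
            reaped,
            table.last_activity.len()
        );
    }
}
```

With the 10-second timeout, the reap at t = 30 s removes connection 2 and keeps connection 1; with the timeout disabled, both stay in the table indefinitely.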
Running my previous test case further surfaces two more issues:
- ConnectionError::Reset on line 82. This may be a bug, but like the ApplicationClosed error in my previous example it doesn't block progress, so it is ignored.
- Connecting.await on line 103 hangs. strace shows packets are being sent and received, but this .await never returns.
I used the following to simulate an unreliable network on Linux:
Note this seems to be buggy on kernels < 4.18.
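If a test harness needs to keep going even when a wait like the Connecting.await above never returns, one option is to bound the wait with a deadline. In async code this is usually done with tokio::time::timeout; the std-only sketch below expresses the same idea with a helper thread and mpsc::Receiver::recv_timeout so it stays self-contained. `wait_with_deadline` and `Outcome` are hypothetical names. This only keeps the harness from blocking; it does not clean up the stuck connection itself, which is what the transport's idle timeout is for.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Result of waiting for some (hypothetical) handshake-like operation.
#[derive(Debug, PartialEq)]
enum Outcome<T> {
    Done(T),
    TimedOut,
}

/// Run `work` on a helper thread and wait at most `deadline` for its result.
/// On timeout the helper thread is abandoned, not cancelled.
fn wait_with_deadline<T, F>(deadline: Duration, work: F) -> Outcome<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If we already timed out, the receiver is gone; ignore the error.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Outcome::Done(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Outcome::TimedOut,
        // The sender was dropped without sending: the worker panicked.
        Err(mpsc::RecvTimeoutError::Disconnected) => panic!("worker thread panicked"),
    }
}

fn main() {
    // An operation that completes promptly.
    let fast = wait_with_deadline(Duration::from_millis(500), || "connected");
    // An operation that never completes, standing in for a hung await.
    let hung = wait_with_deadline::<&str, _>(Duration::from_millis(100), || loop {
        thread::park();
    });
    println!("{:?} {:?}", fast, hung);
}
```

Keeping the stuck work on its own thread means the caller's own progress never depends on whether that work finishes.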