[test] fix hang in macOS variants of reverse diagnostics server tests #40225
Conversation
Your description above seems to refer to C++ documentation, but the issue seems to be fixed in C# code. Based on fixing the race, I think the proposed solution seems about right for C++ code. I am just not sure why we created the race... Why are we tearing down the socket and recreating it? I'm not very familiar with Unix sockets, but it doesn't match the pattern I have seen for other types of sockets. Generally the initial socket stays open; a new socket is created for each connection, and when the connection is closed only that connection-specific socket is closed. Apparently the Unix domain socket is using a named file for the socket. Are we not creating a unique socket name for each connection? How is the connection-specific filename being discovered by the client? I would assume the well-known socket name should only be closed at app shutdown.
The BCL ultimately calls the system APIs for things like socket communication and adds a C# API on top of them. The issue I described above is with the remote reverse server, not the code in the runtime itself. This is effectively a test issue and not a product issue. The runtime is doing the correct thing, but the "correct" thing is resulting in a hang because the system-level APIs are not behaving. I would not expect a well-behaved reverse server to operate like this. I would expect it to behave as you describe, i.e., create the socket once, bind to it, and manage the returned sockets from calls to `accept`.

The race I observed seems to be something unique to macOS and how the socket APIs behave. I'd be curious to see if this behavior repros on other BSD-based operating systems. I would expect the BCL to fall prey to this same race on macOS since System.Net.Sockets ultimately calls the same APIs as my C++/C# code does. There isn't anything in the BCL that could prevent this, since it has to do with the order in which the system-level APIs are called and not the infrastructure (like the async engine) that the BCL adds on top of that.
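For reference, here is a minimal C++ sketch of the well-behaved pattern described above (an illustration only, not the test or runtime code; the socket path, buffer handling, and missing error checks are assumptions): the well-known socket is created and bound once, and only the per-connection sockets returned by `accept` are closed.

```cpp
// Minimal sketch of a well-behaved Unix domain socket reverse server.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main()
{
    const char* path = "/tmp/diag-reverse.sock";  // hypothetical well-known name

    int listenFd = socket(AF_UNIX, SOCK_STREAM, 0);
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    unlink(path);                                  // clear any stale file before bind
    bind(listenFd, (sockaddr*)&addr, sizeof(addr));
    listen(listenFd, 255);

    for (;;)
    {
        int connFd = accept(listenFd, nullptr, nullptr);  // per-connection socket
        char buf[512];
        ssize_t n = read(connFd, buf, sizeof(buf));       // e.g., the advertise message
        if (n > 0)
            std::printf("accepted a connection and read %zd bytes\n", n);
        close(connFd);  // only the per-connection socket is closed here
    }

    // The well-known socket is closed and its file unlinked only at shutdown
    // (unreachable in this sketch).
    close(listenFd);
    unlink(path);
    return 0;
}
```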
FYI CI is all green, but GH says the live build is still running.
@josalem that is an occasional glitch we have seen that was reported to GitHub.
These sound contradictory? : ) If the runtime can't handle something that it should be able to handle, that is the definition of a product issue. I think the discussion is making one of a few assumptions; I want to make it explicit which one and make sure everyone agrees with the choice. A few other misc thoughts:
Most of the changes look fine, but I am curious about the Listen() change in the test?
Co-authored-by: Noah Falk <[email protected]>
* IpcStream didn't use the serverSocket param anymore and would cause IpcStreams to show up as SERVER types. This didn't cause any issues, but will make more sense in the debugger
* Added back the backlog for the reverse server since that wasn't relevant to the issue.
@noahfalk, I decided to do some empirical tests to validate my hypothesis. I used the original code with a timeout of 5 minutes on the EventPipe session (instead of 500 ms) and placed breakpoints at various points on both sides of the connection. I was unable to force the behavior to happen while under the debugger, so I ran a stress test of 8 instances in an infinite loop in the background while I debugged both halves of a 9th instance. I was still unable to reproduce the behavior. The 8 instances running in the background, however, all individually reproduced the behavior within the first 5-minute loop. The increase in the EventPipe session time causes the socket-bind-accept-read-close cycle to happen very quickly for ~5 minutes, which appears to have exacerbated the issue.

Typically, when you read an entry for a Unix domain socket in `lsof`, the address under DEVICE is the kernel address of the socket object and the data under NAME is the "thing" it is connected to; in this case, it is the address of the socket at the other end. Typically, when I grep the output for the test's socket, I see an entry for each end of the connection. In the hung state that the process ends up in, there is only one entry: the runtime's end of the connection.
Somehow, we get into a state where no one owns the other end of the valid connection the runtime has. This state reproduces with both the C++ reverse server I wrote and the System.Net.Sockets reverse server I wrote. Typically, this state is what would cause calls on the socket, like `write`, to fail. To validate whether this problem is unique to the runtime, I wrote a pair of C++ programs that follow the same pattern as the runtime and the test server, and I was able to reproduce the issue immediately on my Mac. From this I posit that this is some form of non-deterministic behavior in macOS's implementation of the socket APIs.

All that said, and more to your line of questioning, Noah, I don't think this is purely a test issue. There is a small corner case, specific to macOS, where the runtime can non-deterministically hang. I don't believe this to be high risk, though. My observations indicate it only happens on macOS and only in this pathological case. The test is specifically trying to simulate a crash-looping reverse server, and more specifically an extreme case of that. Please let me know if you see this as higher risk. I'll continue to look into this to see if there is a way to prove whether the issue is in the kernel.
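For context, here is a rough sketch of what the runtime-like half of such a repro pair might look like (the socket path, payload, and retry delay are invented for illustration; this is not the author's actual repro program): connect to the well-known socket, send an advertise message, then block in `poll` until the server hangs up and loop around to reconnect.

```cpp
// Sketch of a runtime-like client that advertises and reconnects after POLLHUP.
#include <sys/socket.h>
#include <sys/un.h>
#include <poll.h>
#include <unistd.h>
#include <cstring>

int main()
{
    const char* path = "/tmp/diag-reverse.sock";  // hypothetical well-known name

    for (;;)
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

        if (connect(fd, (sockaddr*)&addr, sizeof(addr)) != 0)
        {
            close(fd);
            usleep(10 * 1000);  // server not up (or mid-restart); retry shortly
            continue;
        }

        const char advertise[] = "ADVR";  // stand-in for the real advertise payload
        write(fd, advertise, sizeof(advertise));

        // Block until the server sends a command (POLLIN) or hangs up (POLLHUP).
        pollfd pfd{};
        pfd.fd = fd;
        pfd.events = POLLIN;
        poll(&pfd, 1, -1);

        // On hangup the loop closes this socket and reconnects. If that next
        // connect() beats the server's unlink(), the new connection can end up
        // orphaned and the following poll() never returns (the observed hang).
        close(fd);
    }
}
```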
Nice investigating @josalem! To be pedantic I think your discussion above is picking option (b)?
That feels like a very reasonable choice to me.
@noahfalk, yes, you are correct. A very long-winded way to say "option b" 😆. I left these changes running for >8 hours with the timeout in the test being 5 minutes and hit no hangs. Barring any objection, I'll merge this in this afternoon.
No, it does not fix it. That is an unrelated issue that is due to the GC and the Thread Pool trying to take the same lock when under GCStress. I'm okay leaving the test off to avoid the assert, but I think the thread pool folks or GC folks should look at the issue in case it could happen outside GCStress.
…dotnet#40225) Co-authored-by: Noah Falk <[email protected]>
Port of dotnet#40225.
fix #39979
See #39979 for the gist of what this issue is. The hang in the test appears to come from a race between `close`, `connect`, and `unlink` across the reverse server and the runtime. This only happens on macOS. When the reverse server calls `close`, it causes the runtime to break out of `poll` with a `POLLHUP` event. If the runtime beats the reverse server and calls `connect` before the server can call `unlink` on the Unix domain socket, the runtime can successfully open a connection, but that connection is orphaned by the subsequent `unlink` and `close` and isn't associated with the new socket that the server then calls `bind` on.

This PR fixes the behavior by `unlink`ing (read: `File.Delete(...)`ing) the Unix domain socket before calling `close` on the server and client sockets. I'm not sure why this behavior differs between macOS and Linux. The man page for `socket(7)` says that `close(2)` will return immediately and be completed in the background. My guess is that since this behavior isn't explicitly defined, it simply varies between the Linux and BSD implementations of the API. I wrote a simplified, synchronous version of this reverse server in C++ and ran it for over 2 hours without an issue. Before these changes, I could reliably observe a test hang locally within an hour. I'm running this PR's version of the test in an infinite loop in 8 consoles as a stress test to see if I observe it. I'll leave that running while this is in review.

This PR also includes a bunch of extra logging inside the diagnostic server that will make future issues like this easier to diagnose.
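To make the ordering change concrete, here is a hedged C++ sketch of the server-side teardown before and after the fix (function and parameter names are hypothetical; the actual change is in the C# test server, which uses `File.Delete(...)` rather than calling `unlink` directly):

```cpp
// Hypothetical server-side teardown, before and after the fix described above.
#include <unistd.h>

// Before: close first, then unlink. The close wakes the runtime's poll() with
// POLLHUP; if the runtime reconnects before the unlink, that connection is
// orphaned and never serviced.
void TeardownRacy(int listenFd, int connFd, const char* path)
{
    close(connFd);
    close(listenFd);
    unlink(path);   // File.Delete(path) on the C# side
}

// After: unlink first so a reconnecting runtime can never reach the dying
// socket file; its connect() only succeeds once a fresh socket is bound.
void TeardownFixed(int listenFd, int connFd, const char* path)
{
    unlink(path);   // File.Delete(path) on the C# side
    close(connFd);
    close(listenFd);
}
```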
CC @tommcdon @sywhang @dotnet/dotnet-diag
--
Update: 8 instances of the stress test have been running for >4 hours in an infinite loop with no issues encountered. I'm going to stop the stress test, but I feel confident this change has resolved the hang.