
[test] fix hang in macOS variants of reverse diagnostics server tests #40225

Merged
merged 6 commits into from
Aug 5, 2020

Conversation

josalem
Copy link
Contributor

@josalem josalem commented Aug 1, 2020

fix #39979

See #39979 for the gist of what this issue is. The hang in the test appears to come from a race between close, connect, and unlink across the reverse server and the runtime. This only happens on macOS. When the reverse server calls close, it causes the runtime to break out of poll with a POLLHUP event. If the runtime beats the reverse server and calls connect before the server can call unlink on the Unix domain socket, the runtime can successfully open a connection, but that connection is orphaned by the calls to unlink and close and isn't associated with the new socket that the server subsequently binds.

This PR fixes the behavior by unlinking (read: File.Delete(...)ing) the Unix domain socket before calling close on the server and client sockets. I'm not sure why this behavior differs between macOS and Linux. The man page for socket(7) says that close(2) will return immediately and be completed in the background. My guess is that since this behavior isn't explicitly defined, it simply varies between the Linux and BSD implementations of the API. I wrote a simplified and synchronous version of this reverse server in C++ and ran it for over 2 hours without an issue. Before these changes, I could reliably observe a test hang locally within an hour. I'm running this PR's version of the test in an infinite loop in 8 consoles as a stress test to see if I can still observe the hang. I'll leave that running while this is in review.
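
Roughly, the ordering change looks like this (a minimal sketch with raw POSIX calls rather than the actual C# in Reverse.cs, which does the equivalent with File.Delete(...) before Socket.Close(); the path and backlog values are illustrative):

```cpp
// Sketch only: server-side create/teardown for the Unix domain socket.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

int create_listener(const char* path)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, 255);
    return fd;
}

void teardown(int listenFd, int connFd, const char* path)
{
    // Old (racy) order: close first, unlink second.  On macOS the runtime can
    // see POLLHUP from the close and connect() to the still-linked path before
    // unlink() runs, leaving that new connection orphaned:
    //     close(connFd); close(listenFd); unlink(path);

    // Fixed order: remove the filesystem name first so no new connect() can
    // land on the old socket, then close both descriptors.
    unlink(path);
    close(connFd);
    close(listenFd);
}
```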

This PR also includes a bunch of extra logging inside the diagnostic server that will make future issues like this easier to diagnose.

CC @tommcdon @sywhang @dotnet/dotnet-diag

--

Update: 8 instances of the stress test have been running for >4 hours in an infinite loop with no issues encountered. I'm going to stop the stress test, but I feel confident this change has resolved the hang.

@josalem josalem added the os-mac-os-x, test-bug, and area-Tracing-coreclr labels Aug 1, 2020
@josalem josalem added this to the 5.0.0 milestone Aug 1, 2020
@josalem josalem requested review from noahfalk, sywhang and a team August 1, 2020 00:26
@josalem josalem self-assigned this Aug 1, 2020
@sdmaclea
Copy link
Contributor

sdmaclea commented Aug 1, 2020

Your description above seems to refer to C++ documentation, but the issue seems to be fixed in C# code. The behavior of Socket should be specified and stable in C#. If it isn't, this feels like it is in part a Framework bug. But it also seems like we are abusing the sockets... So maybe it is just our issue...

Based on fixing the race, I think the solution proposed seems about right for C++ code. I am just not sure why we created the race ...

Why are we tearing down the socket and recreating it? I'm not very familiar with Unix sockets, but it doesn't match the pattern I have seen for other types of sockets. Generally the initial socket stays open. The accept creates a socket for the connection. When the connection is closed, the connection-specific socket is closed.

Apparently the Unix domain socket uses a named file for the socket. Are we not creating a unique socket name for each connection? How is the connection-specific filename being discovered by the client?

I would assume the well known socket name should only be closed at app shutdown.

@josalem
Copy link
Contributor Author

josalem commented Aug 1, 2020

The BCL ultimately calls the system APIs for things like socket communication and adds a C# API on top of them.

The issue I described above is with the remote reverse server, not the code in the runtime itself. This is effectively a test issue and not a product issue. The runtime is doing the correct thing, but the "correct" thing results in a hang because the system-level APIs are not behaving as expected.

I would not expect a well behaved reverse server to operate like this. I would expect it to behave as you describe, i.e., create the socket once, bind to it, and manage the returned sockets from calls to accept. This test is specifically designed to stress the polling and retry logic of the runtime in the presence of a misbehaving reverse server. In other words, the test server is supposed to be doing the wrong thing and the runtime should be able to handle it.
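
Concretely, a well-behaved server would look something like this (a minimal sketch, not our test code; the path and backlog are illustrative):

```cpp
// Sketch only: conventional Unix domain socket server lifecycle.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

int run_server(const char* path)
{
    unlink(path);                                      // clear any stale name from a previous run
    int listenFd = socket(AF_UNIX, SOCK_STREAM, 0);
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(listenFd, 16) != 0)
        return -1;

    for (;;)
    {
        int conn = accept(listenFd, nullptr, nullptr); // one socket per connection
        if (conn < 0)
            break;
        // ... service the connection on `conn` ...
        close(conn);                                   // only the per-connection socket is closed
    }

    close(listenFd);                                   // the listening socket lives until shutdown
    unlink(path);
    return 0;
}
```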

The race I observed seems to be something unique to macOS and how the socket APIs behave. I'd be curious to see if this behavior repros on other BSD-based operating systems. I would expect the BCL to fall prey to this same race on macOS, since System.Net.Sockets ultimately calls the same APIs as my C++/C# code does. There isn't anything in the BCL that could prevent this, since it has to do with the order in which system-level APIs are called and not the infrastructure (like the async engine) that the BCL adds on top of that.

@josalem
Copy link
Contributor Author

josalem commented Aug 1, 2020

FYI CI is all green, but GH says the live build is still running.

@danmoseley
Copy link
Member

@josalem that is an occasional glitch we have seen; it was reported to GitHub.

@noahfalk
Copy link
Member

noahfalk commented Aug 4, 2020

This is effectively a test issue and not a product issue
the test server is supposed to be doing the wrong thing and the runtime should be able to handle it.

These sound contradictory? : ) If the runtime can't handle something that it should be able to handle, that is the definition of a product issue.

I think the discussion is making one of a few assumptions. I want to make it explicit which one and make sure everyone agrees with the choice:
a) [Test issue] It was never intended for the runtime to work with a tool that does what the test did (prior to this change)
b) [Product Issue] It was intended to make the runtime work with a tool that does what the test did, however the value in doing so is not enough to justify the dev time, risk, or whatever else involved in resolving it
c) [Product issue] It was intended to make the runtime work with a tool that does what the test did, but it is impossible to implement this behavior on macOS
d) something else?

A few other misc thoughts:

  1. None of the examples I see online suggest that a well-written Unix domain socket server would need to delete the socket file prior to closing the socket. If customers write their own servers, they probably aren't going to do it unless we document it and they are diligently reading our docs (a dubious assumption).
  2. A tool that repeatedly creates and destroys the socket in a loop probably isn't something a customer would write. However a tool that calls unlink(), socket(), bind(), listen(), accept() in that order and then crashes and restarts at the beginning of the sequence seems pretty reasonable (see the sketch after this list). In theory doing that in a loop with close() at the end simulates what would happen if that server crashed a lot, or if it crashed at just the right moment in time to cause us trouble. I assume we don't know definitively whether crashing and relaunching reproduces the same problem that we see with looping?
  3. The description of the issue certainly seemed surprising. If I understand correctly, you are saying that poll() returns HUP after the server socket is closed, but that calling connect() immediately afterwards is able to reconnect to the closed socket, that this generates no error, and that future calls to poll() do not return HUP?
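
For point 2, the sequence I have in mind is roughly this (a sketch only, not our test code; names are illustrative):

```cpp
// Sketch only: a server that "crash-loops" -- each iteration simulates a fresh
// start (unlink the well-known name, rebuild the socket) followed by a crash
// (everything closed without unlinking first).
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

void crash_looping_server(const char* path)
{
    for (;;)                                           // each iteration == one crash/restart
    {
        unlink(path);                                  // startup: clear any stale name
        int listenFd = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
        listen(listenFd, 16);
        int conn = accept(listenFd, nullptr, nullptr);

        // "crash": tear down without unlinking the name first
        if (conn >= 0)
            close(conn);
        close(listenFd);
    }
}
```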

Copy link
Member

@noahfalk noahfalk left a comment

Most of the changes look fine, but I am curious about the Listen() change in the test?

src/coreclr/src/debug/inc/diagnosticsipc.h
src/coreclr/src/vm/diagnosticserver.cpp
src/coreclr/src/vm/ipcstreamfactory.cpp
src/tests/tracing/eventpipe/common/Reverse.cs
John Salem and others added 3 commits August 4, 2020 09:26
* IpcStream didn't use the serverSocket param anymore and would cause
  IpcStreams to show up as SERVER types.  This didn't cause any issues,
  but will make more sense in the debugger
* Added back the backlog for the reverse server since that wasn't
  relevant to the issue.
@josalem
Copy link
Contributor Author

josalem commented Aug 5, 2020

@noahfalk, I decided to do some empirical tests to validate my hypothesis. I used the original code with a timeout of 5 minutes on the EventPipe session (instead of 500 ms) and placed breakpoints at various points on both sides of the connection. I was unable to force the behavior to happen while under the debugger, so I ran a stress test of 8 instances in an infinite loop in the background while I debugged both halves of a 9th instance. I was still unable to reproduce the behavior. The 8 instances running in the background, however, all individually reproduced the behavior within the first 5-minute loop. The increase in the EventPipe session time caused the socket-bind-accept-read-close cycle to repeat very quickly for ~5 minutes, which appears to have exacerbated the issue.

I discovered that on macOS the netstat tool can produce a list of all Unix domain socket connections (both ends) on the system. Using the output of netstat -u in combination with lsof -p <pid>, which lists the open connections for a specific process, I was able to verify one thing that lends credence to my conjecture that this is a kernel issue: the socket that the runtime is polling on is orphaned. There is no other process that owns the other half of the socket connection.

Typically, when you read an entry for a Unix domain socket in lsof you will see something like this:

COMMAND   PID    USER   FD     TYPE    DEVICE                   SIZE/OFF    NODE        NAME
corerun 59430 josalem   41u    unix    0xf20d83f7de65b59f       0t0                     ->0xf20d83f7de65b72f

Here the address under DEVICE is the kernel address of the socket object, and the data under NAME is the "thing" it is connected to; in this case, the address of the socket at the other end.

Typically, when I grep the output of netstat -u for either of these addresses, I should find two entries, one for each end of the connection, e.g.,

Active LOCAL (UNIX) domain sockets
Address          Type   Recv-Q Send-Q            Inode             Conn             Refs  Nextref Addr
<address A>      stream      0      0      <address B>                0                0        0 /path/to/unixdomainsocket
<address B>      stream      0      0      <address A>                0                0        0 /path/to/unixdomainsocket

In the hung state that the process ends up in, there is only one entry in netstat -u. Additionally, the Addr field in netstat isn't set to the path of the socket, so it looks like this:

Active LOCAL (UNIX) domain sockets
Address          Type   Recv-Q Send-Q            Inode             Conn             Refs          Nextref Addr
f20d83f7de65b59f stream      0      0                0 f20d83f7de65b72f                0                0

Somehow, we get into a state where no one owns the other end of the valid connection the runtime has. This state reproduces with both the C++ reverse server I wrote and the System.Net.Sockets reverse server I wrote.

Typically, this state is what would cause calls to write to trigger a SIGPIPE and kill the process. We specifically ignore this in the runtime (many socket servers do this), which causes alternate behavior. Specifically, this should cause functions like write to return -1 and set errno to EPIPE. poll's man page on BSD and Linux makes no mention of this EPIPE behavior and doesn't list EPIPE as a failure case. It explains that poll should return >0 and set revents to POLLHUP, which is what it does on platforms other than macOS.

To validate whether this problem is unique to the runtime, I wrote a pair of C++ programs that follow the same pattern as the runtime and the test server, and I was able to reproduce the issue immediately on my Mac. If you don't ignore SIGPIPE in the simplified repro, the runtime's surrogate will be killed with SIGPIPE on both Linux and macOS (this is the default action for SIGPIPE and is expected). Once you ignore SIGPIPE, running on macOS will inevitably hang in the same manner as we've seen with this test, while on an Ubuntu 20.04 machine it runs continuously.
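
The runtime-surrogate half of that repro is roughly this (a minimal sketch, not the actual programs; the path is illustrative):

```cpp
// Sketch only: ignore SIGPIPE, connect to the named socket, then poll() and
// wait for the kernel to report POLLHUP once the server side goes away.
// On Linux the POLLHUP arrives reliably; the macOS hang is this loop
// occasionally spinning on timeouts forever for an orphaned connection.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <poll.h>
#include <csignal>
#include <cstring>

int wait_for_hangup(const char* path)
{
    std::signal(SIGPIPE, SIG_IGN);                     // otherwise a write to the dead peer kills us

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0)
    {
        close(fd);
        return -1;
    }

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    for (;;)
    {
        int n = poll(&pfd, 1, 1000 /* ms */);
        if (n < 0)
            break;                                     // poll error
        if (n > 0 && (pfd.revents & (POLLHUP | POLLERR)))
            break;                                     // peer went away: caller reconnects
        // n == 0: timeout; keep waiting (this is where the hang manifests on macOS)
    }
    close(fd);
    return 0;
}
```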

From this I posit that this is some form of non-deterministic behavior in macOS's implementation of poll, but I'm not sure I can prove that without kernel debugging or a review of the kernel code. This isn't the first time there have been issues with macOS's poll implementation. I found that there were some issues with poll in 10.12 that were later fixed, but they indicate that it may be a weak point.

All that said, and more to your line of questioning, Noah, I don't think this is purely a test issue. There is a small corner case that is macOS-specific where the runtime can non-deterministically hang. I don't believe this to be high risk, though. My observations indicate this only happens on macOS and only in this pathological case. The test is specifically trying to simulate a crash-looping reverse server, and more specifically an extreme case of that. Please let me know if you see this as higher risk. I'll continue to look into this to see if there is a way to prove whether the issue is in the kernel.

@noahfalk
Copy link
Member

noahfalk commented Aug 5, 2020

Nice investigating @josalem! To be pedantic I think your discussion above is picking option (b)?

b) [Product Issue] It was intended to make the runtime work with a tool that does what the test did, however the value in doing so is not enough to justify the dev time, risk, or whatever else involved in resolving it

That feels like a very reasonable choice to me.

@josalem
Copy link
Contributor Author

josalem commented Aug 5, 2020

@noahfalk, yes, you are correct. A very long-winded way to say "option b" 😆. I left these changes running for >8 hours with the test timeout set to 5 minutes and hit no hangs. Barring any objection, I'll merge this in this afternoon.

@josalem josalem merged commit 5c29e14 into dotnet:master Aug 5, 2020
@josalem josalem deleted the dev/josalem/fix-39979 branch August 5, 2020 21:39
@BruceForstall
Copy link
Member

@josalem Is #38801 still an issue after this change? (I noticed the issue is still open and the test is disabled under GCStress)

@josalem
Copy link
Contributor Author

josalem commented Aug 5, 2020

No, this change does not fix it. That is an unrelated issue that is due to the GC and the Thread Pool trying to take the same lock under GCStress. I'm okay leaving the test off to avoid the assert, but I think the thread pool folks or GC folks should look at the issue in case it could happen outside GCStress.

Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
lateralusX added a commit to lateralusX/runtime that referenced this pull request Aug 26, 2020
Port of dotnet#40225.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020