-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry named pipe connection on race condition #75
Retry named pipe connection on race condition #75
Conversation
If the named pipe server does not have an outstanding call to `ConnectNamedPipe()` at the time the client calls `CreateFile()` or `WaitNamedPipe()`, the client can receive `ERROR_FILE_NOT_FOUND` and `ERROR_BAD_PATHNAME` errors. When that's the case, there is no choice but to retry after a short delay in the hope that next time, the server will be able to accept the connection. This error occurs infrequently in low traffic situations, but it manifests quite reliably when clients open concurrent connections and/or open many short-lived connections in a short period of time.
@@ -159,6 +184,10 @@ func DialPipe(path string, timeout *time.Duration) (net.Conn, error) { | |||
} | |||
err = waitNamedPipe(path, ms) | |||
if err != nil { | |||
if shouldRetry(err) { | |||
time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond) | |||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we continue here, do we have to close h
to avoid a handle leak?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I misunderstood, CreateFile()
only allocates a handle when it succeeds and we never reach this path if it does:
for {
h, err = createFile(path, ...)
if err == nil {
break
}
...
err = waitNamedPipe(path, ...)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Now I see it (on a bigger screen than my mobile phone, haha :-) ).
@StefanScherer I don't mean to put any pressure, but I'm hoping that this can get merged sometime soon. Please let me know if I can do anything to help speed up the process. |
Maybe @jstarks can have a look at your PR to improve stability opening a named pipe. 🙏 |
@jstarks I'm still hoping you might have a chance to review this sometime soon :-) |
Again, I don't mean to put any pressure, but this issue is blocking Windows support of the Terraform provider for Docker. I also (albeit rarely) encounter this issue with regular Docker usage on Windows when running scripts that quickly issue a series of Docker commands. If this fix is not suitable, I'll gladly improve it according to your suggestions, but I'd like to somehow make opening the named pipe more resilient in the near future :-) |
@jstarks @jhowardmsft @darstahl any update on this topic? What's the status of the |
This issue is responsible for failures in my test suite at AppVeyor, testing an IPC based transport for mangos (see github.com/nanomsg/mangos-v1/pull/318 and https://ci.appveyor.com/project/gdamore/mangos/build/1.0.187 ) |
These changes feel wrong to me. First off, there are legitimate reasons why you would get ERROR_FILE_NOT_FOUND or ERROR_BAD_PATHNAME, which indicate that truly there is nothing at the other end. Retrying forever is clearly incorrect. I'll comment on the diffs. |
if err == nil { | ||
break | ||
} | ||
if shouldRetry(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to me like it creates an infinite loop if there really is no pipe server present. That's bad.
There needs to be something to catch this. Probably what you could do is move this check to after the timekeeping validation.
@@ -159,6 +184,10 @@ func DialPipe(path string, timeout *time.Duration) (net.Conn, error) { | |||
} | |||
err = waitNamedPipe(path, ms) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd have restructured this so that you always loop around to CreateFile.
You don't even need to check for errors at all, because CreateFile does that.
What you do need to do is bail if the time is after the absolute timeout above. (Like 156/181). In fact, If the time out above has expired, you should simply bail, not go down into waitNamedPipe again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put another way, all waitNamedPipe() is doing for you is putting you to sleep a bit so you don't spin hard. The consequences of its error returns are simply not very interesting. In some other project I didn't even use this call at all, and just did a 10 ms sleep between retries...
@gdamore Thanks for the feedback.
Yes, that's pretty obvious to anybody from the name of the constants. However, the named pipe API makes no difference between "you caught the server at a bad time" and "there really is nothing at the other end". I honestly can't see how to retry to hide the former without changing the meaning of the latter.
Thanks for spotting this, it's clearly a major issue. Also, thanks for the suggestions, that loop structure sounds better than what I have here. We're now 6 weeks after I posted this and there has been no feedback until now. Is there any chance applying this feedback will get the PR merged? If not, then I'm unlikely to put more work into this. |
That's disappointing. I wonder if I should switch to npipe instead... I was hoping that being from Microsoft that this would get a bit more "official" support. |
I'd offer to write up the necessary code changes myself (heck almost tempted to fork this project), but if I do it that means getting a CLA, which is a PITA and probably takes longer. Someone should tell Microsoft that CLAs are bad juju in the open source world. (If buying GitHub means that there will be more CLA stuff on GitHub, that's a bad thing.) |
Having said that, I just read Microsoft's CLA, and its pretty benign. It doesn't seem to do anything other than reassert the terms in the license (but adds a patent grant clause, which an Apache license would do anyway), along with an assertion that the IP you're contributing is yours to do so with. Specifically it doesn't seem to grant microsoft ownership of the IP itself. |
The lack of activity here is depressing. Maybe it's time to fork this project? On the one hand its only been a about a month and a half since the last commit to this project. On the other hand... we're talking about a Microsoft repository. They should care about stuff carrying their name, and there should be more than one employee there with the ability to at least comment on this stuff. |
Actually, what I'm going to do is fork this, and submit a new PR that properly addresses this. I'm pretty sure that if you implement things correctly, you will get ERROR_FILE_NOT_FOUND only if the pipe hasn't had any instances created. If the code is done correctly, there should always be an instance waiting around. (This may mean that we need the listener to be modified to do that... I've got code that does this for my own NNG C based project.) Then you only need to treat ERROR_PIPE_BUSY as a transient error that you retry. |
@gdamore Feel free to keep us updated here |
I have a new theory. Which is kind of ugly, but I think it may work. The best way to solve this would have been if CreateFile didn't return success unless a matching ConnectNamedPipe was made, but it appears that the Win32 API is designed to match CreateFile with CreateNamedPipe instead of ConnectNamedPipe. I suspect that the reason we are seeing problems in containers is different handling of closed pipes -- the original code opens a named pipe client to connect it with FirstHandle. But then closes it. I think this actually leaves the pipe not in listening state, leading to no active pipes at some times. As a workaround, I think simply not closing the scratch pipe that was opened would do the trick. That would then need to be closed only when we are closing the listener. Stay tuned. |
Its worse, because the use of WaitNamedPipe is indeed racy. The semantic of specifying 0 to get a server determined timeout is highly problematic, since there isn't really a way to make that work reliably that isn't racy in the existing Win32 API. |
So what I'm doing is creating a "default" timeout of 5 seconds, and spinning at 10 msec intervals. This is still not perfect because a super-busy pipe server can "miss" connections (one of the connection requests can timeout due to starvation), but its really the best we can do without a listen queue like TCP has. 5 seconds was chosen as a default to improve the likelihood of success, and 10 ms represents something approaching a clock tick on most platforms. |
Please have a look at #80 -- I took a rather different approach, and I would like to have folks try it out to see if it makes their lives better. |
Closing in favor of #80. |
If the named pipe server does not have an outstanding call to
ConnectNamedPipe()
at the time the client callsCreateFile()
orWaitNamedPipe()
, the client can receiveERROR_FILE_NOT_FOUND
andERROR_BAD_PATHNAME
errors. When that's the case, there is no choice but to retry after a short delay in the hope that next time, the server will be able to accept the connection.This error occurs infrequently in low traffic situations, but it manifests quite reliably when clients open concurrent connections and/or open many short-lived connections in a short period of time.
Fixes #67.