Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry named pipe connection on race condition #75

Closed
wants to merge 1 commit into from
Closed

Retry named pipe connection on race condition #75

wants to merge 1 commit into from

Conversation

AndreLouisCaron
Copy link

If the named pipe server does not have an outstanding call to ConnectNamedPipe() at the time the client calls CreateFile() or WaitNamedPipe(), the client can receive ERROR_FILE_NOT_FOUND and ERROR_BAD_PATHNAME errors. When that's the case, there is no choice but to retry after a short delay in the hope that next time, the server will be able to accept the connection.

This error occurs infrequently in low traffic situations, but it manifests quite reliably when clients open concurrent connections and/or open many short-lived connections in a short period of time.

Fixes #67.

If the named pipe server does not have an outstanding call to
`ConnectNamedPipe()` at the time the client calls `CreateFile()`
or `WaitNamedPipe()`, the client can receive `ERROR_FILE_NOT_FOUND`
and `ERROR_BAD_PATHNAME` errors.  When that's the case, there is
no choice but to retry after a short delay in the hope that next
time, the server will be able to accept the connection.

This error occurs infrequently in low traffic situations, but it
manifests quite reliably when clients open concurrent connections
and/or open many short-lived connections in a short period of
time.
@msftclas
Copy link

msftclas commented Apr 21, 2018

CLA assistant check
All CLA requirements met.

@@ -159,6 +184,10 @@ func DialPipe(path string, timeout *time.Duration) (net.Conn, error) {
}
err = waitNamedPipe(path, ms)
if err != nil {
if shouldRetry(err) {
time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
continue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we continue here, do we have to close h to avoid a handle leak?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I misunderstood, CreateFile() only allocates a handle when it succeeds and we never reach this path if it does:

for {
    h, err = createFile(path, ...)
    if err == nil {
        break
    }
    ...
    err = waitNamedPipe(path, ...)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Now I see it (on a bigger screen than my mobile phone, haha :-) ).

@AndreLouisCaron
Copy link
Author

@StefanScherer I don't mean to put any pressure, but I'm hoping that this can get merged sometime soon. Please let me know if I can do anything to help speed up the process.

@StefanScherer
Copy link

Maybe @jstarks can have a look at your PR to improve stability opening a named pipe. 🙏

@AndreLouisCaron
Copy link
Author

@jstarks I'm still hoping you might have a chance to review this sometime soon :-)

@AndreLouisCaron
Copy link
Author

Again, I don't mean to put any pressure, but this issue is blocking Windows support of the Terraform provider for Docker.

I also (albeit rarely) encounter this issue with regular Docker usage on Windows when running scripts that quickly issue a series of Docker commands.

If this fix is not suitable, I'll gladly improve it according to your suggestions, but I'd like to somehow make opening the named pipe more resilient in the near future :-)

@deviantony
Copy link

@jstarks @jhowardmsft @darstahl any update on this topic? What's the status of the go-winio library overall ?

@gdamore
Copy link
Contributor

gdamore commented Jun 7, 2018

This issue is responsible for failures in my test suite at AppVeyor, testing an IPC based transport for mangos (see github.com/nanomsg/mangos-v1/pull/318 and https://ci.appveyor.com/project/gdamore/mangos/build/1.0.187 )

@gdamore
Copy link
Contributor

gdamore commented Jun 8, 2018

These changes feel wrong to me.

First off, there are legitimate reasons why you would get ERROR_FILE_NOT_FOUND or ERROR_BAD_PATHNAME, which indicate that truly there is nothing at the other end. Retrying forever is clearly incorrect.

I'll comment on the diffs.

if err == nil {
break
}
if shouldRetry(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to me like it creates an infinite loop if there really is no pipe server present. That's bad.

There needs to be something to catch this. Probably what you could do is move this check to after the timekeeping validation.

@@ -159,6 +184,10 @@ func DialPipe(path string, timeout *time.Duration) (net.Conn, error) {
}
err = waitNamedPipe(path, ms)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd have restructured this so that you always loop around to CreateFile.
You don't even need to check for errors at all, because CreateFile does that.

What you do need to do is bail if the time is after the absolute timeout above. (Like 156/181). In fact, If the time out above has expired, you should simply bail, not go down into waitNamedPipe again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put another way, all waitNamedPipe() is doing for you is putting you to sleep a bit so you don't spin hard. The consequences of its error returns are simply not very interesting. In some other project I didn't even use this call at all, and just did a 10 ms sleep between retries...

@AndreLouisCaron
Copy link
Author

@gdamore Thanks for the feedback.

First off, there are legitimate reasons why you would get ERROR_FILE_NOT_FOUND or ERROR_BAD_PATHNAME, which indicate that truly there is nothing at the other end

Yes, that's pretty obvious to anybody from the name of the constants. However, the named pipe API makes no difference between "you caught the server at a bad time" and "there really is nothing at the other end". I honestly can't see how to retry to hide the former without changing the meaning of the latter.

This looks to me like it creates an infinite loop if there really is no pipe server present. [...] Probably what you could do is move this check to after the timekeeping validation. [...] What you do need to do is bail if the time is after the absolute timeout above.

Thanks for spotting this, it's clearly a major issue. Also, thanks for the suggestions, that loop structure sounds better than what I have here.


We're now 6 weeks after I posted this and there has been no feedback until now. Is there any chance applying this feedback will get the PR merged? If not, then I'm unlikely to put more work into this.

@gdamore
Copy link
Contributor

gdamore commented Jun 8, 2018

That's disappointing. I wonder if I should switch to npipe instead... I was hoping that being from Microsoft that this would get a bit more "official" support.

@gdamore
Copy link
Contributor

gdamore commented Jun 9, 2018

I'd offer to write up the necessary code changes myself (heck almost tempted to fork this project), but if I do it that means getting a CLA, which is a PITA and probably takes longer. Someone should tell Microsoft that CLAs are bad juju in the open source world. (If buying GitHub means that there will be more CLA stuff on GitHub, that's a bad thing.)

@gdamore
Copy link
Contributor

gdamore commented Jun 9, 2018

Having said that, I just read Microsoft's CLA, and its pretty benign. It doesn't seem to do anything other than reassert the terms in the license (but adds a patent grant clause, which an Apache license would do anyway), along with an assertion that the IP you're contributing is yours to do so with. Specifically it doesn't seem to grant microsoft ownership of the IP itself.

@gdamore
Copy link
Contributor

gdamore commented Jun 24, 2018

The lack of activity here is depressing. Maybe it's time to fork this project? On the one hand its only been a about a month and a half since the last commit to this project. On the other hand... we're talking about a Microsoft repository. They should care about stuff carrying their name, and there should be more than one employee there with the ability to at least comment on this stuff.

@gdamore
Copy link
Contributor

gdamore commented Jun 24, 2018

Actually, what I'm going to do is fork this, and submit a new PR that properly addresses this. I'm pretty sure that if you implement things correctly, you will get ERROR_FILE_NOT_FOUND only if the pipe hasn't had any instances created. If the code is done correctly, there should always be an instance waiting around. (This may mean that we need the listener to be modified to do that... I've got code that does this for my own NNG C based project.) Then you only need to treat ERROR_PIPE_BUSY as a transient error that you retry.

@deviantony
Copy link

@gdamore Feel free to keep us updated here

@gdamore
Copy link
Contributor

gdamore commented Jun 24, 2018

I have a new theory. Which is kind of ugly, but I think it may work. The best way to solve this would have been if CreateFile didn't return success unless a matching ConnectNamedPipe was made, but it appears that the Win32 API is designed to match CreateFile with CreateNamedPipe instead of ConnectNamedPipe.

I suspect that the reason we are seeing problems in containers is different handling of closed pipes -- the original code opens a named pipe client to connect it with FirstHandle. But then closes it. I think this actually leaves the pipe not in listening state, leading to no active pipes at some times.

As a workaround, I think simply not closing the scratch pipe that was opened would do the trick. That would then need to be closed only when we are closing the listener.

Stay tuned.

@gdamore
Copy link
Contributor

gdamore commented Jun 24, 2018

Its worse, because the use of WaitNamedPipe is indeed racy. The semantic of specifying 0 to get a server determined timeout is highly problematic, since there isn't really a way to make that work reliably that isn't racy in the existing Win32 API.

@gdamore
Copy link
Contributor

gdamore commented Jun 24, 2018

So what I'm doing is creating a "default" timeout of 5 seconds, and spinning at 10 msec intervals. This is still not perfect because a super-busy pipe server can "miss" connections (one of the connection requests can timeout due to starvation), but its really the best we can do without a listen queue like TCP has. 5 seconds was chosen as a default to improve the likelihood of success, and 10 ms represents something approaching a clock tick on most platforms.

@gdamore
Copy link
Contributor

gdamore commented Jun 24, 2018

Please have a look at #80 -- I took a rather different approach, and I would like to have folks try it out to see if it makes their lives better.

@gdamore
Copy link
Contributor

gdamore commented Jul 13, 2018

I recommend the submitter to close this PR. It clearly is wrong, and I believe #80 is far superior. I'm worried that the presence of this open PR may hinder the timely acceptance of #80. I think we all want to see forward progress here.

@AndreLouisCaron
Copy link
Author

Closing in favor of #80.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants