fixes #67 DialPipe problem with multiple calls / waiting for busy pipe #80
Conversation
… busy pipe

This changes a few things to try to ensure that we never wind up with a result of ERROR_FILE_NOT_FOUND, due to a race between closing the last pipe instance and opening the next.

First, we keep an "open" client instance (unused) while the listener is open, so that we are guaranteed to always have an active pipe instance. This means attempts to open while no other instances exist result in ERROR_PIPE_BUSY instead of ERROR_FILE_NOT_FOUND.

Second, we have changed the dialing loop to eliminate a race condition that is more or less inherent in WaitNamedPipe when synchronizing with CreateFile. The real timeout needs to be some value larger than the WaitNamedPipe timeout, and furthermore WaitNamedPipe is not very nice to the Go runtime, since it is a blocking system call. Instead we just put the goroutine to sleep for 10 milliseconds and keep retrying the CreateFile until the maximum timeout is reached. If no timeout is specified, we assume a reasonable and large default of 5 seconds, which is similar to a TCP connection timeout.

This isn't perfect: a client attempting to connect to an extremely busy pipe server can be starved out by other clients coming in while it is in that brief sleep, but this potential race was already present with WaitNamedPipe. The numerous retries (by default 500!) mean it's pretty unlikely to occur, and if a single client hits the race once, it has an excellent chance of getting in on the next cycle. (A real "fix" that is completely race-free and fair would require changes in the underlying Named Pipe implementation, or some other kind of external coordination.)
Hmm... I probably would have liked to add my own copyright notice to the code (still MIT licensed). If it's possible to do that somewhere, the text would be more or less to the effect of: Portions Copyright 2018 Garrett D'Amore [email protected] (or skip the word "Portions"). If this is too much of a hassle, then don't worry about it.
This probably also fixes #46, since WaitNamedPipe was the only blocking code remaining.
FYI, I was able to reproduce the original issue by running my test version of Portainer on Windows Server, version 1803, and I cannot see it anymore on a version that contains the content from this PR (more detail in that Portainer PR). Any chance to get this one merged?
I can confirm that this PR helps Portainer work stably with the named pipe bind-mounted into a Windows container running in a Docker swarm on Windows Server 1803 with Docker EE 18.03.1-ee-1.
This looks like the most robust fix given how named pipes work on Windows.
@jhowardmsft @darstahl I can see that both of you have merged changes to this repo earlier. It looks like @jstarks is away.
Is it time to merge this pull request? @johnstep
I do not have permission to merge this.
@jstarks, is it time to merge this pull request?
Hard to believe I submitted this 19 days ago; the silence from the repo maintainers is deafening.
Sorry for the delays. I'm looking at this now. I want to write a quick test to understand the behavior of something before I merge this.
OK, I think this pull request is actually three changes:
1. Increasing the default timeout. I understand why increasing the timeout may be useful, but this seems like something the client can easily do without a change to go-winio. I do suspect that the default 50ms is probably too short to be useful given the problems with fairness, but I'm worried that extending it to a full 5 seconds might affect existing clients negatively. Maybe a compromise would be something like 250ms.

2. Removing WaitNamedPipe. WaitNamedPipe has its problems. As you mention, it blocks the OS thread (which go-winio generally tries to avoid), it has various races, it does not guarantee that the client will win the race and be able to connect to the pipe, and most importantly for #67, it does not work reliably inside containers. It does have the advantage, though, that it wakes up immediately when a connection is available. I wonder if a better change wouldn't be to keep the WaitNamedPipe call, but if it fails for any reason other than timeout, to sleep for 10ms and loop around again. That would probably resolve #67 and other race conditions. This suggestion doesn't fix the issue that WaitNamedPipe blocks the OS thread, of course. If this is important to resolve, I would suggest making a change to issue the wait asynchronously via the underlying FSCTL instead.

3. Keeping a client handle open. I don't understand the third change. I can't see any behavior differences whether I keep the original client handle open or not. Unless you are aware of a behavior that I am not, I think it's better to close that client handle to avoid unnecessary resource consumption.
@jstarks thanks for the good comments. I changed the default timeout to 250 ms and added client handle closing in this commit: https://github.com/olljanat/go-winio/commit/972aaec17501edc2ae66b43541f799eec50cf7c5 and did some testing with Portainer; I can tell that this combination at least still fixes #67. @gdamore I think it is better if you comment on the suggestion to keep using WaitNamedPipe?
I think using WaitNamedPipe is probably more trouble than it's worth. In theory it would be better with the immediate wake-up, but as it doesn't actually correct anything, and as it blocks the caller, I think it's better not to use it and to aim for simplicity.

5s was chosen to match TCP timeouts, making this behave more like other net dial implementations. That said, it is trivial to override, as it is just a default, so I am happy with whatever you think is best there.

Keeping the old pipe open was intended to make it easier to distinguish between the two cases of a listener not existing at all and a listener that has not yet accepted the connection. In the former case you want to fail fast without a timeout. In the latter you should keep trying, because presumably *something* is there and you have just missed it.

Again, the semantic I am trying to provide is more like TCP or UNIX domain sockets, which both behave this way.
#75 should be closed IMO; it's just dead wrong. This one may have an issue that I had not considered here, but which I encountered while working at the C level in another library, so it may need work.
Actually, rereading the code (and refreshing my memory): we do open the instance with a client, meaning we keep two connections alive, so I think this is OK.

There is a really subtle possible race here, which is that if a client hits us right after we create the named pipe, and before our own client can connect, we'll wind up failing the bind, as the client connection fails. This means that a client has to race against the server doing listen. A fix for that, which I would be happy to follow up with, would be to check on the server side whether the connected client is us, and to disconnect the remote client if it isn't. We can detect that pretty easily by setting a flag in the listener when our client connects. I believe that the changes here are better than what we had before, and the above refinement would just be a further improvement.

The changes in #84 are architecturally identical to what I've used in another software stack to work around the same problems. Having said that, the code there is quite a bit more complex and harder to parse, and I really don't like that the transient error from CreateNamedPipe is deferred to the next Accept. (In my other code, I simply forcibly disconnect the client if this occurs.) Note that #84 doesn't address all of the problems I've addressed here; specifically, the use of WaitNamedPipe is problematic, and we can race and lose on the client side. The changes in #84 do address the concern of keeping our stake on the pipe, which is one of the elements also fixed by my changes here.

The upshot is that #84 is architecturally acceptable to me as a fix for one of these issues, but needs some further work IMO. It is, however, incomplete with respect to the full dimensions of the problems. As indicated, the changes here are also incomplete (a niggling possible race at Listen() time), but easily correctable.
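The follow-up fix described above (set a flag in the listener when our own bootstrap client connects, and disconnect a foreign client that wins the race at Listen() time) could look roughly like this. The `onConnect`/`isOwn` shape is entirely hypothetical, reduced to just the flag logic; how the server recognizes its own connection is left abstract:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// listener models just the piece of state the fix needs: whether our
// own bootstrap client has connected yet.
type listener struct {
	ownClientSeen atomic.Bool
}

// onConnect decides what to do with a newly connected client. isOwn is
// a stand-in for however the server would recognize its own bootstrap
// connection (e.g. a flag set around the internal dial).
func (l *listener) onConnect(isOwn bool) string {
	if isOwn {
		l.ownClientSeen.Store(true)
		return "hold" // keep the placeholder instance open
	}
	if !l.ownClientSeen.Load() {
		// A foreign client beat our bootstrap client to the first
		// instance: disconnect it instead of failing the bind.
		return "disconnect"
	}
	return "accept"
}

func main() {
	l := &listener{}
	fmt.Println(l.onConnect(false)) // stranger wins the race: disconnect
	fmt.Println(l.onConnect(true))  // our bootstrap client: hold
	fmt.Println(l.onConnect(false)) // normal client afterwards: accept
}
```

The key point is that the bind no longer fails when a stranger wins the race; the stranger is simply disconnected and will retry, landing on a later instance.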
I'm somewhat disinclined to invest further in this without clearer signals that the work is likely to be useful and integrated -- I don't want to spend cycles on a PR that is just going to get dropped in favor of a different approach (or in favor of nothing at all; although, if nothing gets integrated, I'll need to fork this for my own software).
Hmm, I have a question. If the client handle is closed, but we haven't called DisconnectNamedPipe nor closed the server handle, is the server pipe instance still retained and busy (so that no new client handles will connect to it, and so that we won't get ERROR_FILE_NOT_FOUND in CreateFile on the client)? If so, then closing the file handle would be perfectly reasonable, and there would be no need to retain the client handle. A better comment explaining this would be helpful. There is still that race condition I mentioned, where some other client connects before we do. That would be unfortunate, but again would be easily fixed in a follow-up.
What I've observed (and I think we have tests to confirm) is that keeping the server handle open is sufficient to retain ownership of the pipe and to ensure that clients get ERROR_PIPE_BUSY, so keeping the client handle open is not necessary. Edit: agreed that a better comment would be useful here.
I'm inclined to take this change with the following tweaks:

1. Reduce the default timeout from 5 seconds to 2 seconds.
2. Close the client handle once the listener is set up, since keeping the server handle open is sufficient.

In the future I would like to reintroduce the WaitNamedPipe behavior using the FSCTL to avoid blocking the thread, but I don't think we need to hold this change for that now. Agreed that there is an existing race in Listen() that this change doesn't fix; we can defer that to another change.
This sounds great! If you integrate this change, then I will follow up with a PR to fix that Listen() race this week.
Do you want me to follow up with a modification to this PR for the above alterations (2 second timeout, close client handle), or will you address it at your end?
If you have the time, I'd appreciate it.
Ok, coming shortly.
We can safely close the client handle, because the server side is not closed, and this keeps our hold on the pipe instance.
Done. It would be good to test it before integrating. :-)
Thanks! I'll take a look a little later today and merge it. If @olljanat has time to validate it in his workload, that would be useful.
The timeout value was changed from 5 to 2 seconds in microsoft#80
(I believe this change also should be used instead of PR #75 -- that PR is potentially extraordinarily buggy.)