os/exec: TestContextCancel flaky (on Windows?) #17245
Comments
FWIW, I looked into this. TestContextCancel starts a subprocess that prints strings from stdin back to stdout. It writes "echo" to the process once. Then it calls cancel() to kill that subprocess, and then it keeps writing "echo" to the subprocess until finally a write fails. It assumes that the write failing means the subprocess has died. Then it closes the pipe it was writing on. All this time there was a goroutine reading the subprocess's standard output and saving it by calling t.Log. The main goroutine waits for that reading goroutine to see EOF (again this should happen because the subprocess died). Now that we're doubly sure that the subprocess died, the main goroutine calls c.Wait. That should report the unsuccessful exit of the subprocess, but in the example you posted it gets nil instead. I have no idea why CL 29700 (use RtlGenRandom instead of CryptGenRandom) would make this fail more often. It's true that the actual c.Process.Kill happens asynchronously after cancel returns, but that kill should be what makes the write fail, and we don't do any of the other teardown until after the write fails. If the write failed for some reason other than the subprocess dying, then closing the writing pipe and waiting for the EOF on the reading side could possibly tear down the subprocess normally. You could imagine adding
to see exactly what error write gets. However, the read side is logging all the messages it gets back, and in the case of a graceful shutdown it should have gotten back the very first "echo" (sent before the cancel), and in the transcript there is nothing logged by the reading goroutine. This suggests the subprocess really did die, and c.Wait somehow failed to report that fact correctly. Looking at package os's (*Process).wait, I can't see how that could happen. However, this comment is concerning:
It really should not be the case that the process is not dead when WaitForSingleObject returns. On top of that, if the process were not dead, then GetExitCodeProcess would return ec=259 (STILL_ACTIVE), which would end up becoming a non-nil error from c.Wait. But c.Wait returned nil. So probably the process really was dead at least in this case. On the other hand, if somehow GetExitCodeProcess failed (returned false) but we didn't notice the failure, then maybe ec is uninitialized, so zero, which looks like success. I notice that runtime.asmstdcall returns a uintptr, which is the whole AX register after the DLL call. I wonder if maybe the functions that return a bool only guarantee to set the bottom 8 bits of that register (just AL), in which case the test r1 == 0 in syscall.GetExitCodeProcess would more properly be uint8(r1) == 0. If the high bits had garbage in them and were not expected to be used, that could manifest as a false "true" result. Whether this happened would probably be highly dependent on the specific code being called and what that code happens to have done or not done with the full AX register. If you can reproduce the problem on demand, maybe try uint8(r1) in GetExitCodeProcess and see if that changes anything. Otherwise I'm out of ideas. |
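The hypothesis in the comment above (that only the low 8 bits of the return register are meaningful for a boolean result) can be illustrated in isolation. This is only a sketch of the arithmetic, not the code in package syscall: the function names and the sample register value are made up for illustration, and a later comment in this thread disputes the theory on the grounds that BOOL is a C int.

```go
package main

import "fmt"

// failedFullWord mirrors the existing style of check described above:
// the call is treated as failed only if the whole returned word is zero.
func failedFullWord(r1 uintptr) bool {
	return r1 == 0
}

// failedLowByte is the variant suggested in the comment: look only at
// the low 8 bits (AL) and ignore whatever is in the higher bits.
func failedLowByte(r1 uintptr) bool {
	return uint8(r1) == 0
}

func main() {
	// Hypothetical register contents: low byte is 0 ("false"),
	// but the upper bits contain leftover garbage.
	r1 := uintptr(0xABCD00)

	fmt.Println(failedFullWord(r1)) // false: the failure goes unnoticed
	fmt.Println(failedLowByte(r1))  // true: the failure is detected
}
```

If the theory held, the first check would report success while the exit code variable was never written, which would leave it at zero and look like a clean exit.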
I can still reproduce this with +db82cf4:
occasionally, if I run test in a loop:
I do not buy your GetExitCodeProcess return code theory. GetExitCodeProcess returns BOOL, and BOOL is C int
https://msdn.microsoft.com/en-us/library/windows/desktop/aa383751(v=vs.85).aspx And if I start using Alex |
This has been happening pretty regularly on our new Windows XP builders, @rsc. https://build.golang.org/log/929cb405f2485a5071853e9912feb3ddcec91ab1 Also, very similar: And with both: |
/cc @ianlancetaylor for any thoughts |
I have seen this happen a lot, but always on Windows XP, never on Windows 7 or 10. I spent some time debugging this, but not a lot, given this only affects Windows XP users. I am certain there is a bug in our code. Alex |
Change https://golang.org/cl/84175 mentions this issue: |
@bradfitz You may be onto something with that sleeping code. MSDN says that |
@ianlancetaylor, interesting! Good find. Will experiment. (That CL wasn't meant to fix this issue. I just discovered it along the way.) |
I found some discussion on the internet that maybe we shouldn't be closing the thread handle from CreateProcess right away. In syscall/exec_windows.go we do:
pi := new(ProcessInformation)
flags := sys.CreationFlags | CREATE_UNICODE_ENVIRONMENT
if sys.Token != 0 {
	err = CreateProcessAsUser(sys.Token, argv0p, argvp, nil, nil, true, flags, createEnvBlock(attr.Env), dirp, si, pi)
} else {
	err = CreateProcess(argv0p, argvp, nil, nil, true, flags, createEnvBlock(attr.Env), dirp, si, pi)
}
if err != nil {
	return 0, 0, err
}
defer CloseHandle(Handle(pi.Thread))
return int(pi.ProcessId
Note that CreateProcess returns both a thread handle and a process handle, but we discard the thread handle. @alexbrainman, thoughts? |
http://www.cplusplus.com/forum/windows/48196/#msg261691 says:
That wording with "handles" plural makes me think they mean both thread+process handles. |
If you are talking about this fragment
from http://www.cplusplus.com/forum/windows/48196/ then the fragment has a bug - it closes the process handle
We don't just "discard" the thread handle, we close the handle. I don't see the problem with that. According to https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425(v=vs.85).aspx : "... The only new interesting thing to this discussion is in http://www.cplusplus.com/forum/windows/48196/#msg261805 CodeMonkey suggests to check for Alex |
Related? haskell/process#77 |
@johnsonj, totally! Good find. |
Piecing together a lot of stuff here, but let me know if this makes sense: A child process (created by syscall.StartProcess) may spawn its own children. A parent process can exit before its child has exited. Waiting for a process spawned by Go to complete (os.Process.wait()) does not guarantee that all of that process's children have shut down (and released their resources). For example:
To wait for an entire process tree to exit we need to create a job (CreateJobObject) that will be inherited by all children and is set to terminate when all processes exit (JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE) and create a completion port (CreateIoCompletionPort) that is signaled when the job is completed. When we spawn the process (CreateProcess) we need to start it suspended (CREATE_SUSPENDED), associate it with the job (AssignProcessToJobObject), then finally start its execution (ResumeThread). During Process.wait() we wait for the signal from the completion port (GetQueuedCompletionStatus instead of WaitForSingleObject). Once we receive JOB_OBJECT_MSG_EXIT_PROCESS we can call GetExitCodeProcess and return. Issues (potentially) fixed: #17245, #23171 If that sounds reasonable I can take a crack at implementing it. Sounds neat. |
Sounds plausible! Definitely worth implementing to see where it gets us. If that'd eliminate those test flakes and also remove that 5ms sleep in Wait, that'd be great. |
Change https://golang.org/cl/84896 mentions this issue: |
Just some rambling ... I am not sure that we should change os.Process.Wait to wait for "all child process children to exit". I don't see the os.Process.Wait documentation describe that. Imagine a child process that starts its own child (that never exits) and then exits immediately - I expect os.Process.Wait will return immediately now, but will hang forever if we change that behavior. Is it OK to change the behavior? If the solution turns out to be waiting for grandchildren processes, then the solution needs to be in cmd/go and not in os or os/exec. Alex |
As far as I can tell, this went from never happening on the XP builders to happening 100% of the time with CL 81895:
findflakes says 53%, but manually sampling the 47% where it "didn't" fail shows that it failed earlier and just didn't run this test at all. It seems unlikely that CL really affected this test, but the change caused by that CL may be useful in debugging the failure. |
@johnsonj, what's the status of this? |
I've been AFK for the holidays but can prototype this more next week. @alexbrainman's point is spot on that we don't want to unconditionally wait on all child processes to exit, so if this solution proves to fix the flake we'll need some sort of option for the tests to use. |
It's too late in Go 1.10 to make big changes to how process management works on Windows anyway, so moving to Go 1.11. But we'd really like to see this fixed and to remove that 5ms sleep. |
Change https://golang.org/cl/87257 mentions this issue: |
Updates #17245 Change-Id: I3d7ea362809040fbbba4b33efd57bf2d27d4c390 Reviewed-on: https://go-review.googlesource.com/87257 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>
Change https://golang.org/cl/94255 mentions this issue: |
Per the notice in the Go 1.10 release notes, this change drops the support for Windows Vista or below (including Windows XP) and simplifies the code for the sake of maintenance. There is one exception to the above. The code related to DLL and system calls still remains in the runtime package. The remaining code will be refined and used for supporting upcoming Windows versions in future. Updates #17245 Fixes #23072 Change-Id: I9e2821721f25ef9b83dfbf85be2b7ee5d9023aa5 Reviewed-on: https://go-review.googlesource.com/94255 Run-TryBot: Brad Fitzpatrick <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Brad Fitzpatrick <[email protected]>
Since Windows XP is not supported any more, maybe this issue should be closed? |
We can close this bug if there's another bug tracking removing the 5ms sleep on Windows. |
There isn't a bug tracking the removal afaik |
There is now. I filed #25965 Will close this bug. |
What version of Go are you using (go version)?
go version devel +e6143e1 Mon Sep 26 01:51:31 2016 +0000 windows/386
What operating system and processor architecture are you using (go env)?
set GOARCH=386
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=386
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=c:\dev
set GORACE=
set GOROOT=c:\dev\go
set GOTOOLDIR=c:\dev\go\pkg\tool\windows_386
set CC=gcc
set GOGCCFLAGS=-m32 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\DOCUME~1\brainman\LOCALS~1\Temp\go-build338345670=/tmp/go-build -gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1
What did you do?
I run all.bat.
What did you expect to see?
all.bat run to successful completion.
What did you see instead?
This happens only occasionally. I discovered this while working on CL 29700 which makes this test fail more often.
If I understand TestContextCancel correctly, there is an expected race there between the goroutine killing the process and another goroutine checking that the process was killed. And the error above seems like a wrong turn of that race. I don't see how the race can be avoided altogether. Not sure how to fix this. Maybe I am wrong about this.
Alex
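For readers who have not looked at the test, the sequence it goes through (as described in the comments on this issue) can be sketched roughly as follows. This is not the actual test code: it uses cat as a stand-in echo helper, assumes a Unix-like system where cat is on PATH, and the function name cancelFlow and the 10 second safety deadline are made up for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os/exec"
	"time"
)

// cancelFlow starts an echo-like subprocess, cancels its context, keeps
// writing until a write fails, waits for the reader to see EOF, and only
// then calls Wait. It returns Wait's result and whatever was echoed back.
func cancelFlow() (waitErr error, echoed string, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := exec.CommandContext(ctx, "cat") // assumption: cat as echo helper
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return nil, "", err
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, "", err
	}
	if err := cmd.Start(); err != nil {
		return nil, "", err
	}

	// Reader goroutine: collect everything until EOF on stdout.
	done := make(chan string, 1)
	go func() {
		b, _ := io.ReadAll(stdout)
		done <- string(b)
	}()

	io.WriteString(stdin, "echo") // one write before the cancel
	cancel()                      // the kill happens asynchronously after this

	// Keep writing until a write fails, taken as a sign the process died.
	deadline := time.Now().Add(10 * time.Second)
	for {
		if _, werr := io.WriteString(stdin, "echo"); werr != nil {
			break
		}
		if time.Now().After(deadline) {
			return nil, "", errors.New("writes kept succeeding after cancel")
		}
		time.Sleep(time.Millisecond)
	}
	stdin.Close()

	echoed = <-done      // reader saw EOF
	waitErr = cmd.Wait() // expected to be non-nil after a kill
	return waitErr, echoed, nil
}

func main() {
	waitErr, _, err := cancelFlow()
	fmt.Println("setup error:", err)
	fmt.Println("Wait reported a failure:", waitErr != nil)
}
```

The flake reported here corresponds to the last step: Wait returning nil even though the earlier steps suggest the subprocess was killed.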