Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: TestContextCancel flaky (on Windows?) #17245

Closed
alexbrainman opened this issue Sep 27, 2016 · 27 comments
Closed

os/exec: TestContextCancel flaky (on Windows?) #17245

alexbrainman opened this issue Sep 27, 2016 · 27 comments
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@alexbrainman
Copy link
Member

What version of Go are you using (go version)?

go version devel +e6143e1 Mon Sep 26 01:51:31 2016 +0000 windows/386

What operating system and processor architecture are you using (go env)?

set GOARCH=386
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=386
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=c:\dev
set GORACE=
set GOROOT=c:\dev\go
set GOTOOLDIR=c:\dev\go\pkg\tool\windows_386
set CC=gcc
set GOGCCFLAGS=-m32 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\DOCUME1\brainman\LOCALS1\Temp\go-build338345670=/tmp/go-build -gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1

What did you do?

I run all.bat.

What did you expect to see?

all.bat run to successful completion.

What did you see instead?

ok      os      2.015s
--- FAIL: TestContextCancel (0.03s)
        exec_test.go:946: program unexpectedly exited successfully
FAIL
FAIL    os/exec 15.669s

This happens only occasionally. I discovered this while working on CL 29700 which makes this test fail more often.

If I understand TestContextCancel correctly, there is an expected race there between goroutine killing process and another goroutine checking that process was killed. And the error above seems like a wrong turn of that race. I don't see how the race can be avoided altogether. Not sure how to fix this. Maybe I am wrong about this.

Alex

@bradfitz bradfitz changed the title os/exec: TestContextCancel fails with "program unexpectedly exited successfully" os/exec: TestContextCancel flaky (on Windows?) Sep 27, 2016
@bradfitz bradfitz added Testing An issue that has been verified to require only test changes, not just a test failure. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 27, 2016
@bradfitz bradfitz added this to the Go1.8 milestone Sep 27, 2016
@rsc
Copy link
Contributor

rsc commented Oct 27, 2016

FWIW, I looked for exec_test.*program.unexpectedly.exited in all the build failure logs going back to early this year, and I don't see any such failures. I don't doubt that this happened to you, @alexbrainman, but it's interesting it is not happening on the builders.

TestContextCancel starts a subprocess that prints strings from stdin back to stdout. It writes "echo" to the process once. Then it calls cancel() to kill that subprocess, and then it keeps writing "echo" to the subprocess until finally a write fails. It assumes that the write failing means the subprocess has died. Then it closes the pipe it was writing on. All this time there was a goroutine reading the subprocess's standard output and saving it by calling t.Log. The main goroutine waits for that reading goroutine to see EOF (again this should happen because the subprocess died). Now that we're doubly sure that the subprocess died, the main goroutine calls c.Wait. That should report the unsuccessful exit of the subprocess, but in the example you posted it gets nil instead.

I have no idea why CL 29700 (use RtlGenRandom instead of CryptGenRandom) would make this fail more often.

It's true that the actual c.Process.Kill happens asynchronously after cancel returns, but that kill should be what makes the write fail, and we don't do any of the other teardown until after the write fails.

If the write failed for some reason other than the subprocess dying, then closing the writing pipe and waiting for the EOF on the reading side could possibly tear down the subprocess normally. You could imagine adding

diff --git a/src/os/exec/exec_test.go b/src/os/exec/exec_test.go
index b143437..d02051a 100644
--- a/src/os/exec/exec_test.go
+++ b/src/os/exec/exec_test.go
@@ -949,6 +949,7 @@ func TestContextCancel(t *testing.T) {
    start := time.Now()
    for {
        if _, err := io.WriteString(w, "echo"); err != nil {
+           t.Logf("write: %v", err)
            break
        }
        if time.Since(start) > time.Second {

to see exactly what error write gets. However, the read side is logging all the messages it gets back, and in the case of a graceful shutdown it should have gotten back the very first "echo" (sent before the cancel), and in the transcript there is nothing logged by the reading goroutine. This suggests the subprocess really did die, and c.Wait somehow failed to report that fact correctly.

Looking at package os's (*Process).wait, I can't see how that could happen. However, this comment is concerning:

// NOTE(brainman): It seems that sometimes process is not dead
// when WaitForSingleObject returns. But we do not know any
// other way to wait for it. Sleeping for a while seems to do
// the trick sometimes. So we will sleep and smell the roses.
defer time.Sleep(5 * time.Millisecond)
defer p.Release()

It really should not be the case that the process is not dead when WaitForSingleObject returns. On top of that, if the process were not dead, then GetExitCodeProcess would return ec=259 (STILL_ACTIVE), which would end up becoming a non-nil error from c.Wait. But c.Wait returned nil. So probably the process really was dead at least in this case.

On the other hand, if somehow GetExitCodeProcess failed (returned false) but we didn't notice the failure, then maybe ec is uninitialized, so zero, which looks like success.

I notice that runtime.asmstdcall returns a uintptr, which is the whole AX register after the DLL call. I wonder if maybe the functions that return a bool only guarantee to set the bottom 8 bits of that register (just AL), in which case the test r1 == 0 in syscall.GetExitCodeProcess would more properly be uint8(r1) == 0. If the high bits had garbage in them and were not expected to be used, that could manifest as a false "true" result. Whether this happened would probably be highly dependent on the specific code being called and what that code happens to have done or not done with the full AX register.

If you can reproduce the problem on demand, maybe try uint8(r1) in GetExitCodeProcess and see if that changes anything.

Otherwise I'm out of ideas.

@alexbrainman
Copy link
Member Author

I can still reproduce this with +db82cf4:

diff --git a/src/os/exec.go b/src/os/exec.go
index 8a53e5d..148e308 100644
--- a/src/os/exec.go
+++ b/src/os/exec.go
@@ -15,6 +15,8 @@ import (
 // Process stores the information about a process created by StartProcess.
 type Process struct {
    Pid    int
+   Alex   error
+   Alex2  syscall.Rusage
    handle uintptr      // handle is accessed atomically on Windows
    isdone uint32       // process has been successfully waited on, non zero if true
    sigMu  sync.RWMutex // avoid race between wait and signal
diff --git a/src/os/exec/exec_test.go b/src/os/exec/exec_test.go
index 8d44401..b24e357 100644
--- a/src/os/exec/exec_test.go
+++ b/src/os/exec/exec_test.go
@@ -924,11 +924,13 @@ func TestContextCancel(t *testing.T) {

    cancel()

+   var alexerr error
    // Calling cancel should have killed the process, so writes
    // should now fail.  Give the process a little while to die.
    start := time.Now()
    for {
        if _, err := io.WriteString(w, "echo"); err != nil {
+           alexerr = err
            break
        }
        if time.Since(start) > time.Second {
@@ -943,6 +945,10 @@ func TestContextCancel(t *testing.T) {
    <-readDone

    if err := c.Wait(); err == nil {
+       fmt.Printf("c.Process.Alex: %v\n", c.Process.Alex)
+       fmt.Printf("c.Process.Alex2: %+v\n", c.Process.Alex2)
+       fmt.Printf("c.ProcessState: %#v\n", c.ProcessState)
+       fmt.Printf("write: %v\n", alexerr)
        t.Error("program unexpectedly exited successfully")
    } else {
        t.Logf("exit status: %v", err)
diff --git a/src/os/exec_windows.go b/src/os/exec_windows.go
index d89db20..c849e55 100644
--- a/src/os/exec_windows.go
+++ b/src/os/exec_windows.go
@@ -26,6 +26,7 @@ func (p *Process) wait() (ps *ProcessState, err error) {
    }
    var ec uint32
    e = syscall.GetExitCodeProcess(syscall.Handle(handle), &ec)
+   p.Alex = e
    if e != nil {
        return nil, NewSyscallError("GetExitCodeProcess", e)
    }
@@ -34,6 +35,7 @@ func (p *Process) wait() (ps *ProcessState, err error) {
    if e != nil {
        return nil, NewSyscallError("GetProcessTimes", e)
    }
+   p.Alex2 = u
    p.setDone()
    // NOTE(brainman): It seems that sometimes process is not dead
    // when WaitForSingleObject returns. But we do not know any

occasionally, if I run test in a loop:

go test -v -i && for /l %x in (1, 1, 100) do go test -short os/exec
...
C:\dev\go\src\os\exec>go test -short
c.Process.Alex: <nil>
c.Process.Alex2: {CreationTime:{LowDateTime:38306894 HighDateTime:30552088} ExitTime:{LowDateTime:38306894 HighDateTime:30552088} KernelTime:{LowDateTime:0 HighDateTime:0} UserTime:{LowDateTime:0 HighDateTime:0}}
c.ProcessState: &os.ProcessState{pid:1732, status:syscall.WaitStatus{ExitCode:0x0}, rusage:(*syscall.Rusage)(0x1102ca80)}
write: write |1: The pipe is being closed.
--- FAIL: TestContextCancel (0.00s)
        exec_test.go:952: program unexpectedly exited successfully
FAIL
exit status 1
FAIL    os/exec 4.422s

C:\dev\go\src\os\exec>go test -short
c.Process.Alex: <nil>
c.Process.Alex2: {CreationTime:{LowDateTime:91900987 HighDateTime:30552088} ExitTime:{LowDateTime:91900987 HighDateTime:30552088} KernelTime:{LowDateTime:0 HighDateTime:0} UserTime:{LowDateTime:0 HighDateTime:0}}
c.ProcessState: &os.ProcessState{pid:5976, status:syscall.WaitStatus{ExitCode:0x0}, rusage:(*syscall.Rusage)(0x10fdc580)}
write: write |1: The pipe is being closed.
--- FAIL: TestContextCancel (0.00s)
        exec_test.go:952: program unexpectedly exited successfully
FAIL
exit status 1
FAIL    os/exec 4.109s

C:\dev\go\src\os\exec>go test -short
...

I do not buy your GetExitCodeProcess return code theory. GetExitCodeProcess returns BOOL, and BOOL is C int

typedef int BOOL;

https://msdn.microsoft.com/en-us/library/windows/desktop/aa383751(v=vs.85).aspx
so it should be the same as uintptr.

And if I start using uint8(r1) == 0 in syscall.GetExitCodeProcess the test still fails for me.

Alex

@bradfitz
Copy link
Contributor

/cc @ianlancetaylor for any thoughts

@alexbrainman
Copy link
Member Author

I have seen this happen a lot, but always on windows xo, never on windows 7 or 10. I spent some time debugging this, but not a lot, given this only affects windows xo users. I am certain there is a bug in our code.

Alex

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/84175 mentions this issue: os: remove unconditional wait sleep on Windows, do it as needed

@ianlancetaylor
Copy link
Contributor

@bradfitz You may be onto something with that sleeping code. MSDN says that GetExitCodeProcess only returns an exit code after the thread terminates. Maybe you should try moving that sleep/SYNCHRONIZE loop up before the call to GetExitCodeProcess.

@bradfitz
Copy link
Contributor

@ianlancetaylor, interesting! Good find. Will experiment.

(That CL wasn't meant to fix this issue. I just discovered it along the way.)

@bradfitz
Copy link
Contributor

I found some discussion on the internet that maybe we shouldn't be closing the thread handle from CreateProcess right away.

In syscall/exec_windows.go we do:

        pi := new(ProcessInformation)

        flags := sys.CreationFlags | CREATE_UNICODE_ENVIRONMENT
        if sys.Token != 0 {
                err = CreateProcessAsUser(sys.Token, argv0p, argvp, nil, nil, true, flags, createEnvBlock(attr.Env), dirp, si, pi)
        } else {
                err = CreateProcess(argv0p, argvp, nil, nil, true, flags, createEnvBlock(attr.Env), dirp, si, pi)
        }
        if err != nil {
                return 0, 0, err
        }
        defer CloseHandle(Handle(pi.Thread))
        
        return int(pi.ProcessId

Note that CreateProcess returns both a thread handle and a process handle, but we discard the thread handle.

@alexbrainman, thoughts?

@bradfitz
Copy link
Contributor

http://www.cplusplus.com/forum/windows/48196/#msg261691 says:

Edit : to answer my own question. Do not close handles prior to checking the return code.

That wording with "handles" plural makes me think they mean both thread+process handles.

@alexbrainman
Copy link
Member Author

I found some discussion on the internet that maybe we shouldn't be closing the thread handle from CreateProcess right away.

If you are talking about this fragment

   WaitForSingleObject(piProcInfo.hProcess, timeOut_in);
   CloseHandle(piProcInfo.hProcess);
   CloseHandle(piProcInfo.hThread);
   DWORD exitCode;
   if(!GetExitCodeProcess(piProcInfo.hProcess, &exitCode)){
	   printf("Exit code = %X\n", exitCode);
   }

from http://www.cplusplus.com/forum/windows/48196/ then the fragment have a bug - it closes process handle piProcInfo.hProcess and then uses it when calling GetExitCodeProcess. That is undefined behavior.

Note that CreateProcess returns both a thread handle and a process handle, but we discard the thread handle.

We don't just "discard" the thread handle, we close the handle. I don't see the problem with that. According to https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425(v=vs.85).aspx : "...
Handles in PROCESS_INFORMATION must be closed with CloseHandle when they are no longer needed. ...". So that is what we do. Even last message in http://www.cplusplus.com/forum/windows/48196/#msg261805 says: "I do usually close the thread handle immediately after a CreateProcess, so I only need to worry about the process handle later.".

The only new interesting thing to this discussion is in http://www.cplusplus.com/forum/windows/48196/#msg261805 CodeMonkey suggests to check for exitCode == STATUS_PENDING. We don't treat STATUS_PENDING in any special way, we just return it as an exit code value. But STATUS_PENDING is not returned in TestContextCancel, because the failure is because child process succeeded, therefore child process returns exitcode 0.

Alex

@johnsonj
Copy link
Contributor

Related? haskell/process#77

@bradfitz
Copy link
Contributor

@johnsonj, totally! Good find.

@johnsonj
Copy link
Contributor

Piecing together a lot of stuff here, but let me know if this makes sense:

A child process (created by syscall.StartProcess) may spawn its own children. A parent process can exit before its child has exited. Waiting on the parent process' completion (os.Process.wait()) spawned by go does not guarantee that all of the children have shut down (and released their resources).

For example:

   foo.exe (process completed)
      |- bar.exe (running): test.txt is open
      |- baz.exe (process completed)

To wait for an entire process tree to exit we need to create a job (CreateJobObject) that will be inherited by all children and is set to terminate when all processes exit (JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE) and create an completion port (CreateIoCompletionPort) that is signaled when the job is completed. When we spawn the process (CreateProcess) we need to start it suspended (CREATE_SUSPENDED), associate it with the job (AssignProcessToJobObject) then finally start its execution (ResumeThread).

During Process.wait() we wait for the signal from the completion port (GetQueuedCompletionStatus instead of WaitForSingleObject). Once we receive JOB_OBJECT_MSG_EXIT_PROCESS we can call GetExitCodeProcess and return.

Issues (potentially) fixed: #17245, #23171
Related feature request for jobs: #17608

If that sounds reasonable I can take a crack at implementing it. Sounds neat.

@bradfitz
Copy link
Contributor

Sounds plausible! Definitely worth implementing to see where it gets us. If that'd eliminate those test flakes and also remove that 5ms sleep in Wait, that'd be great.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/84896 mentions this issue: os, syscall: use jobs to wait for process completion on windows

@golang golang deleted a comment from gopherbot Dec 19, 2017
@alexbrainman
Copy link
Member Author

Just some rumbling ...

I am not sure that we should change os.Process.Wait to wait for "all child process children to exit". I don't see os.Process.Wait documentation describe that. Image a child process that starts its own child (that never exits) and then exits immediately - I expect os.Process.Wait will return immediately now, but will hang forever if we change that behavior. Is it OK to change the behavior?

If the solution turns out to be waiting for grandchildren processes, then the solution needs to be in cmd/go and not in os or os/exec.

Alex

@aclements
Copy link
Member

This has been happening pretty regularly on our new Windows XP builders

As far as I can tell, this went from never happening on the XP builders to happening 100% of the time with CL 81895:

$ greplogs -dashboard -E TestContextCancel -E "program unexpectedly exited successfully" -l | findflakes -paths
First observed 29be20a 11 Dec 19:08 2017 (75 commits ago)
Last observed  3645418 03 Jan 19:30 2018 (1 commits ago)
47% chance failure is still happening
53% failure probability (40 of 75 commits)
Likely culprits:
   53% 29be20a cmd/go: invalidate cached test results if env vars or files change
   25% 8c22776 internal/poll: fix error in increfAndClose documentation
   12% 043f112 runtime: reset write barrier buffer on all flush paths
    6% 3675bff runtime: mark heapBits.bits nosplit
Past failures:
  1e679ae 22 Nov 01:18 2017 to 2065685 01 Dec 00:42 2017
    2.1% failure probability (3 of 98 commits)

findflakes says 53%, but manually sampling the 47% where it "didn't" fail shows that it failed earlier and just didn't run this test at all.

It seems unlikely that CL really affected this test, but the change caused by that CL may be useful in debugging the failure.

@bradfitz
Copy link
Contributor

bradfitz commented Jan 5, 2018

@johnsonj, what's the status of this?

@johnsonj
Copy link
Contributor

johnsonj commented Jan 5, 2018

I've been AFK for the holidays but can prototype this more next week. @alexbrainman point is spot on that we don't want to unconditionally wait on all child processes to exit so if this solution proves to fix the flake we'll need some sort of option for the tests to use.

@bradfitz
Copy link
Contributor

It's too late in Go 1.10 to make big changes to how process management works on Windows anyway, so moving to Go 1.11. But we'd really like to see this fixed and to remove that 5ms sleep.

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Jan 10, 2018
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/87257 mentions this issue: os/exec: skip TestContextCancel on the Windows XP builder

gopherbot pushed a commit that referenced this issue Jan 10, 2018
Updates #17245

Change-Id: I3d7ea362809040fbbba4b33efd57bf2d27d4c390
Reviewed-on: https://go-review.googlesource.com/87257
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/94255 mentions this issue: all: drop support for Windows Vista or below

gopherbot pushed a commit that referenced this issue Feb 15, 2018
Per the notice in the Go 1.10 release notes, this change drops the
support for Windows Vista or below (including Windows XP) and
simplifies the code for the sake of maintenance.

There is one exception to the above. The code related to DLL and
system calls still remains in the runtime package. The remaining code
will be refined and used for supporting upcoming Windows versions in
future.

Updates #17245
Fixes #23072

Change-Id: I9e2821721f25ef9b83dfbf85be2b7ee5d9023aa5
Reviewed-on: https://go-review.googlesource.com/94255
Run-TryBot: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
@zhangyoufu
Copy link
Contributor

Since Windows XP is not supported any more, maybe this issue should be closed?

@bradfitz
Copy link
Contributor

We can close this bug if there's another bug tracking removing the 5ms sleep on Windows.

@johnsonj
Copy link
Contributor

There isn't a bug tracking the removal afaik

@bradfitz
Copy link
Contributor

There is now. I filed #25965

Will close this bug.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

8 participants