os/exec: TestContextCancel flaky (on Windows?) #17245

alexbrainman · 2016-09-27T01:09:35Z

What version of Go are you using (`go version`)?

go version devel +e6143e1 Mon Sep 26 01:51:31 2016 +0000 windows/386

What operating system and processor architecture are you using (`go env`)?

set GOARCH=386
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=386
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=c:\dev
set GORACE=
set GOROOT=c:\dev\go
set GOTOOLDIR=c:\dev\go\pkg\tool\windows_386
set CC=gcc
set GOGCCFLAGS=-m32 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\DOCUME~~1\brainman\LOCALS~~1\Temp\go-build338345670=/tmp/go-build -gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1

What did you do?

I run all.bat.

What did you expect to see?

all.bat run to successful completion.

What did you see instead?

ok      os      2.015s
--- FAIL: TestContextCancel (0.03s)
        exec_test.go:946: program unexpectedly exited successfully
FAIL
FAIL    os/exec 15.669s

This happens only occasionally. I discovered this while working on CL 29700 which makes this test fail more often.

If I understand TestContextCancel correctly, there is an expected race there between goroutine killing process and another goroutine checking that process was killed. And the error above seems like a wrong turn of that race. I don't see how the race can be avoided altogether. Not sure how to fix this. Maybe I am wrong about this.

Alex

The text was updated successfully, but these errors were encountered:

rsc · 2016-10-27T01:54:17Z

FWIW, I looked for exec_test.*program.unexpectedly.exited in all the build failure logs going back to early this year, and I don't see any such failures. I don't doubt that this happened to you, @alexbrainman, but it's interesting it is not happening on the builders.

TestContextCancel starts a subprocess that prints strings from stdin back to stdout. It writes "echo" to the process once. Then it calls cancel() to kill that subprocess, and then it keeps writing "echo" to the subprocess until finally a write fails. It assumes that the write failing means the subprocess has died. Then it closes the pipe it was writing on. All this time there was a goroutine reading the subprocess's standard output and saving it by calling t.Log. The main goroutine waits for that reading goroutine to see EOF (again this should happen because the subprocess died). Now that we're doubly sure that the subprocess died, the main goroutine calls c.Wait. That should report the unsuccessful exit of the subprocess, but in the example you posted it gets nil instead.

I have no idea why CL 29700 (use RtlGenRandom instead of CryptGenRandom) would make this fail more often.

It's true that the actual c.Process.Kill happens asynchronously after cancel returns, but that kill should be what makes the write fail, and we don't do any of the other teardown until after the write fails.

If the write failed for some reason other than the subprocess dying, then closing the writing pipe and waiting for the EOF on the reading side could possibly tear down the subprocess normally. You could imagine adding

diff --git a/src/os/exec/exec_test.go b/src/os/exec/exec_test.go
index b143437..d02051a 100644
--- a/src/os/exec/exec_test.go
+++ b/src/os/exec/exec_test.go
@@ -949,6 +949,7 @@ func TestContextCancel(t *testing.T) {
    start := time.Now()
    for {
        if _, err := io.WriteString(w, "echo"); err != nil {
+           t.Logf("write: %v", err)
            break
        }
        if time.Since(start) > time.Second {

to see exactly what error write gets. However, the read side is logging all the messages it gets back, and in the case of a graceful shutdown it should have gotten back the very first "echo" (sent before the cancel), and in the transcript there is nothing logged by the reading goroutine. This suggests the subprocess really did die, and c.Wait somehow failed to report that fact correctly.

Looking at package os's (*Process).wait, I can't see how that could happen. However, this comment is concerning:

// NOTE(brainman): It seems that sometimes process is not dead
// when WaitForSingleObject returns. But we do not know any
// other way to wait for it. Sleeping for a while seems to do
// the trick sometimes. So we will sleep and smell the roses.
defer time.Sleep(5 * time.Millisecond)
defer p.Release()

It really should not be the case that the process is not dead when WaitForSingleObject returns. On top of that, if the process were not dead, then GetExitCodeProcess would return ec=259 (STILL_ACTIVE), which would end up becoming a non-nil error from c.Wait. But c.Wait returned nil. So probably the process really was dead at least in this case.

On the other hand, if somehow GetExitCodeProcess failed (returned false) but we didn't notice the failure, then maybe ec is uninitialized, so zero, which looks like success.

I notice that runtime.asmstdcall returns a uintptr, which is the whole AX register after the DLL call. I wonder if maybe the functions that return a bool only guarantee to set the bottom 8 bits of that register (just AL), in which case the test r1 == 0 in syscall.GetExitCodeProcess would more properly be uint8(r1) == 0. If the high bits had garbage in them and were not expected to be used, that could manifest as a false "true" result. Whether this happened would probably be highly dependent on the specific code being called and what that code happens to have done or not done with the full AX register.

If you can reproduce the problem on demand, maybe try uint8(r1) in GetExitCodeProcess and see if that changes anything.

Otherwise I'm out of ideas.

alexbrainman · 2016-10-27T06:28:04Z

I can still reproduce this with +db82cf4:

diff --git a/src/os/exec.go b/src/os/exec.go
index 8a53e5d..148e308 100644
--- a/src/os/exec.go
+++ b/src/os/exec.go
@@ -15,6 +15,8 @@ import (
 // Process stores the information about a process created by StartProcess.
 type Process struct {
    Pid    int
+   Alex   error
+   Alex2  syscall.Rusage
    handle uintptr      // handle is accessed atomically on Windows
    isdone uint32       // process has been successfully waited on, non zero if true
    sigMu  sync.RWMutex // avoid race between wait and signal
diff --git a/src/os/exec/exec_test.go b/src/os/exec/exec_test.go
index 8d44401..b24e357 100644
--- a/src/os/exec/exec_test.go
+++ b/src/os/exec/exec_test.go
@@ -924,11 +924,13 @@ func TestContextCancel(t *testing.T) {

    cancel()

+   var alexerr error
    // Calling cancel should have killed the process, so writes
    // should now fail.  Give the process a little while to die.
    start := time.Now()
    for {
        if _, err := io.WriteString(w, "echo"); err != nil {
+           alexerr = err
            break
        }
        if time.Since(start) > time.Second {
@@ -943,6 +945,10 @@ func TestContextCancel(t *testing.T) {
    <-readDone

    if err := c.Wait(); err == nil {
+       fmt.Printf("c.Process.Alex: %v\n", c.Process.Alex)
+       fmt.Printf("c.Process.Alex2: %+v\n", c.Process.Alex2)
+       fmt.Printf("c.ProcessState: %#v\n", c.ProcessState)
+       fmt.Printf("write: %v\n", alexerr)
        t.Error("program unexpectedly exited successfully")
    } else {
        t.Logf("exit status: %v", err)
diff --git a/src/os/exec_windows.go b/src/os/exec_windows.go
index d89db20..c849e55 100644
--- a/src/os/exec_windows.go
+++ b/src/os/exec_windows.go
@@ -26,6 +26,7 @@ func (p *Process) wait() (ps *ProcessState, err error) {
    }
    var ec uint32
    e = syscall.GetExitCodeProcess(syscall.Handle(handle), &ec)
+   p.Alex = e
    if e != nil {
        return nil, NewSyscallError("GetExitCodeProcess", e)
    }
@@ -34,6 +35,7 @@ func (p *Process) wait() (ps *ProcessState, err error) {
    if e != nil {
        return nil, NewSyscallError("GetProcessTimes", e)
    }
+   p.Alex2 = u
    p.setDone()
    // NOTE(brainman): It seems that sometimes process is not dead
    // when WaitForSingleObject returns. But we do not know any

occasionally, if I run test in a loop:

go test -v -i && for /l %x in (1, 1, 100) do go test -short os/exec

...
C:\dev\go\src\os\exec>go test -short
c.Process.Alex: <nil>
c.Process.Alex2: {CreationTime:{LowDateTime:38306894 HighDateTime:30552088} ExitTime:{LowDateTime:38306894 HighDateTime:30552088} KernelTime:{LowDateTime:0 HighDateTime:0} UserTime:{LowDateTime:0 HighDateTime:0}}
c.ProcessState: &os.ProcessState{pid:1732, status:syscall.WaitStatus{ExitCode:0x0}, rusage:(*syscall.Rusage)(0x1102ca80)}
write: write |1: The pipe is being closed.
--- FAIL: TestContextCancel (0.00s)
        exec_test.go:952: program unexpectedly exited successfully
FAIL
exit status 1
FAIL    os/exec 4.422s

C:\dev\go\src\os\exec>go test -short
c.Process.Alex: <nil>
c.Process.Alex2: {CreationTime:{LowDateTime:91900987 HighDateTime:30552088} ExitTime:{LowDateTime:91900987 HighDateTime:30552088} KernelTime:{LowDateTime:0 HighDateTime:0} UserTime:{LowDateTime:0 HighDateTime:0}}
c.ProcessState: &os.ProcessState{pid:5976, status:syscall.WaitStatus{ExitCode:0x0}, rusage:(*syscall.Rusage)(0x10fdc580)}
write: write |1: The pipe is being closed.
--- FAIL: TestContextCancel (0.00s)
        exec_test.go:952: program unexpectedly exited successfully
FAIL
exit status 1
FAIL    os/exec 4.109s

C:\dev\go\src\os\exec>go test -short
...

I do not buy your GetExitCodeProcess return code theory. GetExitCodeProcess returns BOOL, and BOOL is C int

typedef int BOOL;

https://msdn.microsoft.com/en-us/library/windows/desktop/aa383751(v=vs.85).aspx
so it should be the same as uintptr.

And if I start using uint8(r1) == 0 in syscall.GetExitCodeProcess the test still fails for me.

Alex

bradfitz · 2017-12-13T17:20:47Z

This has been happening pretty regularly on our new Windows XP builders, @rsc.

https://build.golang.org/log/929cb405f2485a5071853e9912feb3ddcec91ab1
https://build.golang.org/log/add46db4ba87885622dcd34ca43654b086717e75
https://build.golang.org/log/2384b6530461bbf2363241f2b6e47fea5fe4ab2e
etc

Also, very similar:
https://build.golang.org/log/c808dfad032954fa1936cbfa6e0aa2a8aec41fd5

And with both:
https://build.golang.org/log/499976e810d424df182331b25db509c3f66d004a

bradfitz · 2017-12-13T17:21:07Z

/cc @ianlancetaylor for any thoughts

alexbrainman · 2017-12-13T21:06:24Z

I have seen this happen a lot, but always on windows xo, never on windows 7 or 10. I spent some time debugging this, but not a lot, given this only affects windows xo users. I am certain there is a bug in our code.

Alex

gopherbot · 2017-12-14T23:25:10Z

Change https://golang.org/cl/84175 mentions this issue: os: remove unconditional wait sleep on Windows, do it as needed

ianlancetaylor · 2017-12-15T01:12:29Z

@bradfitz You may be onto something with that sleeping code. MSDN says that GetExitCodeProcess only returns an exit code after the thread terminates. Maybe you should try moving that sleep/SYNCHRONIZE loop up before the call to GetExitCodeProcess.

bradfitz · 2017-12-15T01:13:36Z

@ianlancetaylor, interesting! Good find. Will experiment.

(That CL wasn't meant to fix this issue. I just discovered it along the way.)

bradfitz · 2017-12-15T18:03:27Z

I found some discussion on the internet that maybe we shouldn't be closing the thread handle from CreateProcess right away.

In syscall/exec_windows.go we do:

        pi := new(ProcessInformation)

        flags := sys.CreationFlags | CREATE_UNICODE_ENVIRONMENT
        if sys.Token != 0 {
                err = CreateProcessAsUser(sys.Token, argv0p, argvp, nil, nil, true, flags, createEnvBlock(attr.Env), dirp, si, pi)
        } else {
                err = CreateProcess(argv0p, argvp, nil, nil, true, flags, createEnvBlock(attr.Env), dirp, si, pi)
        }
        if err != nil {
                return 0, 0, err
        }
        defer CloseHandle(Handle(pi.Thread))
        
        return int(pi.ProcessId

Note that CreateProcess returns both a thread handle and a process handle, but we discard the thread handle.

@alexbrainman, thoughts?

bradfitz · 2017-12-15T18:04:26Z

http://www.cplusplus.com/forum/windows/48196/#msg261691 says:

Edit : to answer my own question. Do not close handles prior to checking the return code.

That wording with "handles" plural makes me think they mean both thread+process handles.

alexbrainman · 2017-12-15T22:36:57Z

I found some discussion on the internet that maybe we shouldn't be closing the thread handle from CreateProcess right away.

If you are talking about this fragment

   WaitForSingleObject(piProcInfo.hProcess, timeOut_in);
   CloseHandle(piProcInfo.hProcess);
   CloseHandle(piProcInfo.hThread);
   DWORD exitCode;
   if(!GetExitCodeProcess(piProcInfo.hProcess, &exitCode)){
	   printf("Exit code = %X\n", exitCode);
   }

from http://www.cplusplus.com/forum/windows/48196/ then the fragment have a bug - it closes process handle piProcInfo.hProcess and then uses it when calling GetExitCodeProcess. That is undefined behavior.

Note that CreateProcess returns both a thread handle and a process handle, but we discard the thread handle.

We don't just "discard" the thread handle, we close the handle. I don't see the problem with that. According to https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425(v=vs.85).aspx : "...
Handles in PROCESS_INFORMATION must be closed with CloseHandle when they are no longer needed. ...". So that is what we do. Even last message in http://www.cplusplus.com/forum/windows/48196/#msg261805 says: "I do usually close the thread handle immediately after a CreateProcess, so I only need to worry about the process handle later.".

The only new interesting thing to this discussion is in http://www.cplusplus.com/forum/windows/48196/#msg261805 CodeMonkey suggests to check for exitCode == STATUS_PENDING. We don't treat STATUS_PENDING in any special way, we just return it as an exit code value. But STATUS_PENDING is not returned in TestContextCancel, because the failure is because child process succeeded, therefore child process returns exitcode 0.

Alex

johnsonj · 2017-12-18T21:16:36Z

Related? haskell/process#77

bradfitz · 2017-12-18T21:21:50Z

@johnsonj, totally! Good find.

johnsonj · 2017-12-19T01:32:10Z

Piecing together a lot of stuff here, but let me know if this makes sense:

A child process (created by syscall.StartProcess) may spawn its own children. A parent process can exit before its child has exited. Waiting on the parent process' completion (os.Process.wait()) spawned by go does not guarantee that all of the children have shut down (and released their resources).

For example:

   foo.exe (process completed)
      |- bar.exe (running): test.txt is open
      |- baz.exe (process completed)

To wait for an entire process tree to exit we need to create a job (CreateJobObject) that will be inherited by all children and is set to terminate when all processes exit (JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE) and create an completion port (CreateIoCompletionPort) that is signaled when the job is completed. When we spawn the process (CreateProcess) we need to start it suspended (CREATE_SUSPENDED), associate it with the job (AssignProcessToJobObject) then finally start its execution (ResumeThread).

During Process.wait() we wait for the signal from the completion port (GetQueuedCompletionStatus instead of WaitForSingleObject). Once we receive JOB_OBJECT_MSG_EXIT_PROCESS we can call GetExitCodeProcess and return.

Issues (potentially) fixed: #17245, #23171
Related feature request for jobs: #17608

If that sounds reasonable I can take a crack at implementing it. Sounds neat.

bradfitz · 2017-12-19T01:40:31Z

Sounds plausible! Definitely worth implementing to see where it gets us. If that'd eliminate those test flakes and also remove that 5ms sleep in Wait, that'd be great.

gopherbot · 2017-12-19T22:22:41Z

Change https://golang.org/cl/84896 mentions this issue: os, syscall: use jobs to wait for process completion on windows

alexbrainman · 2017-12-25T01:12:00Z

Just some rumbling ...

I am not sure that we should change os.Process.Wait to wait for "all child process children to exit". I don't see os.Process.Wait documentation describe that. Image a child process that starts its own child (that never exits) and then exits immediately - I expect os.Process.Wait will return immediately now, but will hang forever if we change that behavior. Is it OK to change the behavior?

If the solution turns out to be waiting for grandchildren processes, then the solution needs to be in cmd/go and not in os or os/exec.

Alex

aclements · 2018-01-03T19:58:46Z

This has been happening pretty regularly on our new Windows XP builders

As far as I can tell, this went from never happening on the XP builders to happening 100% of the time with CL 81895:

$ greplogs -dashboard -E TestContextCancel -E "program unexpectedly exited successfully" -l | findflakes -paths
First observed 29be20a 11 Dec 19:08 2017 (75 commits ago)
Last observed  3645418 03 Jan 19:30 2018 (1 commits ago)
47% chance failure is still happening
53% failure probability (40 of 75 commits)
Likely culprits:
   53% 29be20a cmd/go: invalidate cached test results if env vars or files change
   25% 8c22776 internal/poll: fix error in increfAndClose documentation
   12% 043f112 runtime: reset write barrier buffer on all flush paths
    6% 3675bff runtime: mark heapBits.bits nosplit
Past failures:
  1e679ae 22 Nov 01:18 2017 to 2065685 01 Dec 00:42 2017
    2.1% failure probability (3 of 98 commits)

findflakes says 53%, but manually sampling the 47% where it "didn't" fail shows that it failed earlier and just didn't run this test at all.

It seems unlikely that CL really affected this test, but the change caused by that CL may be useful in debugging the failure.

bradfitz · 2018-01-05T21:41:55Z

@johnsonj, what's the status of this?

johnsonj · 2018-01-05T22:26:34Z

I've been AFK for the holidays but can prototype this more next week. @alexbrainman point is spot on that we don't want to unconditionally wait on all child processes to exit so if this solution proves to fix the flake we'll need some sort of option for the tests to use.

bradfitz · 2018-01-10T19:13:15Z

It's too late in Go 1.10 to make big changes to how process management works on Windows anyway, so moving to Go 1.11. But we'd really like to see this fixed and to remove that 5ms sleep.

gopherbot · 2018-01-10T19:16:38Z

Change https://golang.org/cl/87257 mentions this issue: os/exec: skip TestContextCancel on the Windows XP builder

Updates #17245 Change-Id: I3d7ea362809040fbbba4b33efd57bf2d27d4c390 Reviewed-on: https://go-review.googlesource.com/87257 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>

gopherbot · 2018-02-15T08:49:16Z

Change https://golang.org/cl/94255 mentions this issue: all: drop support for Windows Vista or below

Per the notice in the Go 1.10 release notes, this change drops the support for Windows Vista or below (including Windows XP) and simplifies the code for the sake of maintenance. There is one exception to the above. The code related to DLL and system calls still remains in the runtime package. The remaining code will be refined and used for supporting upcoming Windows versions in future. Updates #17245 Fixes #23072 Change-Id: I9e2821721f25ef9b83dfbf85be2b7ee5d9023aa5 Reviewed-on: https://go-review.googlesource.com/94255 Run-TryBot: Brad Fitzpatrick <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Brad Fitzpatrick <[email protected]>

zhangyoufu · 2018-06-13T10:25:02Z

Since Windows XP is not supported any more, maybe this issue should be closed?

bradfitz · 2018-06-13T13:50:06Z

We can close this bug if there's another bug tracking removing the 5ms sleep on Windows.

johnsonj · 2018-06-18T15:43:17Z

There isn't a bug tracking the removal afaik

bradfitz · 2018-06-19T22:08:57Z

There is now. I filed #25965

Will close this bug.

bradfitz changed the title ~~os/exec: TestContextCancel fails with "program unexpectedly exited successfully"~~ os/exec: TestContextCancel flaky (on Windows?) Sep 27, 2016

bradfitz added Testing An issue that has been verified to require only test changes, not just a test failure. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 27, 2016

bradfitz added this to the Go1.8 milestone Sep 27, 2016

rsc mentioned this issue Oct 27, 2016

os: using CommandLineToArgv slows process startup significantly #15588

Closed

rsc modified the milestones: Go1.9, Go1.8 Nov 11, 2016

alexbrainman mentioned this issue Feb 21, 2017

os/exec: flaky TestStdinCloseRace #19211

Closed

bradfitz added help wanted OS-Windows labels Jul 6, 2017

bradfitz modified the milestones: Go1.10, Go1.9 Jul 6, 2017

golang deleted a comment from gopherbot Dec 19, 2017

bradfitz assigned johnsonj Jan 10, 2018

bradfitz modified the milestones: Go1.10, Go1.11 Jan 10, 2018

alexbrainman mentioned this issue Apr 6, 2018

all: test failures after calling all.bat on Windows7 #24709

Closed

ianlancetaylor mentioned this issue Jun 13, 2018

os/exec: Wait behavior changed from Go 1.9 to Go 1.10 on Windows #25835

Closed

bradfitz mentioned this issue Jun 19, 2018

os: remove 5ms sleep on Windows in (*Process).Wait #25965

Closed

bradfitz closed this as completed Jun 19, 2018

bradfitz mentioned this issue Oct 24, 2018

os: remove 5ms exec sleep on Windows #28371

Closed

jchv mentioned this issue Jan 13, 2019

Add Windows support bazelbuild/bazel-watcher#144

Merged

golang locked and limited conversation to collaborators Jun 19, 2019

gopherbot added the FrozenDueToAge label Jun 19, 2019

rsc unassigned johnsonj Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

os/exec: TestContextCancel flaky (on Windows?) #17245

os/exec: TestContextCancel flaky (on Windows?) #17245

alexbrainman commented Sep 27, 2016

rsc commented Oct 27, 2016

alexbrainman commented Oct 27, 2016

bradfitz commented Dec 13, 2017

bradfitz commented Dec 13, 2017

alexbrainman commented Dec 13, 2017

gopherbot commented Dec 14, 2017

ianlancetaylor commented Dec 15, 2017

bradfitz commented Dec 15, 2017

bradfitz commented Dec 15, 2017

bradfitz commented Dec 15, 2017

alexbrainman commented Dec 15, 2017

johnsonj commented Dec 18, 2017

bradfitz commented Dec 18, 2017

johnsonj commented Dec 19, 2017

bradfitz commented Dec 19, 2017

gopherbot commented Dec 19, 2017

alexbrainman commented Dec 25, 2017

aclements commented Jan 3, 2018

bradfitz commented Jan 5, 2018

johnsonj commented Jan 5, 2018

bradfitz commented Jan 10, 2018

gopherbot commented Jan 10, 2018

gopherbot commented Feb 15, 2018

zhangyoufu commented Jun 13, 2018

bradfitz commented Jun 13, 2018

johnsonj commented Jun 18, 2018

bradfitz commented Jun 19, 2018

os/exec: TestContextCancel flaky (on Windows?) #17245

os/exec: TestContextCancel flaky (on Windows?) #17245

Comments

alexbrainman commented Sep 27, 2016

What version of Go are you using (go version)?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

rsc commented Oct 27, 2016

alexbrainman commented Oct 27, 2016

bradfitz commented Dec 13, 2017

bradfitz commented Dec 13, 2017

alexbrainman commented Dec 13, 2017

gopherbot commented Dec 14, 2017

ianlancetaylor commented Dec 15, 2017

bradfitz commented Dec 15, 2017

bradfitz commented Dec 15, 2017

bradfitz commented Dec 15, 2017

alexbrainman commented Dec 15, 2017

johnsonj commented Dec 18, 2017

bradfitz commented Dec 18, 2017

johnsonj commented Dec 19, 2017

bradfitz commented Dec 19, 2017

gopherbot commented Dec 19, 2017

alexbrainman commented Dec 25, 2017

aclements commented Jan 3, 2018

bradfitz commented Jan 5, 2018

johnsonj commented Jan 5, 2018

bradfitz commented Jan 10, 2018

gopherbot commented Jan 10, 2018

gopherbot commented Feb 15, 2018

zhangyoufu commented Jun 13, 2018

bradfitz commented Jun 13, 2018

johnsonj commented Jun 18, 2018

bradfitz commented Jun 19, 2018

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?