syscall: memory corruption when forking on OpenBSD, NetBSD, AIX, and Solaris #34988
Comments
I missed that 1.13.3 was also released yesterday. Currently updating to that and will report whether this is still an issue.
This looks like cmd/go crashing while building the test, not the test itself.
@jrick maybe you meant this in your original post, but I just want to be clear. Does this reproduce with Go 1.12.x or older versions of Go? Since we have a reasonable reproducer, the next step to me would be to just bisect what went into Go 1.13, if we know it isn't reproducing in Go 1.12. I genuinely have no idea what this could be. I thought at first that it could be scavenging related, but that's highly unlikely for a number of reasons. I won't rule it out yet, though.
I haven't tested 1.12.x but will follow up testing that next. Currently hammering this test with 1.13.3, and so far it has not failed, but my application built with 1.13.3 still fails with SIGBUS (could be unrelated).
@mknyszek it still hasn't failed on 1.13.3 (running close to an hour now) but quickly failed on 1.12.12.
1.13.3 finally errored after an hour. More errors from 1.13.3:
This remains a problem in 1.13.5, so it's not addressed by the recent fixes to the go tool. |
This may be fork/exec related. This program exhibits similar crashes on OpenBSD 6.7 and Go 1.14.3.
Crash trace: https://gist.github.com/jrick/8d6ef72796a772668b891310a18dd805
Synchronizing the os/exec call with an additional mutex appears to remove the crash.
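(As an illustration of that workaround, and not the reporter's actual program, serializing the exec calls behind a mutex might look like the sketch below; the names here are mine.)

```go
package main

import (
	"os/exec"
	"sync"
)

// execMu serializes every fork/exec; illustrative name, not from the
// original program.
var execMu sync.Mutex

func runTrue() error {
	execMu.Lock()
	defer execMu.Unlock()
	return exec.Command("/usr/bin/true").Run()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				if err := runTrue(); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()
}
```

If the root cause is the COW/TLB race described later in this thread, the mutex presumably only narrows the window for concurrent faults during fork rather than fixing anything.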
Thanks for the stack trace. That looks very much like a forked child process is changing the memory seen by the parent process, which should of course be impossible. Specifically it seems that
I'm seeing another strange thing in addition to that crash. Sometimes the program will run forever, spinning CPU, but appears to be deadlocked, because none of the pids of those true processes are ever changing. Here's the trace after sending SIGQUIT: https://gist.github.com/jrick/74aaa63624961145b7bc7b9518da75e1
I am currently testing with this OpenBSD kernel patch to the virtual memory system: https://marc.info/?l=openbsd-tech&m=160008279223088&w=2 However, these crashes still persist. Another interesting data point: so far it appears that this only reproduces on AMD Ryzen CPUs, and not any Intel ones.
https://build.golang.org/log/3f45171bc52a0a86435abb9f795c0e8a45c4a0b0 looks similar:
https://storage.googleapis.com/go-build-log/abee19ae/openbsd-amd64-68_0f13ec3d.log (a TryBot) looks like it could plausibly be from a fork syscall. |
I'm not sure when this changed, but since returning to this issue I haven't been able to reproduce with my minimal test case again on the same hardware with OpenBSD 7.0-current and Go 1.17.3. I suspect it's due to some OpenBSD fix if the 6.8 builders are still hitting this. (Also, 6.8 is no longer a supported OpenBSD version; I don't think it makes much sense to continue testing with it.)
Spoke too soon:
It took far longer than with 1.17.3, but a very similar crash (in scanstack) still occurs with
I can also reproduce crashes on netbsd-386 and netbsd-amd64 with #34988 (comment) on AMD, of the form:
as well as #49453
Some observations I've made (from netbsd-amd64): The crashes still seem to occur with GOMAXPROCS=1; however, Go still has some background threads in this case. Disabling sysmon and GC makes this program truly single-threaded:
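(The runtime patch itself isn't quoted above. Purely as a sketch of the user-visible half of that configuration, and not the referenced patch, GC can be disabled and GOMAXPROCS pinned from the program itself; sysmon has no user-level switch, so disabling it does require modifying the runtime.)

```go
package main

import (
	"os/exec"
	"runtime"
	"runtime/debug"
)

func main() {
	runtime.GOMAXPROCS(1)  // one P, though runtime background threads still exist
	debug.SetGCPercent(-1) // disable the garbage collector
	// sysmon cannot be turned off from user code; the patch referenced above
	// disables it inside the runtime (not shown here).

	for {
		if err := exec.Command("/usr/bin/true").Run(); err != nil {
			panic(err)
		}
	}
}
```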
Once the program is truly single-threaded, the crashes disappear. Setting GOMAXPROCS=2 with this patch brings the crashes back. Here is a slightly simplified reproducer:

```go
package main

import (
	"os/exec"
	"runtime"
)

func main() {
	go func() {
		for {
			err := exec.Command("/usr/bin/true").Run()
			if err != nil {
				panic(err)
			}
		}
	}()

	for {
		runtime.Gosched()
	}
}
```

This version has only a single forker, but crashes about as quickly. The (cc @aclements @mknyszek)
More observations:
I've simplified that repro even further:

```go
package main

import (
	//"runtime"
	"syscall"
)

// fork has no Go body here; it is implemented in a separate assembly file
// (not shown in this excerpt).
func fork() int32

func main() {
	go func() {
		for {
			pid := fork()
			syscall.Syscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
			//syscall.RawSyscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
		}
	}()

	for {
		syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
		//runtime.Gosched()
	}
}
```
The key parts here:
The crashes I get with this look like (source):
This is complaining that the assertion
The one case I've caught in GDB looks like (stopped just inside the failing branch):
From the assembly,
Of course, I can't really tell if that memory location read as zero, or if the register was cleared after the load somehow.
Change https://go.dev/cl/439196 mentions this issue:
This cuts the wall duration for 'go test os/exec' and 'go test -race os/exec' roughly in half on my machine, which is an even more significant speedup with a high '-count'. For better or for worse, it may also increase the repro rate of #34988.

Tests that use Setenv or Chdir or check for FDs opened during the test still cannot be parallelized, but there are only a few of those.

Change-Id: I8d284d8bff05787853f825ef144aeb7a4126847f
Reviewed-on: https://go-review.googlesource.com/c/go/+/439196
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
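(A brief aside on the parallelization constraint mentioned in that change: testing.T.Setenv is documented as unusable in parallel tests and panics if t.Parallel has been called, which is why those tests stay serial. A minimal sketch with made-up test names:)

```go
package exec_test

import "testing"

// TestCanBeParallel is a hypothetical test that touches no process-wide
// state, so it can opt in to parallel execution.
func TestCanBeParallel(t *testing.T) {
	t.Parallel()
	// ... exercise exec.Command here ...
}

// TestNeedsEnv is a hypothetical test that mutates the environment; adding
// t.Parallel here would make the t.Setenv call panic.
func TestNeedsEnv(t *testing.T) {
	t.Setenv("SOME_VAR", "value")
	// ... assertions that depend on SOME_VAR ...
}
```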
While debugging oxidecomputer/omicron#1146 I saw that this bug mentions Solaris and wondered if it might affect illumos as well, since the failure modes look the same for my issue. For the record, I don't think my issue was caused by this one. I ran the Go and C test programs for several days without issue, and I ultimately root-caused my issue to illumos#15254. I mention this in case anyone in the future is wondering if illumos is affected by this. I don't know whether Solaris (or any other system) has the same issue with preserving the %ymm registers across signal handlers, but that can clearly cause the same failure modes shown here. |
Found new dashboard test flakes for:
2023-01-06 17:30 netbsd-amd64-9_3 tools@36bd3dbc go@476384ec x/tools/gopls/internal/regtest/workspace.TestReloadOnlyOnce (log)
sys/uvm/uvm_fault.c: revision 1.234

uvm: prevent TLB invalidation races during COW resolution

When a thread takes a page fault which results in COW resolution, other threads in the same process can be concurrently accessing that same mapping on other CPUs. When the faulting thread updates the pmap entry at the end of COW processing, the resulting TLB invalidations to other CPUs are not done atomically, so another thread can write to the new writable page and then a third thread might still read from the old read-only page, resulting in inconsistent views of the page by the latter two threads.

Fix this by removing the pmap entry entirely for the original page before we install the new pmap entry for the new page, so that the new page can only be modified after the old page is no longer accessible.

This fixes PR 56535 as well as the netbsd versions of problems described in various bug trackers:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
https://reviews.freebsd.org/D14347
golang/go#34988
Found new dashboard test flakes for:
2024-03-20 14:17 netbsd-amd64-9_3 go@e39af550 cmd/go.TestScript (log)
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

What did you do?
I observed these issues in one of my applications, and assumed it was a race or invalid unsafe.Pointer usage or some other fault of the application code. When the 1.13.2 release dropped yesterday I built it from source and observed a similar issue running the regression tests. The failed regression test does not look related to the memory corruption, but I can reproduce the problem by repeatedly running the test in a loop:
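(The exact command isn't quoted in this excerpt. As a stand-in only, with a placeholder invocation rather than the one actually used, a Go driver that reruns a test command until it fails could look like this:)

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Placeholder invocation; substitute the actual regression-test command.
	args := []string{"go", "run", "run.go", "--", "fixedbugs"}
	for i := 1; ; i++ {
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintf(os.Stderr, "failed on iteration %d: %v\n", i, err)
			os.Exit(1)
		}
	}
}
```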
It can take several minutes to observe the issue, but here are some of the captured panics and fatal runtime errors:
https://gist.githubusercontent.com/jrick/f8b21ecbfbe516e1282b757d1bfe4165/raw/6cf0efb9ba47ba869f98817ce945971f2dff47d6/gistfile1.txt
https://gist.githubusercontent.com/jrick/9a54c085b918aa32910f4ece84e5aa21/raw/91ec29275c2eb1be49f62ad8a01a5317ad168c94/gistfile1.txt
https://gist.githubusercontent.com/jrick/8faf088593331c104cc0da0adb3f24da/raw/7c92e7e7d60d426b2156fd1bdff42e0717b708f1/gistfile1.txt
https://gist.githubusercontent.com/jrick/4645316444c12cd815fb71874f6bdfc4/raw/bffac2a448b07242a538b77a2823c9db34b6ef6f/gistfile1.txt
https://gist.githubusercontent.com/jrick/3843b180670811069319e4122d32507a/raw/0d1f897aa25d91307b04ae951f1b260f33246b61/gistfile1.txt
https://gist.githubusercontent.com/jrick/99b7171c5a49b4b069edf06884ad8e17/raw/740c7b9e8fa64d9ad149fd2669df94e89c466927/gistfile1.txt
Additionally, I observed go run hanging (no runtime failure due to deadlock) and it had to be killed with SIGABRT to get a trace: https://gist.githubusercontent.com/jrick/d4ae1e4355a7ac42f1910b7bb10a1297/raw/54e408c51a01444abda76dc32ac55c2dd217822b/gistfile1.txt

It may not matter which regression test is run, as the errors also occur in run.go.