syscall: memory corruption when forking on OpenBSD, NetBSD, AIX, and Solaris #34988
Comments
I missed that 1.13.3 was also released yesterday. Currently updating to that and will report whether this is still an issue.
This looks like cmd/go crashing while building the test, not the test itself.
@jrick maybe you meant this in your original post, but I just want to be clear. Does this reproduce with Go 1.12.x or older versions of Go? Since we have a reasonable reproducer, the next step to me would be to just bisect what went into Go 1.13, if we know it isn't reproducing in Go 1.12. I genuinely have no idea what this could be. I thought at first that it could be scavenging related, but that's highly unlikely for a number of reasons. I won't rule it out yet, though.
I haven't tested 1.12.x but will follow up testing that next. Currently hammering this test with 1.13.3, and so far it has not failed, but my application built with 1.13.3 still fails with SIGBUS (could be unrelated).
@mknyszek it still hasn't failed on 1.13.3 (running close to an hour now) but quickly failed on 1.12.12.
1.13.3 finally errored after an hour. More errors from 1.13.3:
This remains a problem in 1.13.5, so it's not addressed by the recent fixes to the go tool. |
This may be fork/exec related. This program exhibits similar crashes on OpenBSD 6.7 and Go 1.14.3.
Crash trace: https://gist.github.com/jrick/8d6ef72796a772668b891310a18dd805
Synchronizing the os/exec call with an additional mutex appears to remove the crash.
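(As an illustration of that workaround, and not the reporter's actual program, serializing the exec calls behind a mutex might look like the sketch below; the names here are mine.)

```go
package main

import (
	"os/exec"
	"sync"
)

// execMu serializes every fork/exec; illustrative name, not from the
// original program.
var execMu sync.Mutex

func runTrue() error {
	execMu.Lock()
	defer execMu.Unlock()
	return exec.Command("/usr/bin/true").Run()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				if err := runTrue(); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()
}
```

If the root cause is the COW/TLB race described later in this thread, the mutex presumably only narrows the window for concurrent faults during fork rather than fixing anything.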
Thanks for the stack trace. That looks very much like a forked child process is changing the memory seen by the parent process, which should of course be impossible. Specifically it seems that
I'm seeing another strange thing in addition to that crash. Sometimes the program will run forever, spinning CPU, but appears to be deadlocked, because none of the pids of those true processes are ever changing. Here's the trace after sending SIGQUIT: https://gist.github.com/jrick/74aaa63624961145b7bc7b9518da75e1
I am currently testing with this OpenBSD kernel patch to the virtual memory system: https://marc.info/?l=openbsd-tech&m=160008279223088&w=2 However, these crashes still persist. Another interesting data point: so far it appears that this only reproduces on AMD Ryzen CPUs, and not any Intel ones.
https://build.golang.org/log/3f45171bc52a0a86435abb9f795c0e8a45c4a0b0 looks similar:
https://storage.googleapis.com/go-build-log/abee19ae/openbsd-amd64-68_0f13ec3d.log (a TryBot) looks like it could plausibly be from a fork syscall. |
I'm not sure when this changed, but since returning to this issue I haven't been able to reproduce with my minimal test case again on the same hardware with OpenBSD 7.0-current and Go 1.17.3. I suspect it's due to some OpenBSD fix if the 6.8 builders are still hitting this. (Also, 6.8 is no longer a supported OpenBSD version; I don't think it makes much sense to continue testing with it.)
Spoke too soon:
It took far longer than with 1.17.3, but a very similar crash (in scanstack) still occurs with
I can also reproduce crashes on netbsd-386 and netbsd-amd64 with #34988 (comment) on AMD, of the form:
as well as #49453
Some observations I've made (from netbsd-amd64): The crashes still seem to occur with GOMAXPROCS=1; however, Go still has some background threads in this case. Disabling sysmon and GC makes this program truly single-threaded:
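(The runtime patch itself isn't quoted above. Purely as a sketch of the user-visible half of that configuration, and not the referenced patch, GC can be disabled and GOMAXPROCS pinned from the program itself; sysmon has no user-level switch, so disabling it does require modifying the runtime.)

```go
package main

import (
	"os/exec"
	"runtime"
	"runtime/debug"
)

func main() {
	runtime.GOMAXPROCS(1)  // one P, though runtime background threads still exist
	debug.SetGCPercent(-1) // disable the garbage collector
	// sysmon cannot be turned off from user code; the patch referenced above
	// disables it inside the runtime (not shown here).

	for {
		if err := exec.Command("/usr/bin/true").Run(); err != nil {
			panic(err)
		}
	}
}
```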
Once the program is truly single-threaded, the crashes disappear. Setting GOMAXPROCS=2 with this patch brings the crashes back. Here is a slightly simplified reproducer:

```go
package main

import (
	"os/exec"
	"runtime"
)

func main() {
	go func() {
		for {
			err := exec.Command("/usr/bin/true").Run()
			if err != nil {
				panic(err)
			}
		}
	}()

	for {
		runtime.Gosched()
	}
}
```

This version has only a single forker, but crashes about as quickly. The (cc @aclements @mknyszek)
More observations:
I've simplified that repro even further:

```go
package main

import (
	//"runtime"
	"syscall"
)

// fork has no Go body here; it is implemented in a separate assembly file
// (not shown in this excerpt).
func fork() int32

func main() {
	go func() {
		for {
			pid := fork()
			syscall.Syscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
			//syscall.RawSyscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
		}
	}()

	for {
		syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
		//runtime.Gosched()
	}
}
```
The key parts here:
The crashes I get with this look like (source):
This is complaining that the assertion
The one case I've caught in GDB looks like (stopped just inside the failing branch):
From the assembly,
Of course, I can't really tell if that memory location read as zero, or if the register was cleared after the load somehow.
Change https://go.dev/cl/439196 mentions this issue:
This cuts the wall duration for 'go test os/exec' and 'go test -race os/exec' roughly in half on my machine, which is an even more significant speedup with a high '-count'. For better or for worse, it may also increase the repro rate of #34988.

Tests that use Setenv or Chdir or check for FDs opened during the test still cannot be parallelized, but there are only a few of those.

Change-Id: I8d284d8bff05787853f825ef144aeb7a4126847f
Reviewed-on: https://go-review.googlesource.com/c/go/+/439196
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
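(A brief aside on the parallelization constraint mentioned in that change: testing.T.Setenv is documented as unusable in parallel tests and panics if t.Parallel has been called, which is why those tests stay serial. A minimal sketch with made-up test names:)

```go
package exec_test

import "testing"

// TestCanBeParallel is a hypothetical test that touches no process-wide
// state, so it can opt in to parallel execution.
func TestCanBeParallel(t *testing.T) {
	t.Parallel()
	// ... exercise exec.Command here ...
}

// TestNeedsEnv is a hypothetical test that mutates the environment; adding
// t.Parallel here would make the t.Setenv call panic.
func TestNeedsEnv(t *testing.T) {
	t.Setenv("SOME_VAR", "value")
	// ... assertions that depend on SOME_VAR ...
}
```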
While debugging oxidecomputer/omicron#1146 I saw that this bug mentions Solaris and wondered if it might affect illumos as well, since the failure modes look the same for my issue. For the record, I don't think my issue was caused by this one. I ran the Go and C test programs for several days without issue, and I ultimately root-caused my issue to illumos#15254. I mention this in case anyone in the future is wondering if illumos is affected by this. I don't know whether Solaris (or any other system) has the same issue with preserving the %ymm registers across signal handlers, but that can clearly cause the same failure modes shown here. |
Found new dashboard test flakes for:
2023-01-06 17:30 netbsd-amd64-9_3 tools@36bd3dbc go@476384ec x/tools/gopls/internal/regtest/workspace.TestReloadOnlyOnce (log)
sys/uvm/uvm_fault.c: revision 1.234

uvm: prevent TLB invalidation races during COW resolution

When a thread takes a page fault which results in COW resolution, other threads in the same process can be concurrently accessing that same mapping on other CPUs. When the faulting thread updates the pmap entry at the end of COW processing, the resulting TLB invalidations to other CPUs are not done atomically, so another thread can write to the new writable page and then a third thread might still read from the old read-only page, resulting in inconsistent views of the page by the latter two threads.

Fix this by removing the pmap entry entirely for the original page before we install the new pmap entry for the new page, so that the new page can only be modified after the old page is no longer accessible.

This fixes PR 56535 as well as the netbsd versions of problems described in various bug trackers:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
https://reviews.freebsd.org/D14347
golang/go#34988
Found new dashboard test flakes for:
2024-03-20 14:17 netbsd-amd64-9_3 go@e39af550 cmd/go.TestScript (log)
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

What did you do?
I observed these issues in one of my applications, and assumed it was a race or invalid unsafe.Pointer usage or some other fault of the application code. When the 1.13.2 release dropped yesterday I built it from source and observed a similar issue running the regression tests. The failed regression test does not look related to the memory corruption, but I can reproduce the problem by repeatedly running the test in a loop:
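(The exact command isn't quoted in this excerpt. As a stand-in only, with a placeholder invocation rather than the one actually used, a Go driver that reruns a test command until it fails could look like this:)

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Placeholder invocation; substitute the actual regression-test command.
	args := []string{"go", "run", "run.go", "--", "fixedbugs"}
	for i := 1; ; i++ {
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintf(os.Stderr, "failed on iteration %d: %v\n", i, err)
			os.Exit(1)
		}
	}
}
```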
It can take several minutes to observe the issue, but here are some of the captured panics and fatal runtime errors:
https://gist.githubusercontent.com/jrick/f8b21ecbfbe516e1282b757d1bfe4165/raw/6cf0efb9ba47ba869f98817ce945971f2dff47d6/gistfile1.txt
https://gist.githubusercontent.com/jrick/9a54c085b918aa32910f4ece84e5aa21/raw/91ec29275c2eb1be49f62ad8a01a5317ad168c94/gistfile1.txt
https://gist.githubusercontent.com/jrick/8faf088593331c104cc0da0adb3f24da/raw/7c92e7e7d60d426b2156fd1bdff42e0717b708f1/gistfile1.txt
https://gist.githubusercontent.com/jrick/4645316444c12cd815fb71874f6bdfc4/raw/bffac2a448b07242a538b77a2823c9db34b6ef6f/gistfile1.txt
https://gist.githubusercontent.com/jrick/3843b180670811069319e4122d32507a/raw/0d1f897aa25d91307b04ae951f1b260f33246b61/gistfile1.txt
https://gist.githubusercontent.com/jrick/99b7171c5a49b4b069edf06884ad8e17/raw/740c7b9e8fa64d9ad149fd2669df94e89c466927/gistfile1.txt
Additionally, I observed go run hanging (no runtime failure due to deadlock) and it had to be killed with SIGABRT to get a trace: https://gist.githubusercontent.com/jrick/d4ae1e4355a7ac42f1910b7bb10a1297/raw/54e408c51a01444abda76dc32ac55c2dd217822b/gistfile1.txt

It may not matter which regression test is run, as the errors also occur in run.go.