runtime: frequent GC related asserts seen on illumos based OSes (but only on AMD!) #53289
Comments
I found #49209 (frequent memory corruption on NetBSD and OpenBSD) that says:
and
and links to https://gnats.netbsd.org/56535 - I'm trying to get forkstress.c running on one of my illumos based systems now to see if that's related. |
Thanks for the report and for looking into it! It would be interesting if Solaris has a similar issue to the BSD bug. We saw some GC-related failures (like cc @golang/solaris @golang/runtime |
I haven't been able to get the |
FWIW, the https://github.com/prattmic/go-bsd-corruption-issue49209 contains a similar Go program I used prior to getting the C version working (you'll need new assembly, of course). That said, you show a failure in TestSO above. That program doesn't fork, which seems to indicate that if this is related to fork it would be the parent corrupting the child rather than the other way around. FreeBSD had a memory corruption bug due to an incorrect XSAVE missing some AVX state: #46272 (comment). No particular reason for this specific issue to be the same on illumos, but something to look into. |
I don't think illumos has the same problem: there are 256 bytes of "ymm" size reserved here. Trying out the linked go-bsd-corruption-issue49209 repro now. |
I believe Go will use libc for OSes that are not Linux, so I rewrote the go-bsd-corruption-issue49209 code in Rust to use libc:

[dependencies]
libc = "0.2.126"
num_cpus = "1.13.1"

use std::thread;

fn main() {
    // Spawn many threads that each fork and reap children in a tight loop.
    for _ in 0..(num_cpus::get() * 100) {
        thread::spawn(move || {
            loop {
                let pid = unsafe { libc::fork() };
                if pid == 0 {
                    // child
                    unsafe { libc::exit(0); }
                } else {
                    // parent
                }
                unsafe { libc::waitpid(pid, std::ptr::null_mut(), 0); }
            }
        });
    }
    // Keep the main thread busy making calls into libc.
    loop {
        unsafe { libc::getpid(); }
    }
}

This runs ok under illumos. |
cc @golang/runtime @golang/illumos |
Just a note that Go will use libc on systems that require it, which includes Solaris/Illumos. Go generally avoids libc on systems that permit direct syscalls, such as Linux and NetBSD and FreeBSD. |
@jmpesp would it be possible to provide a few full logs of the failure? It is possible that the logs contain the addresses that can give a hint to the bug (not always, but sometimes helpful). Thanks. |
Sure thing:
|
Here's another:
|
Note I hit a problem on a physical OmniOS machine in the first part of the reproduction script that builds If I set the environment to:
the build fails with:
|
I was curious about this, so I installed FreeBSD 13 and NetBSD 9.2 on physical machines and ran the reproduction script. FreeBSD is not hitting any errors, but NetBSD failed on the first loop iteration with
|
Any updates on this? |
Is there more information that would be useful for debugging this? We're still seeing similar runtime failures and would love to get this understood. |
I tried to reproduce this using @jmpesp's script. After a few hours, it failed with:
I had set This is on an AMD Ryzen 7 3700X 8-core system (16 software cores). |
@davepacheco if you have time to dig in to this again, I'd recommend testing something like 1.19.2. Also, it's very very very unlikely to be related, but just to eliminate a possible culprit, are you running on an illumos kernel that includes the fix for https://www.illumos.org/issues/14898 ? |
@nshalman Thanks. My previous run was on a kernel that predated that fix. I updated to one that contains that fix and I'm still seeing similar issues. Thanks for the heads-up. On using 1.19.2: is that on general principles of trying on latest, or is there a particular fix that you're thinking of that might be related? |
General principles; as I understand it, 1.17 is no longer supported. I was able to trip some failures on my AMD machine with 1.19.2 so I agree that something is still broken. |
Yeah, I just ran with 1.19.2 and got:
|
I have this problem too; it happens every day. I use go1.17.13.linux-armv6l.tar.gz on Yogurt (Phytec Example Distribution) BSP-Yocto-i.MX6-PD18.1.2. Cgo is used in the project. Is it because of cgo?
runtime: s.allocCount= 340 s.nelems= 341
goroutine 17 [running, locked to thread]:
runtime: s.allocCount= 511 s.nelems= 512
goroutine 69009 [running]: |
@sixhj Can you please file a new issue including the following information?
Thanks! |
I think there's a good chance that @jmpesp and I were seeing a combination of https://www.illumos.org/issues/15367 and https://www.illumos.org/issues/15367. There's a detailed analysis of how those illumos issues can result in this behavior in oxidecomputer/omicron#1146. We've had fixes available for those illumos issues for some time now, and I finally went back and tried to reproduce this using James's script above. It's been running for 1d15h without a crash. I did run into one hang after 4h during the first attempt. I didn't dig too deeply into it. I've no particular reason to suspect it's related, and it is Go 1.17, and I can't spend a lot of time on it right now. It hung at this point:
The leaf process is |
What version of Go are you using (go version)?
This problem reproduces for several versions of 1.16, release 1.17, and release 1.18.
Does this issue reproduce with the latest release?
Yes (go1.18.3 solaris/amd64)
What operating system and processor architecture are you using (go env)?
We've seen errors on AMD Ryzen and EPYC, running OmniOS, and have not been able to reproduce on Xeon systems.
What did you do?
The errors we saw are in the "Testing packages" phase of all.bash. To reproduce from a fresh OmniOS system, the following script was run:
See #45775 (comment) for instructions on reproducing this on OmniOS under qemu.
What did you expect to see?
ALL TESTS PASSED
What did you see instead?
Eventually, the while loop in that reproduction script fails. We've seen the following:
fatal error: s.allocCount != s.nelems && freeIndex == s.nelems
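To make the message easier to read, here is a toy model of the invariant it names. This is a simplified sketch, not the runtime's actual code: the `span` type, `newSpan`, and `alloc` are hypothetical stand-ins. The idea, as the error text suggests, is that when the scan for a free slot runs off the end of a span (`freeIndex == s.nelems`), the count of allocated objects is expected to equal the number of slots. If something has marked a slot as used without the count being updated (simulated below by a stray write), the two disagree.

```go
package main

import "fmt"

// span is a toy stand-in for an allocator span: a fixed number of
// object slots, a record of which are in use, and a running count.
type span struct {
	nelems     int
	allocCount int
	inUse      []bool
	freeIndex  int
}

func newSpan(n int) *span { return &span{nelems: n, inUse: make([]bool, n)} }

// nextFreeIndex scans forward from freeIndex for an unused slot and
// returns nelems when none is left.
func (s *span) nextFreeIndex() int {
	for i := s.freeIndex; i < s.nelems; i++ {
		if !s.inUse[i] {
			return i
		}
	}
	return s.nelems
}

// alloc hands out one slot. It returns an error in the situation the
// fatal error describes: the scan says "full" but the count disagrees.
func (s *span) alloc() (int, error) {
	idx := s.nextFreeIndex()
	if idx == s.nelems {
		if s.allocCount != s.nelems {
			return -1, fmt.Errorf("s.allocCount=%d s.nelems=%d", s.allocCount, s.nelems)
		}
		return -1, nil // genuinely full
	}
	s.inUse[idx] = true
	s.freeIndex = idx + 1
	s.allocCount++
	return idx, nil
}

func main() {
	s := newSpan(4)
	s.inUse[3] = true // simulated stray write: slot marked used, count untouched
	for {
		idx, err := s.alloc()
		if err != nil {
			fmt.Println("invariant violated:", err)
			return
		}
		if idx < 0 {
			fmt.Println("span full")
			return
		}
	}
}
```

In this model the count ends up one short of the slot count, which has the same shape as the `s.allocCount= 340 s.nelems= 341` lines reported in the comments.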