runtime: hang after concurrent panic from two threads after apparent memory corruption #57420
Comments
I think as long as there is memory corruption, the default analysis of this bug is to blame everything on that. It's also possible that the Go runtime for illumos has a bug in it because it is undertested, but given memory corruption, nothing is certain. Also, go1.17 is kinda old and unsupported; you might try tip. We do have an illumos-amd64 builder and it is looking good ( https://build.golang.org/ ). We don't have recent coverage for 1.19, 1.18 or golang.org/x (not failures, just no runs), so I cannot say for sure about those. CC @golang/illumos (yes, I know this is just one guy at Oxide; I looked first to be sure it wasn't empty like /solaris). Not sure what the process is for more illumos builders; I can ask today (we also have a 1.19 builder coverage problem for loong64 that just bit us this week, so, ugh).
I took a look at what is going on:
There is a lock ordering consistency problem here between
cc @golang/runtime
Another option may be to have
Thanks @prattmic for taking a look. I'm glad it turned out to be worthwhile (i.e., sounds like a real issue here).
@dr2chase Thanks. I'll look into the question of illumos builders for those recent releases.
The instructions for adding a builder are hard to find using search engines; they're in the Wiki here: https://github.com/golang/go/wiki/DashboardBuilders
@dr2chase Thanks. So we do already have one, but you mentioned it's not running several recent releases. Is there some config that needs to be changed somewhere to cover those too?
I am not good at builders, but there is absolutely a config that needs to be changed somewhere. There's only one builder; a second one would be nice.
What version of Go are you using (go version)?
This is the cockroach binary, running the simple cockroach version --build-tag command. A successful invocation prints:
That same Go version on my system prints:
Does this issue reproduce with the latest release?
Unknown -- I've only ever seen it once.
What operating system and processor architecture are you using (go env)?
I'm not positive that this is the same build that was used to build the cockroach binary, but I suspect it is:
go env Output
What did you do?
Ran cockroach version --build-tag
What did you expect to see?
The output above, showing cockroach version information
What did you see instead?
No output. The command hung. I saved a core file and have a lot more information about it though!
More details
I spent some time digging into this under oxidecomputer/omicron#1876. The punchline is that this system suffered from a bug where we weren't preserving %ymm0 across signal handlers, which results in memclrNoHeapPointers not properly zeroing address ranges. In this specific case, from the core file, it looks like most of one of the GC bits arenas is not properly cleared.
There are two threads panicking in my core file. Thread 1 is panicking on "sweep increased allocation count", which is a known possible consequence of this %ymm0 issue. Thread 14 is panicking on reportZombies, another known possible consequence of the same issue. For details on how this corruption can cause these failures, see oxidecomputer/omicron#1146.
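To make the failure mode concrete, here is a minimal sketch of the kind of check that could expose incomplete zeroing. It is not the reproducer from oxidecomputer/omicron#1876; the helper names (firstNonZero, clearAndCheck), the choice of SIGUSR1, the buffer size, and the iteration count are all assumptions made for illustration. It relies on the Go compiler typically lowering the range-and-assign-zero loop on a byte slice to a runtime memclr call, which may vary by compiler version. On a system that preserves vector registers across signal handlers, it would be expected to report no failing iterations.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// firstNonZero returns the index of the first non-zero byte in buf,
// or -1 if every byte is zero. (Hypothetical helper.)
func firstNonZero(buf []byte) int {
	for i, b := range buf {
		if b != 0 {
			return i
		}
	}
	return -1
}

// clearAndCheck dirties buf, clears it, and reports the first byte
// that was not cleared, or -1 if the clear was complete.
func clearAndCheck(buf []byte) int {
	for i := range buf {
		buf[i] = 0xff
	}
	// Assumption: the compiler recognizes this idiom and emits a
	// memclr call rather than a byte-by-byte loop.
	for i := range buf {
		buf[i] = 0
	}
	return firstNonZero(buf)
}

func main() {
	// Install a handler so SIGUSR1 is delivered to the process
	// instead of terminating it, then drain the notifications.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
		}
	}()

	// Keep interrupting ourselves while the main loop clears memory.
	go func() {
		for {
			_ = syscall.Kill(os.Getpid(), syscall.SIGUSR1)
			time.Sleep(100 * time.Microsecond)
		}
	}()

	buf := make([]byte, 1<<20)
	failures := 0
	for i := 0; i < 200; i++ {
		if idx := clearAndCheck(buf); idx >= 0 {
			failures++
			fmt.Printf("iteration %d: byte %d not cleared\n", i, idx)
		}
	}
	fmt.Printf("done: %d failing iterations\n", failures)
}
```

Whether a signal actually lands in the middle of a clear depends on timing, so a clean run of this sketch does not prove the system is unaffected.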
So at this point, we've got good reason to believe there's memory corruption here. However, I don't know how to tell if this is why Go ultimately hung. I figured I'd file this and let y'all decide how far it's worth digging into this. There could be a legit issue here related to hang during concurrent panic.
Stacks:
I'm happy to make the core file available, but I'm not sure of the best way to do that.