runtime: can not read stacktrace from a core file #25218
I assume this is all with GOTRACEBACK=crash. GDB is breaking at the first SIGABRT, which isn't what actually kills the process:
So that much makes sense. Is the real complaint here that the application's stack trace doesn't appear in the core dump? |
Yes. I would expect some goroutine's stack to have a frame for main.main and main.crash in it, at least. |
Well, it would still be on the goroutine's stack, but the thread is on the signal handler stack when it dies, not the goroutine stack. And IIRC Austin said there's no way to convince GDB to show you a different stack for a thread while handling a core file without writing a custom backtracer and even then it was very hard or didn't work or something. My inclination would be to deregister the signal handler for sig in dieFromSignal. If the first signal kills the process then you'd get the first stack in the core file, which is what we want. But all this stuff is incredibly subtle, so there's probably a reason that wouldn't work, maybe some cgo interop thing. Does Delve do any better than GDB here? In principle it could show the goroutine stack along with gsignal. |
It used to do better, but it doesn't anymore after that commit. In particular:
doesn't look true to me. What it looks like to me is that the thread is running on the normal stack of goroutine 1; if it were a signal-handling stack it would have goid == 0, right? Also, how does sigfwdgo call dieFromSignal? |
Where are you seeing goroutine 1 that you trust? The panic backtrace shows goroutine 1, but that happens before dieFromSignal. gdb shows 1 in info goroutines, but it combines scheduler information with thread state in a way that'll just end up showing whatever the thread's up to, not the user code that was running for goroutine 1.
Line 637 in 28b40f3
or am I missing something? I'm probably out of my depth at this point. Hopefully Austin or Elias can weigh in. |
I'm seeing it while debugging delve, the TLS for the first thread contains a pointer to the same g struct that's in the first entry of runtime.allgs. And it has goid == 1.
oh I didn't see that. |
If you want to handle this case I think you have to. This area of the code is responsible for handling signals that might be caused by C code, so it can't blindly muck with Go stuff until it knows what's up. The setg call you hoped had run is here: Line 343 in 28b40f3
and only runs if sigfwdgo doesn't do anything. sigtrampgo checks the stack pointer to decide if gsignal or g0 is running: Lines 307 to 308 in 28b40f3
(I still kinda think clearing out the signal handler in dieFromSignal is reasonable.) |
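For reference, the decision being described boils down to a range check on the signal handler's stack pointer. A minimal sketch of that idea (the type and function names here are illustrative, not the runtime's actual code, which lives in sigtrampgo and also has to cope with cgo and g0):

```go
package main

import "fmt"

// stack is a [lo, hi) address range, loosely mirroring how the runtime
// describes a goroutine or signal stack. Illustrative only.
type stack struct {
	lo, hi uintptr
}

// onStack reports whether sp falls inside st. sigtrampgo is described above
// as doing essentially this against gsignal's stack bounds to decide which
// stack the handler was entered on.
func onStack(sp uintptr, st stack) bool {
	return st.lo <= sp && sp < st.hi
}

func main() {
	gsignal := stack{lo: 0xc000100000, hi: 0xc000108000} // made-up bounds
	fmt.Println(onStack(0xc000104ff0, gsignal))          // true: on the signal stack
	fmt.Println(onStack(0xc000200120, gsignal))          // false: somewhere else
}
```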
Ok, the thread's sp is actually inside gsignal's stack. Do you know where the sp for the normal goroutine stack is saved in this case? g.sched.sp is zero. |
That's a good point. The goroutine wasn't preempted normally, so nothing in the runtime will know what its stack pointer was. The only place you could find it is the ctx argument to the signal handler (the thing we wrap a sigctxt around), and even that will be awkward to interpret in the case of a chain of signal handlers. |
I think the concern here is if the signal needs to be forwarded. E.g., if you have some SIGABRT-trapping crash handling service installed, we want the signal to get forwarded to it. Maybe dieFromSignal could check if we're forwarding it (which we never will be in pure Go programs) and, if not, go straight to its fallback path that clears the handler and raises? |
It is not so crazy to use the signal context to figure out where a signal handler was invoked. That is what gdb does. The key point is to reliably determine whether you are running in a signal handler. I think gdb does that by recognizing the signal trampoline that is on the return stack, which resumes normal execution if the signal handler returns. |
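To make the trampoline recognition concrete: on linux/amd64 the return address left on the stack points at a tiny restorer stub that just invokes rt_sigreturn, and a debugger can pattern-match its bytes. A minimal sketch, assuming linux/amd64; the function and variable names are illustrative and not any debugger's actual API, and real unwinders pair a check like this with DWARF CFI rather than relying on it alone:

```go
package main

import (
	"bytes"
	"fmt"
)

// rtSigreturnStub is the linux/amd64 signal-return trampoline:
//
//	mov $0xf, %rax   ; 0xf == __NR_rt_sigreturn
//	syscall
//
// Finding these bytes at a frame's return address indicates the frame below
// is a signal handler and a saved context follows on the stack.
var rtSigreturnStub = []byte{0x48, 0xc7, 0xc0, 0x0f, 0x00, 0x00, 0x00, 0x0f, 0x05}

// looksLikeSigreturn reports whether the code bytes read from the return
// address match the trampoline.
func looksLikeSigreturn(code []byte) bool {
	return len(code) >= len(rtSigreturnStub) &&
		bytes.Equal(code[:len(rtSigreturnStub)], rtSigreturnStub)
}

func main() {
	fmt.Println(looksLikeSigreturn(rtSigreturnStub))                // true
	fmt.Println(looksLikeSigreturn([]byte{0x55, 0x48, 0x89, 0xe5})) // false: push %rbp; mov %rsp,%rbp
}
```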
I don't think the chained signal handlers are a problem. We don't manipulate the context when calling down the chain. I was thinking we might be able to stash this away in |
I don't think we need to do anything in |
If we don't forward the signal, we potentially bypass crash reporters (particularly on iOS and Android) and skew people's crash metrics. But I think I'm missing something here:
The saved base pointer in the root frame of the signal stack should point back to the goroutine stack, right? (Do we know why the debuggers aren't following that base pointer? Am I just confused?) |
Since that was sort of addressed to me, I'll say that looking with GDB I think you're right that following the base pointer chain gets you to the right stack, and that seems better than using the signal handler context. Based on Stack Overflow I guess GDB only follows the base pointers when it's out of options. |
Following the base pointer will presumably only work when the base pointer is used by the code being interrupted, so it will be unreliable if the signal fires while executing C code compiled with
I think gdb also won't follow a frame pointer if it points to a smaller address, to avoid walking corrupted stacks or going in cycles. But why doesn't it unwind through the signal context? That I would expect to work here. |
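For illustration, a debugger-side sketch of following the saved-base-pointer chain out of a core file, including the "never walk to a smaller address" guard mentioned above. The memReader interface, fakeCore type, and fixed 8-byte frame layout (saved caller BP at [bp], return PC at [bp+8]) are assumptions for the sketch; real unwinders prefer DWARF CFI and fall back to this only when they must:

```go
package main

import "fmt"

// memReader abstracts reading 8-byte words out of the core file's memory image.
type memReader interface {
	ReadUint64(addr uint64) (uint64, error)
}

// walkFP follows the amd64 frame-pointer chain starting at bp. It stops on a
// read error, a nil BP, a non-increasing BP (cycle/corruption guard), or
// after max frames, and returns the return PCs it found.
func walkFP(mem memReader, bp uint64, max int) []uint64 {
	var pcs []uint64
	for i := 0; i < max && bp != 0; i++ {
		ret, err := mem.ReadUint64(bp + 8) // return address sits above the saved BP
		if err != nil {
			break
		}
		pcs = append(pcs, ret)
		next, err := mem.ReadUint64(bp) // saved caller BP
		if err != nil || next <= bp {
			break
		}
		bp = next
	}
	return pcs
}

// fakeCore is a toy memory image for the usage example below.
type fakeCore map[uint64]uint64

func (f fakeCore) ReadUint64(addr uint64) (uint64, error) {
	v, ok := f[addr]
	if !ok {
		return 0, fmt.Errorf("unmapped address %#x", addr)
	}
	return v, nil
}

func main() {
	// Two fake frames: bp=0x1000 links to caller bp=0x2000, which ends the chain.
	core := fakeCore{
		0x1000: 0x2000, 0x1008: 0x401234, // frame 0: saved BP, return PC
		0x2000: 0x0000, 0x2008: 0x405678, // frame 1: saved BP (nil), return PC
	}
	fmt.Printf("%#x\n", walkFP(core, 0x1000, 32)) // [0x401234 0x405678]
}
```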
Maybe it's because you aren't telling it where you saved rdx on the stack? |
Never mind, I see now that x86_64_fallback_frame_state always knows how to get the context. My next guess is that, if gdb's documentation is correct, x86_64_fallback_frame_state is only used when no FDE is found, which is not the case for runtime.sigreturn. |
@aarzilli, did you want anything else here? |
I'm satisfied by blacklisting runtime.sigreturn's FDE on delve's side, I've left this issue open because I suspect it would be useful to other debuggers to remove it entirely from the executable. |
Change https://go.dev/cl/479096 mentions this issue: |
With https://go.dev/cl/479096, backtrace from a core looks like so:
|
@aarzilli based on the prior comment, I'm not sure if you expect delve to work better or not, but it (mostly?) does not.
Interestingly, the main thread works ok (except 2 frame 15s?):
But other threads don't:
This seems to be using
After https://go.dev/cl/479096, behavior is mostly the same, except it chokes a bit more on sigreturn__sigaction:
|
Change https://go.dev/cl/479557 mentions this issue: |
https://go.dev/cl/479096 fixed 386 and amd64, but all other architectures I've checked (arm64, mips64, riscv) are still broken (Silent skips mask some of these failures from the dashboard). Reopening. These other architectures have sigreturn defined in the VDSO, and GDB does seem to have appropriate matching logic for the VDSO, so I suspect we aren't handling the frame correctly in |
Change https://go.dev/cl/479518 mentions this issue: |
For #25218.

Change-Id: I4024a2064e0f56755fe40eb7489ba28eb4358c60
Reviewed-on: https://go-review.googlesource.com/c/go/+/479518
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@prattmic I'm looking into fixing dlv's stacktraces on go1.21 given the changes that have been made. As long as gdb can generate a good stacktrace, assume it's a delve problem. However I don't understand the stacktraces you have posted: how was the core file generated? |
PS. sorry for the delay in responding to this and not saying anything in the CL, I didn't have time last week. |
You can use a simplified program from the test I added:
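The test program itself isn't reproduced in this copy of the thread; a hypothetical stand-in that produces a comparable core (built normally, run with GOTRACEBACK=crash and ulimit -c unlimited, then loaded with gdb or dlv core) might look like this:

```go
// Hypothetical stand-in, not the program actually added to the runtime
// tests. It keeps a few goroutines busy so the core contains several
// interesting threads, then crashes.
package main

import (
	"runtime"
	"sync"
)

func spin(wg *sync.WaitGroup) {
	wg.Done()
	for {
		// Burn CPU so this goroutine is likely running on its own
		// thread when the crash happens.
	}
}

func main() {
	runtime.GOMAXPROCS(4)
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go spin(&wg)
	}
	wg.Wait()
	// With GOTRACEBACK=crash the runtime aborts (SIGABRT) after printing
	// the traceback, and the kernel writes a core dump if core dumps are
	// enabled.
	panic("crash for core dump")
}
```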
and other threads:
There is a lot going on here. Some of the highlights:
The reason that they are all in SIGQUIT handlers is that we send SIGQUIT to every thread before crashing. This is used to print the per-thread stack dump at the end of the panic output (thread dump example omitted here). Once the last thread gets SIGQUIT, (FWIW, this is a bit overly complex; the first dieFromSignal could tell that there is nowhere to forward the signal to and just die directly) |
It doesn't look like, in this case, gdb's output is very good: none of the threads have a frame with main.main, which you would expect to see. I'm getting the same with delve with my local improvements btw. |
Container-optimized OS sets kernel.core_pattern = "|/bin/false", effectively disabling core dump creation regardless of RLIMIT_CORE. We have tests that want to analyze core dumps, so reset core_pattern back to the default value on boot.

For golang/go#25218.

Change-Id: I7e3cc7496a5428326855cc687b87cb4da76fdd66
Reviewed-on: https://go-review.googlesource.com/c/build/+/479557
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Presumably this is because the goroutine that executes main.main isn't running on any thread, but if I stacktrace goroutine 1 with delve I get this:
which is still wrong, but not obviously delve's fault. |
That can be expected. I actually just merged https://go.dev/cl/478975 to optionally disable this because it makes debugging the scheduler more painful when panic scribbles all over scheduler state. In #25218 (comment), thread 5 actually does show main.main. I suspect this is because that thread happened to be the one to receive the original SIGQUIT. |
@prattmic uh, you're right, I didn't see it. I have the same thing in my core file and delve does the same thing as gdb. I was expecting main.main to be on a thread that's running a goroutine, but it isn't. The state of the process is pretty weird when taking a core dump like this. |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
This problem was introduced by b1d1ec9 (CL 110065) and is still present at tip.

Does this issue reproduce with the latest release?
No.

What operating system and processor architecture are you using (go env)?

What did you do?
Given:
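(The program is not preserved in this copy of the issue; based on the main.main and main.crash frames discussed earlier in the thread, a minimal hypothetical stand-in would be:)

```go
// Hypothetical stand-in for the original reproducer; the issue's actual
// program is not shown here. crash() triggers a fatal signal so that, with
// GOTRACEBACK=crash and core dumps enabled, the process aborts and leaves a
// core file.
package main

func crash() {
	var p *int
	*p = 0 // nil-pointer write: SIGSEGV, fatal panic
}

func main() {
	crash()
}
```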
Running it under gdb will produce this stacktrace:
however letting it produce a core file then reading the core file with gdb produces this:
I'm not sure what's happening here. Is the signal handler running and overwriting part of the stack?