runtime: can not read stacktrace from a core file #25218

Open
aarzilli opened this issue May 2, 2018 · 35 comments

@aarzilli
Contributor

aarzilli commented May 2, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

This problem was introduced by b1d1ec9 (CL 110065) and is still present at tip.

Does this issue reproduce with the latest release?

No.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/a/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/a/n/go/"
GORACE=""
GOROOT="/usr/local/go-tip"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go-tip/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build495464568=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Given:

package main

import "runtime/debug"

func main() {
	debug.SetTraceback("crash")
	crash()
}

func crash() {
	panic("panic!")
}

Running it under gdb will produce this stacktrace:

(gdb) bt
#0  0x000000000044fff4 in runtime.raise () at <autogenerated>:1
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at /usr/local/go-tip/src/runtime/signal_unix.go:424
#2  0x0000000000438dea in runtime.crash () at /usr/local/go-tip/src/runtime/signal_unix.go:526
#3  0x0000000000426487 in runtime.fatalpanic (msgs=<optimized out>) at /usr/local/go-tip/src/runtime/panic.go:696
#4  0x0000000000425e6b in runtime.gopanic (e=...) at /usr/local/go-tip/src/runtime/panic.go:502
#5  0x0000000000470ac9 in main.crash () at /home/a/temp/simple.go:11
#6  0x0000000000470a7b in main.main () at /home/a/temp/simple.go:7

However, letting it produce a core file and then reading the core file with gdb produces this:

$ gdb ./simple simple-core
...
(gdb) bt
#0  0x000000000044fff4 in runtime.raise () at <autogenerated>:1
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at /usr/local/go-tip/src/runtime/signal_unix.go:424
#2  0x00000000004390a8 in runtime.sigfwdgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=6, ~r3=<optimized out>) at /usr/local/go-tip/src/runtime/signal_unix.go:637
#3  0x0000000000438488 in runtime.sigtrampgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=<optimized out>) at /usr/local/go-tip/src/runtime/signal_unix.go:289
#4  0x00000000004502e3 in runtime.sigtramp () at <autogenerated>:1
#5  0x00000000004503d0 in ?? () at <autogenerated>:1
#6  0x0000000000000001 in ?? ()
#7  0x0000000000000000 in ?? ()

I'm not sure what's happening here. Is the signal handler running and overwriting part of the stack?

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

cc @eliasnaur @aclements

@andybons added the NeedsInvestigation label May 2, 2018
@andybons added this to the Unplanned milestone May 2, 2018
@heschi
Contributor

heschi commented May 2, 2018

I assume this is all with GOTRACEBACK=crash. GDB is breaking at the first SIGABRT, which isn't what actually kills the process:

(gdb) run
Starting program: panic 
panic: panic!
[snip]
Program received signal SIGABRT, Aborted.
runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
146		RET
(gdb) bt
#0  runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at go/src/runtime/signal_unix.go:424
#2  0x0000000000438dea in runtime.crash () at go/src/runtime/signal_unix.go:526
#3  0x0000000000426497 in runtime.fatalpanic (msgs=<optimized out>) at go/src/runtime/panic.go:696
#4  0x0000000000425e7b in runtime.gopanic (e=...) at go/src/runtime/panic.go:502
#5  0x00000000004709c9 in main.crash () at go/src/panic.go:11
#6  0x000000000047097b in main.main () at go/src/panic.go:7
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
146		RET
(gdb) bt
#0  runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at go/src/runtime/signal_unix.go:424
#2  0x00000000004390a8 in runtime.sigfwdgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=6, ~r3=<optimized out>) at go/src/runtime/signal_unix.go:637
#3  0x0000000000438488 in runtime.sigtrampgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=<optimized out>) at go/src/runtime/signal_unix.go:289
#4  0x00000000004502a3 in runtime.sigtramp () at go/src/runtime/sys_linux_amd64.s:353
#5  0x0000000000450390 in ?? () at go/src/runtime/sys_linux_amd64.s:437
#6  0x0000000000000007 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

Program terminated with signal SIGABRT, Aborted.
The program no longer exists.

So that much makes sense. Is the real complaint here that the application's stack trace doesn't appear in the core dump?

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

So that much makes sense. Is the real complaint here that the application's stack trace doesn't appear in the core dump?

Yes. I would expect some goroutine's stack to have a frame for main.main and main.crash in it, at least.

@heschi
Contributor

heschi commented May 2, 2018

Well, it would still be on the goroutine's stack, but the thread is on the signal handler stack when it dies, not the goroutine stack. And IIRC Austin said there's no way to convince GDB to show you a different stack for a thread while handling a core file without writing a custom backtracer and even then it was very hard or didn't work or something.

My inclination would be to deregister the signal handler for sig in dieFromSignal. If the first signal kills the process then you'd get the first stack in the core file, which is what we want. But all this stuff is incredibly subtle, so there's probably a reason that wouldn't work, maybe some cgo interop thing.

Does Delve do any better than GDB here? In principle it could show the goroutine stack along with gsignal.

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

Does Delve do any better than GDB here? In principle it could show the goroutine stack along with gsignal.

It used to do better, but it doesn't anymore after that commit. In particular:

Well, it would still be on the goroutine's stack, but the thread is on the signal handler stack when it dies, not the goroutine stack

doesn't look true to me. What it looks like to me is that the thread is running on the normal stack of goroutine 1; if it were a signal handling stack it would have goid == 0, right? Also, how does sigfwdgo call dieFromSignal?

@heschi
Contributor

heschi commented May 2, 2018

Where are you seeing goroutine 1 that you trust? The panic backtrace shows goroutine 1, but that happens before dieFromSignal. gdb shows 1 in info goroutines, but it combines scheduler information with thread state in a way that'll just end up showing whatever the thread's up to, not the user code that was running for goroutine 1.

Also, how does sigfwdgo call dieFromSignal?

dieFromSignal(sig)

or am I missing something?

I'm probably out of my depth at this point. Hopefully Austin or Elias can weigh in.

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

Where are you seeing goroutine 1 that you trust?

I'm seeing it while debugging delve: the TLS for the first thread contains a pointer to the same g struct that's in the first entry of runtime.allgs, and it has goid == 1.
I was under the impression that when a thread is running on one of those special stacks the pointer to g is changed accordingly; am I wrong? Should I check whether sp is inside [g.m.gsignal.stack.stacklo, g.m.gsignal.stack.stackhi]?

or am I missing something?

oh I didn't see that.

@heschi
Contributor

heschi commented May 2, 2018

I was under the impression that when a thread is running on one of those special stacks the pointer to g is changed accordingly; am I wrong? Should I check whether sp is inside [g.m.gsignal.stack.stacklo, g.m.gsignal.stack.stackhi]?

If you want to handle this case I think you have to. This area of the code is responsible for handling signals that might be caused by C code, so it can't blindly muck with Go stuff until it knows what's up. The setg call you hoped had run is here:

setg(g.m.gsignal)

and only runs if sigfwdgo doesn't do anything. sigtrampgo checks the stack pointer to decide if gsignal or g0 is running:
if sp < g.m.gsignal.stack.lo || sp >= g.m.gsignal.stack.hi {
if sp >= g.m.g0.stack.lo && sp < g.m.g0.stack.hi {
so that ought to be a decent model to follow.
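For illustration, a debugger-side version of that check could look roughly like this. StackBounds and the addresses are hypothetical stand-ins for whatever the debugger actually reads out of the core's g.stack, m.gsignal.stack and m.g0.stack fields; this is a sketch of the idea, not delve or gdb code.

package main

import "fmt"

// StackBounds describes one runtime stack as recovered from the core file.
type StackBounds struct{ Lo, Hi uint64 }

// classifyStack mirrors the runtime's check: decide whether a thread's stack
// pointer is on the signal stack, the scheduler (g0) stack, or the current
// goroutine's own stack.
func classifyStack(sp uint64, gsignal, g0, curg StackBounds) string {
	switch {
	case sp >= gsignal.Lo && sp < gsignal.Hi:
		return "gsignal"
	case sp >= g0.Lo && sp < g0.Hi:
		return "g0"
	case sp >= curg.Lo && sp < curg.Hi:
		return "goroutine"
	default:
		return "unknown (cgo stack or corruption?)"
	}
}

func main() {
	// Made-up addresses, purely to exercise the check.
	gsignal := StackBounds{Lo: 0xc000009000, Hi: 0xc00000a000}
	g0 := StackBounds{Lo: 0xc000050000, Hi: 0xc000058000}
	curg := StackBounds{Lo: 0xc000100000, Hi: 0xc000108000}
	fmt.Println(classifyStack(0xc000009ac0, gsignal, g0, curg)) // gsignal
}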

(I still kinda think clearing out the signal handler in dieFromSignal is reasonable.)

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

Ok, the thread's sp is actually inside gsignal's stack. Do you know where the sp for the normal goroutine stack is saved in this case? g.sched.sp is zero.

@heschi
Contributor

heschi commented May 2, 2018

That's a good point. The goroutine wasn't preempted normally, so nothing in the runtime will know what its stack pointer was. The only place you could find it is the ctx argument to the signal handler (the thing we wrap a sigctxt around), and even that will be awkward to interpret in the case of a chain of signal handlers.

@aclements
Member

I still kinda think clearing out the signal handler in dieFromSignal is reasonable.

I think the concern here is if the signal needs to be forwarded. E.g., if you have some SIGABRT-trapping crash handling service installed, we want the signal to get forwarded to it. Maybe dieFromSignal could check if we're forwarding it (which we never will be in pure Go programs) and, if not, go straight to its fallback path that clears the handler and raises?

cc @ianlancetaylor @bcmills

@ianlancetaylor
Member

It is not so crazy to use the signal context to figure out where a signal handler was invoked. That is what gdb does. The key point is to reliably determine whether you are running in a signal handler. I think gdb does that by recognizing the signal trampoline that is on the return stack, which resumes normal execution if the signal handler returns.

@aclements
Member

The only place you could find it is the ctx argument to the signal handler (the thing we wrap a sigctxt around), and even that will be awkward to interpret in the case of a chain of signal handlers.

I don't think the chained signal handlers are a problem. We don't manipulate the context when calling down the chain.

I was thinking we might be able to stash this away in g.sched.sp when we enter the signal handler, but there's still a window between switching to the signal stack and doing that, and it's possible we can receive a signal while g.sched.sp is set to something important.

@ianlancetaylor
Member

I don't think we need to do anything in dieFromSignal, I think that in crash we could clear the signal handler. At least at first glance I'm not worried about forwarding the signal once we are calling crash.

@bcmills
Contributor

bcmills commented May 2, 2018

If we don't forward the signal, we potentially bypass crash reporters (particularly on iOS and Android) and skew people's crash metrics.

But I think I'm missing something here:

the thread is on the signal handler stack when it dies, not the goroutine stack

The saved base pointer in the root frame of the signal stack should point back to the goroutine stack, right? (Do we know why the debuggers aren't following that base pointer? Am I just confused?)
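For concreteness, "following the base pointer" on amd64 just means chasing the saved-RBP chain: each frame stores the caller's RBP at [rbp] and the return address at [rbp+8]. A rough sketch of that walk is below; readMem is a hypothetical stand-in for the debugger's read-from-the-core primitive, and the addresses in main are made up to mimic the backtrace quoted above.

package main

import "fmt"

// walkFramePointers follows the saved-RBP chain. The walk stops when the
// chain leaves readable memory or stops moving toward higher addresses
// (a cheap guard against cycles and corrupted stacks, similar to the
// heuristic gdb reportedly applies).
func walkFramePointers(rbp uint64, readMem func(uint64) (uint64, bool), max int) []uint64 {
	var pcs []uint64
	for i := 0; i < max && rbp != 0; i++ {
		next, ok1 := readMem(rbp)      // caller's saved rbp
		retpc, ok2 := readMem(rbp + 8) // return address
		if !ok1 || !ok2 || retpc == 0 {
			break
		}
		pcs = append(pcs, retpc)
		if next <= rbp {
			break
		}
		rbp = next
	}
	return pcs
}

func main() {
	// A fake two-frame memory image: a signal-stack frame at 0x1000 whose
	// saved rbp points back to a goroutine-stack frame at 0x9000.
	mem := map[uint64]uint64{
		0x1000: 0x9000, 0x1008: 0x470ac9, // saved rbp, return pc (main.crash)
		0x9000: 0x0, 0x9008: 0x470a7b, // saved rbp (end), return pc (main.main)
	}
	read := func(addr uint64) (uint64, bool) { v, ok := mem[addr]; return v, ok }
	fmt.Printf("%#x\n", walkFramePointers(0x1000, read, 16))
}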

@heschi
Contributor

heschi commented May 2, 2018

Since that was sort of addressed to me, I'll say that looking with GDB I think you're right that following the base pointer chain gets you to the right stack, and that seems better than using the signal handler context. Based on Stack Overflow I guess GDB only follows the base pointers when it's out of options.

@ianlancetaylor
Member

Following the base pointer will presumably only work when the base pointer is used by the code being interrupted, so it will be unreliable if the signal fires while executing C code compiled with -fomit-frame-pointer.

@aclements
Member

I think gdb also won't follow a frame pointer if it points to a smaller address, to avoid walking corrupted stacks or going in cycles.

But why doesn't it unwind through the signal context? That I would expect to work here.

@aarzilli
Contributor Author

aarzilli commented May 3, 2018

Maybe it's because you aren't telling it where you saved rdx on the stack?

@aarzilli
Contributor Author

aarzilli commented May 3, 2018

Never mind, I see now that x86_64_fallback_frame_state always knows how to get the context. My next guess is that, if gdb's documentation is correct, x86_64_fallback_frame_state is only used when no FDE is found, which is not the case for runtime.sigreturn.
This was also the problem for delve: if I ignore runtime.sigreturn's FDE it will just follow the frame pointer and end up back on the right stack.

@heschi
Contributor

heschi commented May 25, 2018

@aarzilli, did you want anything else here?

@aarzilli
Contributor Author

I'm satisfied with blacklisting runtime.sigreturn's FDE on delve's side; I've left this issue open because I suspect it would be useful to other debuggers to remove it entirely from the executable.
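Roughly, the workaround is a single special case in the unwinder. Here is a sketch of the decision, where funcName and hasFDE are hypothetical stand-ins for the debugger's own symbolization and DWARF FDE lookup; this is not delve's actual code.

package main

import "fmt"

// pickUnwindRule decides how to unwind one frame.
func pickUnwindRule(funcName string, hasFDE bool) string {
	// runtime.sigreturn's FDE describes the signal-return stub rather than a
	// normal call frame, so pretend it has none and fall back to the frame
	// pointer; everything else unwinds via DWARF CFI when an FDE exists.
	if funcName == "runtime.sigreturn" || !hasFDE {
		return "frame pointer"
	}
	return "DWARF CFI"
}

func main() {
	fmt.Println(pickUnwindRule("runtime.sigreturn", true)) // frame pointer
	fmt.Println(pickUnwindRule("main.crash", true))        // DWARF CFI
}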

@gopherbot added the compiler/runtime label Jul 7, 2022
@gopherbot
Contributor

Change https://go.dev/cl/479096 mentions this issue: runtime: add "sigaction" to sigreturn symbol name

@prattmic
Member

With https://go.dev/cl/479096, backtrace from a core looks like so:

(gdb) bt
#0  runtime.raise () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:154
#1  0x000000000043ecc5 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:879
#2  0x000000000043f1de in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>, ~r0=<optimized out>) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:1092
#3  0x000000000043da45 in runtime.sigtrampgo (sig=0, info=0xffffffffffffffff, ctx=0x453981 <runtime.raise+33>) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:432
#4  0x0000000000453c66 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  runtime.raise () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:154
#7  0x000000000043ecc5 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:879
#8  0x000000000043e787 in runtime.crash () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:971
#9  runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0001021a0) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:793
#10 0x000000000043dbab in runtime.sigtrampgo (sig=3, info=0xc00010b4b0, ctx=0xc00010b380) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
#11 0x0000000000453c66 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
#12 <signal handler called>
#13 runtime.futex () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:557
#14 0x0000000000427af0 in runtime.futexsleep (addr=0xca, val=0, ns=4538115) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/os_linux.go:69
#15 0x00000000004087a7 in runtime.notesleep (n=0xc000100148) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/lock_futex.go:160
#16 0x000000000043164c in runtime.mPark () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2323
#17 runtime.stopm () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2323
#18 0x0000000000432e3c in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2996
#19 0x0000000000433bd1 in runtime.schedule () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:3349
#20 0x000000000043016d in runtime.mstart1 () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1494
#21 0x0000000000430076 in runtime.mstart0 () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1444
#22 0x0000000000451d25 in runtime.mstart () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/asm_amd64.s:394
#23 0x0000000000453faf in runtime.clone () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:618
#24 0x0000000000000000 in ?? ()

@prattmic
Member

@aarzilli based on the prior comment, I'm not sure if you expect delve to work better or not, but it (mostly?) does not.

$ dlv version
Delve Debugger
Version: 1.20.1
Build: $Id: 96e65b6c615845d42e0e31d903f6475b0e4ece6e $

Interestingly, the main thread works ok (except for the two frame 15s?):

$ dlv core example.com core
 0  0x0000000000453d01 in runtime.raise
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:154
 1  0x000000000043ed05 in runtime.dieFromSignal
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:879
 2  0x000000000043f21e in runtime.sigfwdgo
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:1092
 3  0x000000000043da85 in runtime.sigtrampgo
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:432
 4  0x0000000000453fe6 in runtime.sigtramp
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
 5  0x00000000004540e0 in runtime.sigreturn
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:466
 6  0x000000000043e7c7 in runtime.crash
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:971
 7  0x000000000043e7c7 in runtime.sighandler
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:793
 8  0x000000000043dbeb in runtime.sigtrampgo
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
 9  0x0000000000453fe6 in runtime.sigtramp
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
10  0x00000000004540e0 in runtime.sigreturn
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:466
11  0x00000000004087a7 in runtime.notesleep
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/lock_futex.go:160
12  0x000000000043168c in runtime.mPark
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1519
13  0x000000000043168c in runtime.stopm
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2323
14  0x0000000000432e7c in runtime.findRunnable
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2996
15  0x0000000000433c11 in runtime.schedule
15  0x0000000000433c11 in runtime.schedule
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:3349
16  0x00000000004301ad in runtime.mstart1
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1494
17  0x00000000004300b6 in runtime.mstart0
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1444
18  0x00000000004520a5 in runtime.mstart
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/asm_amd64.s:395
19  0x000000000045432f in runtime.clone
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:613

But other threads don't:

(dlv) thread 1812158
Switched from 1812159 to 1812158
(dlv) bt
0  0x0000000000453937 in runtime.usleep
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:135
1  0x000000000042c459 in runtime.asyncPreempt2
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/preempt.go:307
   error: error while reading spliced memory at 0x18: EOF
(truncated)

This seems to be using mp.curg.sched? I assume that is expected? Simple mode gets stuck at sigreturn.

(dlv) bt -mode simple                                                                                                                         
0  0x0000000000453cb7 in runtime.usleep       
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:135
1  0x000000000043e7af in runtime.sighandler
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:791
2  0x000000000043dbeb in runtime.sigtrampgo                                                                                                   
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
3  0x0000000000453fe6 in runtime.sigtramp                                                                                                     
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
4  0x00000000004540e0 in runtime.sigreturn                                                                                                    
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:466
5  0x0000000000452541 in runtime.goexit                                                                                                       
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/asm_amd64.s:1619

After https://go.dev/cl/479096, behavior is mostly the same, except it chokes a bit more on sigreturn__sigaction:

(dlv) bt -mode simple
0  0x0000000000453937 in runtime.usleep
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:135
1  0x000000000043e76f in runtime.sighandler
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:791
2  0x000000000043dbab in runtime.sigtrampgo
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
3  0x0000000000453c66 in runtime.sigtramp
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
4  0x0000000000453d60 in runtime.sigreturn__sigaction
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:471
5  0x0000000000000007 in ???
   at ?:-1
   error: error while reading spliced memory at 0x8: EOF

@gopherbot
Contributor

Change https://go.dev/cl/479557 mentions this issue: buildlet: enable core dump creation on container VMs

@prattmic
Member

https://go.dev/cl/479096 fixed 386 and amd64, but all other architectures I've checked (arm64, mips64, riscv) are still broken (Silent skips mask some of these failures from the dashboard). Reopening.

These other architectures have sigreturn defined in the VDSO, and GDB does seem to have appropriate matching logic for the VDSO, so I suspect we aren't handling the frame correctly in runtime.sigtramp, making GDB unable to find the return address to the VDSO.

@prattmic reopened this Mar 27, 2023
@gopherbot
Contributor

Change https://go.dev/cl/479518 mentions this issue: runtime: skip TestGdbCoreSignalBacktrace on arches other than 386/amd64

gopherbot pushed a commit that referenced this issue Mar 27, 2023
For #25218.

Change-Id: I4024a2064e0f56755fe40eb7489ba28eb4358c60
Reviewed-on: https://go-review.googlesource.com/c/go/+/479518
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@aarzilli
Contributor Author

@prattmic I'm looking into fixing dlv's stacktraces on go1.21 given the changes that have been made. As long as gdb can generate a good stacktrace, assume it's a delve problem. However, I don't understand the stacktraces you posted; how was the core file generated?

@aarzilli
Contributor Author

PS: sorry for the delay in responding to this and for not saying anything in the CL; I didn't have time last week.

@prattmic
Member

You can use a simplified program from the test I added:

package main

import (
        "fmt"
        "os"
        "runtime/debug"
        "syscall"
)

func enableCore() {
        debug.SetTraceback("crash")

        var lim syscall.Rlimit
        err := syscall.Getrlimit(syscall.RLIMIT_CORE, &lim)
        if err != nil {
                panic(fmt.Sprintf("error getting rlimit: %v", err))
        }
        lim.Cur = lim.Max
        fmt.Fprintf(os.Stderr, "Setting RLIMIT_CORE = %+#v\n", lim)
        err = syscall.Setrlimit(syscall.RLIMIT_CORE, &lim)
        if err != nil {
                panic(fmt.Sprintf("error setting rlimit: %v", err))
        }
}

func main() {
        enableCore()

        for {}
}
$ go build -o example
$ ./example
$ pkill -QUIT example # from separate shell
$ gdb example core
(gdb) bt
#0  runtime.raise () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:154
#1  0x0000000000442a05 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:894
#2  0x0000000000442f26 in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>, ~r0=<optimized out>) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:1107
#3  0x0000000000441725 in runtime.sigtrampgo (sig=0, info=0xffffffffffffffff, ctx=0x45b3c1 <runtime.raise+33>) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:432
#4  0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  runtime.raise () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:154
#7  0x0000000000442a05 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:894
#8  0x00000000004424c7 in runtime.crash () at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:986
#9  runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0000821a0) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:808
#10 0x000000000044182e in runtime.sigtrampgo (sig=3, info=0xc00008b4b0, ctx=0xc00008b380) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:490
#11 0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352
#12 <signal handler called>
#13 runtime.futex () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:557
#14 0x000000000042b2d0 in runtime.futexsleep (addr=0xca, val=0, ns=4569411) at /usr/local/google/home/mpratt/src/go/src/runtime/os_linux.go:69
#15 0x00000000004094e7 in runtime.notesleep (n=0xc000080148) at /usr/local/google/home/mpratt/src/go/src/runtime/lock_futex.go:160
#16 0x00000000004352cc in runtime.mPark () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#17 runtime.stopm () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#18 0x0000000000436abc in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3064
#19 0x0000000000437851 in runtime.schedule () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3417
#20 0x0000000000437d19 in runtime.park_m (gp=0xc0000071e0) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3568
#21 0x0000000000457d6a in runtime.mcall () at /usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:455
#22 0x0000000000000000 in ?? ()

and other threads:

(gdb) thread apply all bt                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                            
Thread 5 (LWP 3496202):                                                                                                                                                                                                                                                                     
#0  runtime.usleep () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:135                                                                                                                                                                                             
#1  0x00000000004424af in runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0000061a0) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:806                                                                                      
#2  0x000000000044182e in runtime.sigtrampgo (sig=3, info=0xc00000f4b0, ctx=0xc00000f380) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:490                                                                                                                            
#3  0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352                                                                                                                                                                     
#4  <signal handler called>                                                                                                                                                                                                                                                                 
#5  main.main () at /tmp/core/main.go:29                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                            
Thread 4 (LWP 3496204):                                                                                                                                                                                                                                                                     
#0  runtime.usleep () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:135                                                                                                                                                                                             
#1  0x00000000004424af in runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0000069c0) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:806                                                                                      
#2  0x000000000044182e in runtime.sigtrampgo (sig=3, info=0xc00006b4b0, ctx=0xc00006b380) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:490                                                                                                                            
#3  0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352                                                                                                                                                                     
#4  <signal handler called>                                                                                                                                                                                                                                                                 
#5  runtime.futex () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:557
#6  0x000000000042b2d0 in runtime.futexsleep (addr=0xca, val=0, ns=4569411) at /usr/local/google/home/mpratt/src/go/src/runtime/os_linux.go:69
#7  0x00000000004094e7 in runtime.notesleep (n=0xc000052548) at /usr/local/google/home/mpratt/src/go/src/runtime/lock_futex.go:160
#8  0x00000000004352cc in runtime.mPark () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#9  runtime.stopm () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#10 0x0000000000436abc in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3064
#11 0x0000000000437851 in runtime.schedule () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3417
#12 0x0000000000437d19 in runtime.park_m (gp=0xc000006680) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3568
#13 0x0000000000457d6a in runtime.mcall () at /usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:455
#14 0x0000000000000000 in ?? ()

... more ...

There is a lot going on here. Some of the highlights:

  • All threads are in a signal handler, not just one. Most in a SIGQUIT (sig=3) handler.
  • The one that actually crashed has two signal handlers. First SIGQUIT (sig=3, frame 10), then SIGABRT (sig=6, frame 2).

The reason that they are all in SIGQUIT handlers is that we send SIGQUIT to every thread before crashing. This is used to print the per-thread stack dump at the end of the panic output:

Thread dump example
-----                                                                                                                                         

SIGQUIT: quit
PC=0x45b941 m=3 sigcode=0

goroutine 0 [idle]:
runtime.futex()
/usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0xc00005de78 sp=0xc00005de70 pc=0x45b941
runtime.futexsleep(0x0?, 0x0?, 0x1?)
/usr/local/google/home/mpratt/src/go/src/runtime/os_linux.go:69 +0x30 fp=0xc00005dec8 sp=0xc00005de78 pc=0x42b2d0
runtime.notesleep(0xc000052948)
/usr/local/google/home/mpratt/src/go/src/runtime/lock_futex.go:160 +0x87 fp=0xc00005df00 sp=0xc00005dec8 pc=0x4094e7
runtime.mPark(...)
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:1530
runtime.stopm()
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391 +0x8c fp=0xc00005df30 sp=0xc00005df00 pc=0x4352cc
runtime.startlockedm(0x100000004?)
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2646 +0x6a fp=0xc00005df58 sp=0xc00005df30 pc=0x435d6a
runtime.schedule()
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3451 +0x6a fp=0xc00005df90 sp=0xc00005df58 pc=0x43780a
runtime.park_m(0xc000006b60?)
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3568 +0x119 fp=0xc00005dfc0 sp=0xc00005df90 pc=0x437d19
runtime.mcall()
/usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:455 +0x4a fp=0xc00005dfd8 sp=0xc00005dfc0 pc=0x457d6a
rax 0xca
rbx 0x0
rcx 0x45b943
rdx 0x0
rdi 0xc000052948
rsi 0x80
rbp 0xc00005deb8
rsp 0xc00005de70
r8 0x0
r9 0x0
r10 0x0
r11 0x286
r12 0xc00005ded0
r13 0x0
r14 0xc000007040
r15 0xa
rip 0x45b941
rflags 0x286
cs 0x33
fs 0x0
gs 0x0

Once the last thread gets SIGQUIT, crash -> dieFromSignal tries to die from SIGABRT. This is a two-part process. Rather than simply removing the SIGABRT handler, we just mark Go as not wanting the signal and send SIGABRT. This gets us into a nested signal handler. Since there is no C signal handler to forward to, we finally remove the signal handler and send SIGABRT again, this time actually dying.

(FWIW, this is a bit overly complex; the first dieFromSignal could tell that there is nowhere to forward the signal to and just die directly)

@aarzilli
Contributor Author

In this case gdb's output doesn't look very good: none of the threads have a frame with main.main, which you would expect to see. I'm getting the same with delve with my local improvements, btw.

gopherbot pushed a commit to golang/build that referenced this issue Mar 28, 2023
Container-optimized OS sets kernel.core_pattern = "|/bin/false",
effectively disabling core dump creation regardless of RLIMIT_CORE.

We have tests that want to analyze core dumps, so reset core_pattern
back to the default value on boot.

For golang/go#25218.

Change-Id: I7e3cc7496a5428326855cc687b87cb4da76fdd66
Reviewed-on: https://go-review.googlesource.com/c/build/+/479557
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@aarzilli
Contributor Author

Presumably this is because the goroutine that executes main.main isn't running on any thread, but if I stacktrace goroutine 1 with delve I get this:

(dlv) goroutine 1 stack
0  0x000000000042fd59 in runtime.asyncPreempt2
   at /usr/local/go/src/runtime/preempt.go:309
   error: error while reading spliced memory at 0x18: EOF

which is still wrong, but not obviously delve's fault.

@prattmic
Member

That is to be expected. sighandler -> startpanic_m -> freezetheworld does a best-effort STW prior to the panic, so main.main will be descheduled just before the crash. The idea behind this is that the panic traceback can then print all goroutines (we can't do a traceback of a goroutine that is still running).

I actually just merged https://go.dev/cl/478975 to optionally disable this because it makes debugging the scheduler more painful when panic scribbles all over scheduler state.

In #25218 (comment), thread 5 actually does show main.main. I suspect this is because that thread happened to be the one to receive the original SIGQUIT.

@aarzilli
Contributor Author

@prattmic uh, you're right, I didn't see it. I have the same thing in my core file and delve does the same thing as gdb. I was expecting main.main to be on a thread that's running a goroutine, but it isn't. The state of the process is pretty weird when taking a core dump like this.
