runtime: can not read stacktrace from a core file #25218

Open
aarzilli opened this issue May 2, 2018 · 35 comments

@aarzilli
Contributor

aarzilli commented May 2, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

This problem was introduced by b1d1ec9 (CL 110065) and is still present at tip.

Does this issue reproduce with the latest release?

No.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/a/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/a/n/go/"
GORACE=""
GOROOT="/usr/local/go-tip"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go-tip/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build495464568=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Given:

package main

import "runtime/debug"

func main() {
	debug.SetTraceback("crash")
	crash()
}

func crash() {
	panic("panic!")
}

Running it under gdb will produce this stacktrace:

(gdb) bt
#0  0x000000000044fff4 in runtime.raise () at <autogenerated>:1
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at /usr/local/go-tip/src/runtime/signal_unix.go:424
#2  0x0000000000438dea in runtime.crash () at /usr/local/go-tip/src/runtime/signal_unix.go:526
#3  0x0000000000426487 in runtime.fatalpanic (msgs=<optimized out>) at /usr/local/go-tip/src/runtime/panic.go:696
#4  0x0000000000425e6b in runtime.gopanic (e=...) at /usr/local/go-tip/src/runtime/panic.go:502
#5  0x0000000000470ac9 in main.crash () at /home/a/temp/simple.go:11
#6  0x0000000000470a7b in main.main () at /home/a/temp/simple.go:7

However, letting it produce a core file and then reading the core file with gdb produces this:

$ gdb ./simple simple-core
...
(gdb) bt
#0  0x000000000044fff4 in runtime.raise () at <autogenerated>:1
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at /usr/local/go-tip/src/runtime/signal_unix.go:424
#2  0x00000000004390a8 in runtime.sigfwdgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=6, ~r3=<optimized out>) at /usr/local/go-tip/src/runtime/signal_unix.go:637
#3  0x0000000000438488 in runtime.sigtrampgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=<optimized out>) at /usr/local/go-tip/src/runtime/signal_unix.go:289
#4  0x00000000004502e3 in runtime.sigtramp () at <autogenerated>:1
#5  0x00000000004503d0 in ?? () at <autogenerated>:1
#6  0x0000000000000001 in ?? ()
#7  0x0000000000000000 in ?? ()

I'm not sure what's happening here. Is the signal handler running and overwriting part of the stack?

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

cc @eliasnaur @aclements

@andybons added the NeedsInvestigation label May 2, 2018
@andybons added this to the Unplanned milestone May 2, 2018
@heschi
Contributor

heschi commented May 2, 2018

I assume this is all with GOTRACEBACK=crash. GDB is breaking at the first SIGABRT, which isn't what actually kills the process:

(gdb) run
Starting program: panic 
panic: panic!
[snip]
Program received signal SIGABRT, Aborted.
runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
146		RET
(gdb) bt
#0  runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at go/src/runtime/signal_unix.go:424
#2  0x0000000000438dea in runtime.crash () at go/src/runtime/signal_unix.go:526
#3  0x0000000000426497 in runtime.fatalpanic (msgs=<optimized out>) at go/src/runtime/panic.go:696
#4  0x0000000000425e7b in runtime.gopanic (e=...) at go/src/runtime/panic.go:502
#5  0x00000000004709c9 in main.crash () at go/src/panic.go:11
#6  0x000000000047097b in main.main () at go/src/panic.go:7
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
146		RET
(gdb) bt
#0  runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at go/src/runtime/signal_unix.go:424
#2  0x00000000004390a8 in runtime.sigfwdgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=6, ~r3=<optimized out>) at go/src/runtime/signal_unix.go:637
#3  0x0000000000438488 in runtime.sigtrampgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=<optimized out>) at go/src/runtime/signal_unix.go:289
#4  0x00000000004502a3 in runtime.sigtramp () at go/src/runtime/sys_linux_amd64.s:353
#5  0x0000000000450390 in ?? () at go/src/runtime/sys_linux_amd64.s:437
#6  0x0000000000000007 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

Program terminated with signal SIGABRT, Aborted.
The program no longer exists.

So that much makes sense. Is the real complaint here that the application's stack trace doesn't appear in the core dump?

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

So that much makes sense. Is the real complaint here that the application's stack trace doesn't appear in the core dump?

Yes. I would expect some goroutine's stack to have a frame for main.main and main.crash in it, at least.

@heschi
Contributor

heschi commented May 2, 2018

Well, it would still be on the goroutine's stack, but the thread is on the signal handler stack when it dies, not the goroutine stack. And IIRC Austin said there's no way to convince GDB to show you a different stack for a thread while handling a core file without writing a custom backtracer and even then it was very hard or didn't work or something.

My inclination would be to deregister the signal handler for sig in dieFromSignal. If the first signal kills the process then you'd get the first stack in the core file, which is what we want. But all this stuff is incredibly subtle, so there's probably a reason that wouldn't work, maybe some cgo interop thing.

Does Delve do any better than GDB here? In principle it could show the goroutine stack along with gsignal.

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

Does Delve do any better than GDB here? In principle it could show the goroutine stack along with gsignal.

It used to do better, but it doesn't anymore after that commit. In particular:

Well, it would still be on the goroutine's stack, but the thread is on the signal handler stack when it dies, not the goroutine stack

doesn't look true to me. What it looks like to me is that the thread is running on the normal stack of goroutine 1; if it were a signal handling stack it would have goid == 0, right? Also, how does sigfwdgo call dieFromSignal?

@heschi
Contributor

heschi commented May 2, 2018

Where are you seeing goroutine 1 that you trust? The panic backtrace shows goroutine 1, but that happens before dieFromSignal. gdb shows 1 in info goroutines, but it combines scheduler information with thread state in a way that'll just end up showing whatever the thread's up to, not the user code that was running for goroutine 1.

Also, how does sigfwdgo call dieFromSignal?

dieFromSignal(sig)

or am I missing something?

I'm probably out of my depth at this point. Hopefully Austin or Elias can weigh in.

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

Where are you seeing goroutine 1 that you trust?

I'm seeing it while debugging delve: the TLS for the first thread contains a pointer to the same g struct that's in the first entry of runtime.allgs, and it has goid == 1.
I was under the impression that when a thread is running on one of those special stacks the pointer to g is changed accordingly; am I wrong? Should I check whether sp is inside [g.m.gsignal.stack.stacklo, g.m.gsignal.stack.stackhi]?

or am I missing something?

oh I didn't see that.

@heschi
Contributor

heschi commented May 2, 2018

I was under the impression that when a thread is running on one of those special stacks the pointer to g is changed accordingly; am I wrong? Should I check whether sp is inside [g.m.gsignal.stack.stacklo, g.m.gsignal.stack.stackhi]?

If you want to handle this case I think you have to. This area of the code is responsible for handling signals that might be caused by C code, so it can't blindly muck with Go stuff until it knows what's up. The setg call you hoped had run is here:

setg(g.m.gsignal)

and only runs if sigfwdgo doesn't do anything. sigtrampgo checks the stack pointer to decide if gsignal or g0 is running:
if sp < g.m.gsignal.stack.lo || sp >= g.m.gsignal.stack.hi {
if sp >= g.m.g0.stack.lo && sp < g.m.g0.stack.hi {
so that ought to be a decent model to follow.
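For illustration, a debugger-side version of that check could look roughly like this. StackBounds and the addresses are hypothetical stand-ins for whatever the debugger actually reads out of the core's g.stack, m.gsignal.stack and m.g0.stack fields; this is a sketch of the idea, not delve or gdb code.

package main

import "fmt"

// StackBounds describes one runtime stack as recovered from the core file.
type StackBounds struct{ Lo, Hi uint64 }

// classifyStack mirrors the runtime's check: decide whether a thread's stack
// pointer is on the signal stack, the scheduler (g0) stack, or the current
// goroutine's own stack.
func classifyStack(sp uint64, gsignal, g0, curg StackBounds) string {
	switch {
	case sp >= gsignal.Lo && sp < gsignal.Hi:
		return "gsignal"
	case sp >= g0.Lo && sp < g0.Hi:
		return "g0"
	case sp >= curg.Lo && sp < curg.Hi:
		return "goroutine"
	default:
		return "unknown (cgo stack or corruption?)"
	}
}

func main() {
	// Made-up addresses, purely to exercise the check.
	gsignal := StackBounds{Lo: 0xc000009000, Hi: 0xc00000a000}
	g0 := StackBounds{Lo: 0xc000050000, Hi: 0xc000058000}
	curg := StackBounds{Lo: 0xc000100000, Hi: 0xc000108000}
	fmt.Println(classifyStack(0xc000009ac0, gsignal, g0, curg)) // gsignal
}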

(I still kinda think clearing out the signal handler in dieFromSignal is reasonable.)

@aarzilli
Contributor Author

aarzilli commented May 2, 2018

Ok, the thread's sp is actually inside gsignal's stack. Do you know where the sp for the normal goroutine stack is saved in this case? g.sched.sp is zero.

@heschi
Contributor

heschi commented May 2, 2018

That's a good point. The goroutine wasn't preempted normally, so nothing in the runtime will know what its stack pointer was. The only place you could find it is the ctx argument to the signal handler (the thing we wrap a sigctxt around), and even that will be awkward to interpret in the case of a chain of signal handlers.

@aclements
Member

I still kinda think clearing out the signal handler in dieFromSignal is reasonable.

I think the concern here is if the signal needs to be forwarded. E.g., if you have some SIGABRT-trapping crash handling service installed, we want the signal to get forwarded to it. Maybe dieFromSignal could check if we're forwarding it (which we never will be in pure Go programs) and, if not, go straight to its fallback path that clears the handler and raises?

cc @ianlancetaylor @bcmills

@ianlancetaylor
Member

It is not so crazy to use the signal context to figure out where a signal handler was invoked. That is what gdb does. The key point is to reliably determine whether you are running in a signal handler. I think gdb does that by recognizing the signal trampoline that is on the return stack, which resumes normal execution if the signal handler returns.

@aclements
Member

The only place you could find it is the ctx argument to the signal handler (the thing we wrap a sigctxt around), and even that will be awkward to interpret in the case of a chain of signal handlers.

I don't think the chained signal handlers are a problem. We don't manipulate the context when calling down the chain.

I was thinking we might be able to stash this away in g.sched.sp when we enter the signal handler, but there's still a window between switching to the signal stack and doing that, and it's possible we can receive a signal while g.sched.sp is set to something important.

@ianlancetaylor
Member

I don't think we need to do anything in dieFromSignal, I think that in crash we could clear the signal handler. At least at first glance I'm not worried about forwarding the signal once we are calling crash.

@bcmills
Contributor

bcmills commented May 2, 2018

If we don't forward the signal, we potentially bypass crash reporters (particularly on iOS and Android) and skew people's crash metrics.

But I think I'm missing something here:

the thread is on the signal handler stack when it dies, not the goroutine stack

The saved base pointer in the root frame of the signal stack should point back to the goroutine stack, right? (Do we know why the debuggers aren't following that base pointer? Am I just confused?)
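For concreteness, "following the base pointer" on amd64 just means chasing the saved-RBP chain: each frame stores the caller's RBP at [rbp] and the return address at [rbp+8]. A rough sketch of that walk is below; readMem is a hypothetical stand-in for the debugger's read-from-the-core primitive, and the addresses in main are made up to mimic the backtrace quoted above.

package main

import "fmt"

// walkFramePointers follows the saved-RBP chain. The walk stops when the
// chain leaves readable memory or stops moving toward higher addresses
// (a cheap guard against cycles and corrupted stacks, similar to the
// heuristic gdb reportedly applies).
func walkFramePointers(rbp uint64, readMem func(uint64) (uint64, bool), max int) []uint64 {
	var pcs []uint64
	for i := 0; i < max && rbp != 0; i++ {
		next, ok1 := readMem(rbp)      // caller's saved rbp
		retpc, ok2 := readMem(rbp + 8) // return address
		if !ok1 || !ok2 || retpc == 0 {
			break
		}
		pcs = append(pcs, retpc)
		if next <= rbp {
			break
		}
		rbp = next
	}
	return pcs
}

func main() {
	// A fake two-frame memory image: a signal-stack frame at 0x1000 whose
	// saved rbp points back to a goroutine-stack frame at 0x9000.
	mem := map[uint64]uint64{
		0x1000: 0x9000, 0x1008: 0x470ac9, // saved rbp, return pc (main.crash)
		0x9000: 0x0, 0x9008: 0x470a7b, // saved rbp (end), return pc (main.main)
	}
	read := func(addr uint64) (uint64, bool) { v, ok := mem[addr]; return v, ok }
	fmt.Printf("%#x\n", walkFramePointers(0x1000, read, 16))
}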

@heschi
Contributor

heschi commented May 2, 2018

Since that was sort of addressed to me, I'll say that looking with GDB I think you're right that following the base pointer chain gets you to the right stack, and that seems better than using the signal handler context. Based on Stack Overflow I guess GDB only follows the base pointers when it's out of options.

@ianlancetaylor
Member

Following the base pointer will presumably only work when the base pointer is used by the code being interrupted, so it will be unreliable if the signal fires while executing C code compiled with -fomit-frame-pointer.

@aclements
Member

I think gdb also won't follow a frame pointer if it points to a smaller address, to avoid walking corrupted stacks or going in cycles.

But why doesn't it unwind through the signal context? That I would expect to work here.

@aarzilli
Contributor Author

aarzilli commented May 3, 2018

Maybe it's because you aren't telling it where you saved rdx on the stack?

@aarzilli
Contributor Author

aarzilli commented May 3, 2018

Never mind, I see now that x86_64_fallback_frame_state always knows how to get the context. My next guess is that, if gdb's documentation is correct, x86_64_fallback_frame_state is only used when no FDE is found, which is not the case for runtime.sigreturn.
This was also the problem for delve: if I ignore runtime.sigreturn's FDE it will just follow the frame pointer and end up back on the right stack.

@heschi
Contributor

heschi commented May 25, 2018

@aarzilli, did you want anything else here?

@aarzilli
Contributor Author

I'm satisfied with blacklisting runtime.sigreturn's FDE on delve's side; I've left this issue open because I suspect it would be useful to other debuggers to remove it entirely from the executable.
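Roughly, the workaround is a single special case in the unwinder. Here is a sketch of the decision, where funcName and hasFDE are hypothetical stand-ins for the debugger's own symbolization and DWARF FDE lookup; this is not delve's actual code.

package main

import "fmt"

// pickUnwindRule decides how to unwind one frame.
func pickUnwindRule(funcName string, hasFDE bool) string {
	// runtime.sigreturn's FDE describes the signal-return stub rather than a
	// normal call frame, so pretend it has none and fall back to the frame
	// pointer; everything else unwinds via DWARF CFI when an FDE exists.
	if funcName == "runtime.sigreturn" || !hasFDE {
		return "frame pointer"
	}
	return "DWARF CFI"
}

func main() {
	fmt.Println(pickUnwindRule("runtime.sigreturn", true)) // frame pointer
	fmt.Println(pickUnwindRule("main.crash", true))        // DWARF CFI
}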

@gopherbot added the compiler/runtime label Jul 7, 2022
@gopherbot
Contributor

Change https://go.dev/cl/479096 mentions this issue: runtime: add "sigaction" to sigreturn symbol name

@prattmic
Member

With https://go.dev/cl/479096, backtrace from a core looks like so:

(gdb) bt
#0  runtime.raise () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:154
#1  0x000000000043ecc5 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:879
#2  0x000000000043f1de in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>, ~r0=<optimized out>) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:1092
#3  0x000000000043da45 in runtime.sigtrampgo (sig=0, info=0xffffffffffffffff, ctx=0x453981 <runtime.raise+33>) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:432
#4  0x0000000000453c66 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  runtime.raise () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:154
#7  0x000000000043ecc5 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:879
#8  0x000000000043e787 in runtime.crash () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:971
#9  runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0001021a0) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:793
#10 0x000000000043dbab in runtime.sigtrampgo (sig=3, info=0xc00010b4b0, ctx=0xc00010b380) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
#11 0x0000000000453c66 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
#12 <signal handler called>
#13 runtime.futex () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:557
#14 0x0000000000427af0 in runtime.futexsleep (addr=0xca, val=0, ns=4538115) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/os_linux.go:69
#15 0x00000000004087a7 in runtime.notesleep (n=0xc000100148) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/lock_futex.go:160
#16 0x000000000043164c in runtime.mPark () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2323
#17 runtime.stopm () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2323
#18 0x0000000000432e3c in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2996
#19 0x0000000000433bd1 in runtime.schedule () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:3349
#20 0x000000000043016d in runtime.mstart1 () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1494
#21 0x0000000000430076 in runtime.mstart0 () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1444
#22 0x0000000000451d25 in runtime.mstart () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/asm_amd64.s:394
#23 0x0000000000453faf in runtime.clone () at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:618
#24 0x0000000000000000 in ?? ()

@prattmic
Member

@aarzilli based on the prior comment, I'm not sure if you expect delve to work better or not, but it (mostly?) does not.

$ dlv version
Delve Debugger
Version: 1.20.1
Build: $Id: 96e65b6c615845d42e0e31d903f6475b0e4ece6e $

Interestingly, the main thread works ok (except for the two frame 15s?):

$ dlv core example.com core
 0  0x0000000000453d01 in runtime.raise
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:154
 1  0x000000000043ed05 in runtime.dieFromSignal
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:879
 2  0x000000000043f21e in runtime.sigfwdgo
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:1092
 3  0x000000000043da85 in runtime.sigtrampgo
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:432
 4  0x0000000000453fe6 in runtime.sigtramp
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
 5  0x00000000004540e0 in runtime.sigreturn
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:466
 6  0x000000000043e7c7 in runtime.crash
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:971
 7  0x000000000043e7c7 in runtime.sighandler
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:793
 8  0x000000000043dbeb in runtime.sigtrampgo
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
 9  0x0000000000453fe6 in runtime.sigtramp
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
10  0x00000000004540e0 in runtime.sigreturn
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:466
11  0x00000000004087a7 in runtime.notesleep
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/lock_futex.go:160
12  0x000000000043168c in runtime.mPark
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1519
13  0x000000000043168c in runtime.stopm
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2323
14  0x0000000000432e7c in runtime.findRunnable
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:2996
15  0x0000000000433c11 in runtime.schedule
15  0x0000000000433c11 in runtime.schedule
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:3349
16  0x00000000004301ad in runtime.mstart1
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1494
17  0x00000000004300b6 in runtime.mstart0
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/proc.go:1444
18  0x00000000004520a5 in runtime.mstart
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/asm_amd64.s:395
19  0x000000000045432f in runtime.clone
    at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:613

But other threads don't:

(dlv) thread 1812158
Switched from 1812159 to 1812158
(dlv) bt
0  0x0000000000453937 in runtime.usleep
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:135
1  0x000000000042c459 in runtime.asyncPreempt2
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/preempt.go:307
   error: error while reading spliced memory at 0x18: EOF
(truncated)

This seems to be using mp.curg.sched? I assume that is expected? Simple mode gets stuck at sigreturn.

(dlv) bt -mode simple                                                                                                                         
0  0x0000000000453cb7 in runtime.usleep       
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:135
1  0x000000000043e7af in runtime.sighandler
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:791
2  0x000000000043dbeb in runtime.sigtrampgo                                                                                                   
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
3  0x0000000000453fe6 in runtime.sigtramp                                                                                                     
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
4  0x00000000004540e0 in runtime.sigreturn                                                                                                    
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:466
5  0x0000000000452541 in runtime.goexit                                                                                                       
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/asm_amd64.s:1619

After https://go.dev/cl/479096, behavior is mostly the same, except it chokes a bit more on sigreturn__sigaction:

(dlv) bt -mode simple
0  0x0000000000453937 in runtime.usleep
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:135
1  0x000000000043e76f in runtime.sighandler
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:791
2  0x000000000043dbab in runtime.sigtrampgo
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/signal_unix.go:479
3  0x0000000000453c66 in runtime.sigtramp
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:352
4  0x0000000000453d60 in runtime.sigreturn__sigaction
   at /usr/local/google/home/mpratt/src/go-experiment/src/runtime/sys_linux_amd64.s:471
5  0x0000000000000007 in ???
   at ?:-1
   error: error while reading spliced memory at 0x8: EOF

@gopherbot
Contributor

Change https://go.dev/cl/479557 mentions this issue: buildlet: enable core dump creation on container VMs

@prattmic
Member

https://go.dev/cl/479096 fixed 386 and amd64, but all other architectures I've checked (arm64, mips64, riscv) are still broken (Silent skips mask some of these failures from the dashboard). Reopening.

These other architectures have sigreturn defined in the VDSO, and GDB does seem to have appropriate matching logic for the VDSO, so I suspect we aren't handling the frame correctly in runtime.sigtramp, making GDB unable to find the return address to the VDSO.

@prattmic reopened this Mar 27, 2023
@gopherbot
Contributor

Change https://go.dev/cl/479518 mentions this issue: runtime: skip TestGdbCoreSignalBacktrace on arches other than 386/amd64

gopherbot pushed a commit that referenced this issue Mar 27, 2023
For #25218.

Change-Id: I4024a2064e0f56755fe40eb7489ba28eb4358c60
Reviewed-on: https://go-review.googlesource.com/c/go/+/479518
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@aarzilli
Contributor Author

@prattmic I'm looking into fixing dlv's stacktraces on go1.21 given the changes that have been made. As long as gdb can generate a good stacktrace, assume it's a delve problem. However, I don't understand the stacktraces you posted; how was the core file generated?

@aarzilli
Contributor Author

PS: sorry for the delay in responding to this and for not saying anything in the CL; I didn't have time last week.

@prattmic
Member

You can use a simplified program from the test I added:

package main

import (
        "fmt"
        "os"
        "runtime/debug"
        "syscall"
)

func enableCore() {
        debug.SetTraceback("crash")

        var lim syscall.Rlimit
        err := syscall.Getrlimit(syscall.RLIMIT_CORE, &lim)
        if err != nil {
                panic(fmt.Sprintf("error getting rlimit: %v", err))
        }
        lim.Cur = lim.Max
        fmt.Fprintf(os.Stderr, "Setting RLIMIT_CORE = %+#v\n", lim)
        err = syscall.Setrlimit(syscall.RLIMIT_CORE, &lim)
        if err != nil {
                panic(fmt.Sprintf("error setting rlimit: %v", err))
        }
}

func main() {
        enableCore()

        for {}
}
$ go build -o example
$ ./example
$ pkill -QUIT example # from separate shell
$ gdb example core
(gdb) bt
#0  runtime.raise () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:154
#1  0x0000000000442a05 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:894
#2  0x0000000000442f26 in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>, ~r0=<optimized out>) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:1107
#3  0x0000000000441725 in runtime.sigtrampgo (sig=0, info=0xffffffffffffffff, ctx=0x45b3c1 <runtime.raise+33>) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:432
#4  0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  runtime.raise () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:154
#7  0x0000000000442a05 in runtime.dieFromSignal (sig=6) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:894
#8  0x00000000004424c7 in runtime.crash () at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:986
#9  runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0000821a0) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:808
#10 0x000000000044182e in runtime.sigtrampgo (sig=3, info=0xc00008b4b0, ctx=0xc00008b380) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:490
#11 0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352
#12 <signal handler called>
#13 runtime.futex () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:557
#14 0x000000000042b2d0 in runtime.futexsleep (addr=0xca, val=0, ns=4569411) at /usr/local/google/home/mpratt/src/go/src/runtime/os_linux.go:69
#15 0x00000000004094e7 in runtime.notesleep (n=0xc000080148) at /usr/local/google/home/mpratt/src/go/src/runtime/lock_futex.go:160
#16 0x00000000004352cc in runtime.mPark () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#17 runtime.stopm () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#18 0x0000000000436abc in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3064
#19 0x0000000000437851 in runtime.schedule () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3417
#20 0x0000000000437d19 in runtime.park_m (gp=0xc0000071e0) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3568
#21 0x0000000000457d6a in runtime.mcall () at /usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:455
#22 0x0000000000000000 in ?? ()

and other threads:

(gdb) thread apply all bt                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                            
Thread 5 (LWP 3496202):                                                                                                                                                                                                                                                                     
#0  runtime.usleep () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:135                                                                                                                                                                                             
#1  0x00000000004424af in runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0000061a0) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:806                                                                                      
#2  0x000000000044182e in runtime.sigtrampgo (sig=3, info=0xc00000f4b0, ctx=0xc00000f380) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:490                                                                                                                            
#3  0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352                                                                                                                                                                     
#4  <signal handler called>                                                                                                                                                                                                                                                                 
#5  main.main () at /tmp/core/main.go:29                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                            
Thread 4 (LWP 3496204):                                                                                                                                                                                                                                                                     
#0  runtime.usleep () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:135                                                                                                                                                                                             
#1  0x00000000004424af in runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0xc0000069c0) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:806                                                                                      
#2  0x000000000044182e in runtime.sigtrampgo (sig=3, info=0xc00006b4b0, ctx=0xc00006b380) at /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:490                                                                                                                            
#3  0x000000000045b6a6 in runtime.sigtramp () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:352                                                                                                                                                                     
#4  <signal handler called>                                                                                                                                                                                                                                                                 
#5  runtime.futex () at /usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:557
#6  0x000000000042b2d0 in runtime.futexsleep (addr=0xca, val=0, ns=4569411) at /usr/local/google/home/mpratt/src/go/src/runtime/os_linux.go:69
#7  0x00000000004094e7 in runtime.notesleep (n=0xc000052548) at /usr/local/google/home/mpratt/src/go/src/runtime/lock_futex.go:160
#8  0x00000000004352cc in runtime.mPark () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#9  runtime.stopm () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391
#10 0x0000000000436abc in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3064
#11 0x0000000000437851 in runtime.schedule () at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3417
#12 0x0000000000437d19 in runtime.park_m (gp=0xc000006680) at /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3568
#13 0x0000000000457d6a in runtime.mcall () at /usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:455
#14 0x0000000000000000 in ?? ()

... more ...

There is a lot going on here. Some of the highlights:

  • All threads are in a signal handler, not just one. Most in a SIGQUIT (sig=3) handler.
  • The one that actually crashed has two signal handlers. First SIGQUIT (sig=3, frame 10), then SIGABRT (sig=6, frame 2).

The reason that they are all in SIGQUIT handlers is that we send SIGQUIT to every thread before crashing. This is used to print the per-thread stack dump at the end of the panic output:

Thread dump example
-----                                                                                                                                         

SIGQUIT: quit
PC=0x45b941 m=3 sigcode=0

goroutine 0 [idle]:
runtime.futex()
/usr/local/google/home/mpratt/src/go/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0xc00005de78 sp=0xc00005de70 pc=0x45b941
runtime.futexsleep(0x0?, 0x0?, 0x1?)
/usr/local/google/home/mpratt/src/go/src/runtime/os_linux.go:69 +0x30 fp=0xc00005dec8 sp=0xc00005de78 pc=0x42b2d0
runtime.notesleep(0xc000052948)
/usr/local/google/home/mpratt/src/go/src/runtime/lock_futex.go:160 +0x87 fp=0xc00005df00 sp=0xc00005dec8 pc=0x4094e7
runtime.mPark(...)
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:1530
runtime.stopm()
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2391 +0x8c fp=0xc00005df30 sp=0xc00005df00 pc=0x4352cc
runtime.startlockedm(0x100000004?)
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:2646 +0x6a fp=0xc00005df58 sp=0xc00005df30 pc=0x435d6a
runtime.schedule()
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3451 +0x6a fp=0xc00005df90 sp=0xc00005df58 pc=0x43780a
runtime.park_m(0xc000006b60?)
/usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3568 +0x119 fp=0xc00005dfc0 sp=0xc00005df90 pc=0x437d19
runtime.mcall()
/usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:455 +0x4a fp=0xc00005dfd8 sp=0xc00005dfc0 pc=0x457d6a
rax 0xca
rbx 0x0
rcx 0x45b943
rdx 0x0
rdi 0xc000052948
rsi 0x80
rbp 0xc00005deb8
rsp 0xc00005de70
r8 0x0
r9 0x0
r10 0x0
r11 0x286
r12 0xc00005ded0
r13 0x0
r14 0xc000007040
r15 0xa
rip 0x45b941
rflags 0x286
cs 0x33
fs 0x0
gs 0x0

Once the last thread gets SIGQUIT, crash -> dieFromSignal tries to die from SIGABRT. This is a two-part process. Rather than simply removing the SIGABRT handler, we just mark Go as not wanting the signal and send SIGABRT. This gets us into a nested signal handler. Since there is no C signal handler to forward to, we finally remove the signal handler and send SIGABRT again, this time actually dying.

(FWIW, this is a bit overly complex; the first dieFromSignal could tell that there is nowhere to forward the signal to and just die directly)

@aarzilli
Contributor Author

In this case gdb's output doesn't look very good: none of the threads have a frame with main.main, which you would expect to see. I'm getting the same with delve with my local improvements, btw.

gopherbot pushed a commit to golang/build that referenced this issue Mar 28, 2023
Container-optimized OS sets kernel.core_pattern = "|/bin/false",
effectively disabling core dump creation regardless of RLIMIT_CORE.

We have tests that want to analyze core dumps, so reset core_pattern
back to the default value on boot.

For golang/go#25218.

Change-Id: I7e3cc7496a5428326855cc687b87cb4da76fdd66
Reviewed-on: https://go-review.googlesource.com/c/build/+/479557
Run-TryBot: Michael Pratt <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@aarzilli
Contributor Author

Presumably this is because the goroutine that executes main.main isn't running on any thread, but if I stacktrace goroutine 1 with delve I get this:

(dlv) goroutine 1 stack
0  0x000000000042fd59 in runtime.asyncPreempt2
   at /usr/local/go/src/runtime/preempt.go:309
   error: error while reading spliced memory at 0x18: EOF

which is still wrong, but not obviously delve's fault.

@prattmic
Member

That is to be expected. sighandler -> startpanic_m -> freezetheworld does a best-effort STW prior to the panic, so main.main will be descheduled just before the crash. The idea behind this is that the panic traceback can then print all goroutines (we can't do a traceback of a goroutine that is still running).

I actually just merged https://go.dev/cl/478975 to optionally disable this because it makes debugging the scheduler more painful when panic scribbles all over scheduler state.

In #25218 (comment), thread 5 actually does show main.main. I suspect this is because that thread happened to be the one to receive the original SIGQUIT.

@aarzilli
Contributor Author

@prattmic uh, you're right, I didn't see it. I have the same thing in my core file and delve does the same thing as gdb. I was expecting main.main to be on a thread that's running a goroutine, but it isn't. The state of the process is pretty weird when taking a core dump like this.
