runtime: split stack overflow and broken stack traces #5723
---------- Forwarded message ----------
From: Russ Cox <[email protected]>
Date: Mon, Jun 17, 2013 at 8:27 AM
Subject: runtime CL moratorium
To: golang-dev <[email protected]>

The runtime is too flaky to get work done. See golang.org/issue/5723. Until that issue is closed, please stop writing, sending for review, discussing, and submitting CLs making changes to the runtime.

Thank you.
Russ
Rob, haven't you changed pkg/syscall/exec_bsd.go:59 to use Syscall instead of RawSyscall locally? Otherwise I do not understand how this is possible:

goroutine 9 [running]:
[fp=0x6c3558] syscall.Syscall()
	/Users/r/go/src/pkg/syscall/asm_darwin_amd64.s:15 +0x5
[fp=0x6c35d8] syscall.forkAndExecInChild(0xc210968720, 0xc210176d80, 0x3, 0x3, 0xc21096d160, ...)
	/Users/r/go/src/pkg/syscall/exec_bsd.go:59 +0xf4

Such a change is not safe; it can have bad consequences for the child process.
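For readers following along, here is a minimal sketch of the distinction at issue (illustrative only; it is not the exec_bsd.go source). syscall.Syscall brackets the kernel trap with runtime.entersyscall/exitsyscall so the scheduler can release the thread, while syscall.RawSyscall traps into the kernel directly. In the child of fork, only the raw form is safe, because the scheduler state the wrappers touch is an inconsistent snapshot of the parent's.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// RawSyscall enters the kernel without notifying the Go scheduler.
	// Syscall would additionally call runtime.entersyscall before the
	// trap and runtime.exitsyscall after it; that is the machinery a
	// fork child must avoid. SYS_GETPID is used here only as a cheap,
	// harmless system call to demonstrate the calling convention.
	pid, _, errno := syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)
	if errno != 0 {
		fmt.Println("getpid failed:", errno)
		return
	}
	fmt.Println("pid:", pid)
}
```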
> Notice that the first goroutine in the trace says it is in semacquire. The first stack should always be the running goroutine, so I'd like to know, in addition to the split stack problem, how it is that a goroutine stuck in semacquire can possibly be splitting its stack.

The crash happened on g0; we do not print g0 stacks. The goroutine that was running at the time of the crash (m->curg) is:

goroutine 9 [running]:
[fp=0x6c3558] syscall.Syscall()
	/Users/r/go/src/pkg/syscall/asm_darwin_amd64.s:15 +0x5
[fp=0x6c35d8] syscall.forkAndExecInChild(0xc210968720, 0xc210176d80, 0x3, 0x3, 0xc21096d160, ...)
	/Users/r/go/src/pkg/syscall/exec_bsd.go:59 +0xf4
[fp=0x6c36f8] syscall.forkExec(0xc2104790c0, 0x5b, 0xc210176ca0, 0x2, 0x2, ...)
	/Users/r/go/src/pkg/syscall/exec_unix.go:191 +0x4ac
[fp=0x6c3748] syscall.StartProcess(0xc2104790c0, 0x5b, 0xc210176ca0, 0x2, 0x2, ...)
	/Users/r/go/src/pkg/syscall/exec_unix.go:238 +0x74
[fp=0x6c3848] os.startProcess(0xc2104790c0, 0x5b, 0xc210176ca0, 0x2, 0x2, ...)
	/Users/r/go/src/pkg/os/exec_posix.go:45 +0x33e
[fp=0x6c3898] os.StartProcess(0xc2104790c0, 0x5b, 0xc210176ca0, 0x2, 0x2, ...)
	/Users/r/go/src/pkg/os/doc.go:24 +0x5c
[fp=0x6c3a20] os/exec.(*Cmd).Start(0xc210967b40, 0xc212c82276, 0x5abf00)
	/Users/r/go/src/pkg/os/exec/exec.go:264 +0x587
[fp=0x6c3eb0] main.(*builder).runTest(0xc21008b2a0, 0xc2104390d0, 0x6c3f38, 0x11)
	/Users/r/go/src/cmd/go/test.go:748 +0x809
[fp=0x6c3f48] main.func·005(0xc2104390d0)
	/Users/r/go/src/cmd/go/build.go:661 +0x2e8
[fp=0x6c3fa0] main.func·006()
	/Users/r/go/src/cmd/go/build.go:718 +0x109
[fp=0x6c3fa8] runtime.goexit()
	/Users/r/go/src/pkg/runtime/proc.c:1243
created by main.(*builder).do
	/Users/r/go/src/cmd/go/build.go:724 +0x3b7
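As a reading aid: the [fp=...] values are frame pointers, so the difference between consecutive entries gives the size of the lower frame; for example, 0x6c35d8 - 0x6c3558 = 0x80 bytes for the syscall.Syscall frame. The 0x590 bytes discussed in the next comment are the unexplained gap between the deepest recorded fp and the bottom of the stack.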
Today's scorecard:

1) Rob's split stack overflow. Dmitriy points out that it was goroutine 9 that had the overflow. That's great. However, the stack trace does not make sense. First, Rob did not have any pending changes to package syscall or package runtime. Second, the runtime did recognize the goroutine as part of the problem, so it included [fp=xxx] annotations to dump the frame pointers. It does this if m->throwing && gp == m->curg. That's the same test showframe uses to decide whether to include internal runtime call frames, but those frames are missing from that goroutine's stack. A real program would have called runtime.entersyscall from syscall.Syscall, yet that frame is missing. Also, entersyscall would not have been enough to eat up the 0x500 or so bytes remaining on the stack at that point.

No, the stack being shown for goroutine 9 is completely bogus. It is being dumped by tracebackothers, which uses gp->sched as the initial frame information. But in the case of a running goroutine, gp->sched is stale: it records the pc and sp of the most recent saving of gp->sched, which happened during a call to syscall.Syscall some time ago, and that call is over. The rest of the stack trace may be accurate, since it was read directly from the stack memory, but we don't know what's below: what used up the 0x590 bytes between the fp we have information for and the bottom of the stack. What we'd really like to know are morebuf.sp and morepc, and they are not printed. At the least, we need to make the printing here actually print useful information. But I still think we should be able to puzzle through what might have been happening on the stack to use up the remaining space. Just don't get misled by the syscall.Syscall frame.

2) My index bounds checks not appearing. No progress has been made on this; I have not gotten any real programming done today. It is possible that this is the same problem as (3): the last record I have of rebuilding that machine's Go tree was June 13 at about 12:20 California time. Before that it had not been rebuilt since before Go 1.1 (April 29). What I don't know is whether I observed the missing stacks before or after 12:20.

3) My bad behavior from runtime.Stack in a Go 1.1 program. This was caused by a bug fixed in CL 9307045. Because there was no issue filed about it, we did not consider it for Go 1.1.1, so Go 1.1.1 still suffers from the bug. I created issue #5724 so that we put the fix into Go 1.1.2.
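Since item 3 concerns runtime.Stack, here is a minimal sketch of the user-facing API involved (the program is illustrative; only runtime.Stack itself is the real API). With all=true it formats every goroutine's stack; for goroutines that are not currently running, the frames are derived from saved scheduling state, the same kind of state described above as stale for a running goroutine.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Park a second goroutine so the dump contains a non-running stack.
	block := make(chan struct{})
	go func() { <-block }()
	time.Sleep(10 * time.Millisecond) // crude way to let it park; fine for a demo

	// runtime.Stack(buf, true) writes a traceback of all goroutines
	// into buf and returns the number of bytes written.
	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true)
	fmt.Printf("%s\n", buf[:n])

	close(block)
}
```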
We're also seeing a crash that looks like it might be related to: https://code.google.com/p/go/source/detail?r=bb553699c7e6

Yesterday was a holiday, so I'll check again later today whether we're still seeing this.

My previous message: We're seeing a crash that dvyukov suggested might be related to this CL.

runtime: unknown argument frame size for runtime.copy
fatal error: invalid stack
goroutine 0 [idle]:
runtime: unknown argument frame size for runtime.sigreturn
panic during panic

No stack trace beyond that. I suspect it's happening in code with quite a deep stack that is using the copy builtin on some memory mapped with syscall.Mmap. We've seen crashes with:

linux/amd64 devel +e274daf75c26 Thu Jun 13 17:04:47 2013 -0700
linux/amd64 devel +52d53d0e177e Fri Jun 14 11:14:45 2013 +0200

We also saw one segfault in the same test a few days ago while running:

linux/amd64 devel +ae79f385177d Wed Jun 12 14:05:13 2013 +1000

It didn't produce a stack trace in that case either. It's quite a large chunk of code, but it hasn't seen any changes in a while, so I think the problem is probably in the runtime, not in our code. If it would help, we can try to reduce it to something we can submit to the issue tracker. That will take us a while, so if anyone has any ideas before we start, that would be welcome.
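For anyone trying to reproduce the shape of that code, here is a minimal sketch under stated assumptions (the anonymous mapping, its size, and the payload are placeholders; the real test is far larger and runs with a much deeper stack). It exercises the same combination the report describes: the copy builtin writing into memory obtained from syscall.Mmap.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Map one anonymous read/write page; fd is -1 because the mapping
	// is not backed by a file.
	data, err := syscall.Mmap(-1, 0, 4096,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		fmt.Println("mmap:", err)
		return
	}
	defer syscall.Munmap(data)

	// The copy builtin on the mmap'd slice: the operation the crashing
	// test was suspected of performing deep in its call stack.
	src := []byte("hello from an mmap'd buffer")
	n := copy(data, src)
	fmt.Printf("copied %d bytes: %q\n", n, data[:n])
}
```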
An update:

1) Rob's split stack overflow. All the signs point to the stack trace being correct except for s/Syscall/RawSyscall/ (itself a long story). That is, the goroutine stack was definitely in the call to fork, or it had returned from fork but not yet executed any other calls. That fact, combined with Rob's comment that the go command (which this is clearly a trace for) continued executing instead of dying, very strongly suggests that the split stack overflow happened in the child of fork. I cannot explain how. I have multiple theories, all fairly unlikely, and all involving signals arriving at the child at just the wrong time. The most likely theory combines the signal with an OS X kernel bug. None are worth recounting in detail. CL 10360048, out for review, cleans up the stack trace that a split stack overflow shows. If the bug happens again (and it's not the OS X kernel bug), the stack trace should have enough information next time.

2) I still haven't written any real code, so I haven't run into the problem again. But I am willing to assume for now that I only observed the problem before rebuilding on June 13 around 12:00 PDT and that the fix is the same as for (3). (Until then, my Go tree on that machine had last been built on April 29.)

3) Fixed (see the previous update, in comment #11).

Once CL 10360048 is submitted, I will close this bug, and then we can review the preemption work. After that, I hope there will be no changes to anything involving stacks for a while, so that things can settle.
This issue was closed by revision 6fa3c89. Status changed to Fixed.