-
Notifications
You must be signed in to change notification settings - Fork 327
Critical: Nondeterministic 100% CPU spin during init() causes programs to hang #1200
Comments
This looks like a scheduling problem somehow - I wonder if it is related to the go1.14 preemptive scheduling? Can you reproduce this on a different OS than macOS? Can you reproduce it with go1.13? If you can't then it would be worth doing a (interested bystander only :-) |
Thanks @ncw! I haven't been able to reproduce it on other OSes, nor have I heard of it surfacing from others using other OSes. However, one other team member @whitestrake has reported this happening on his Mac. I don't know what version or what version of Go, though. I'll see about doing a Go downgrade but since it requires Go 1.14 I'll have to make some changes to get it to work, first. And it does take some time to get the bug to surface through repeated trials, but I'll post any updates here. |
Just a thought... if this is a scheduler problem then you'll likely be able to make it happen quicker with |
We wonder if you can pull out more info about the |
Couple minor updates: I put the program in a bash loop and am able to easily reproduce it... FWIW, it's the code on this branch, and I'm running this command (after
If you see any output, the bug didn't manifest. When the bug manifests, there is no output and Activity Monitor shows 100% CPU use for anywhere between 5s to 50s (or maybe longer) @ncw Nice idea! But after putting I am really beginning to wonder if it's a bug in Go on macOS. I still need to downgrade to Go 1.13 and try again... Edit: Simplified reproduce instructions |
@mknyszek @heschik suggested |
@hyangah Never heard of that... thanks - lots of output!
I've also updated my instructions above to make it much simpler and faster to reproduce, using |
Strange. If goroutine 1 is running, I don't know why only g0s are showing up in the crash traceback. Maybe there really is a scheduler bug?? |
I'm now running my loop on a build that does not depend on smallstep or, by extension, opencensus -- and it has run hundreds of times now and not triggered the bug. Re-adding it, and re-building, I can trigger it in about 30-60 seconds using that loop. So far, I'm fairly convinced that it's a bug in opencensus or smallstep, my dependency that depends on opencensus. There's a chance that it's a bug in the way smallstep is using/initializing opencensus? but I don't know much about that. If it is a Go bug, it seems dependent on importing smallstep+opencensus. |
Upon downgrading to Go 1.13(.8) and recompiling and running the loop again, the bug does not manifest after hundreds of trials. When I switch back to the binary built by Go 1.14, the bug manifests in about 10 iterations or so. So, this might very well be a bug in Go. Anyone know what to do now? Sorry if this isn't directly related to your repo -- I was led here because of a stray goroutine in the stack trace -- I dunno, maybe Go is spinning on the |
If it only happens in 1.14, what happens if you set |
@zikaeroh Interesting! I switched back to Go 1.14. And the bug does not manifest when using I have now run the loop several times with and without that variable set.
|
Makes sense. Should probably mirror that feedback on the Go tracker (as I assume that's where they will actually want to discuss things if this is definitely another item on the async preemption bug tail). |
Okay, this is probably not a bug in opencensus after all -- it's still unclear why importing this package and/or smallstep seems necessary to trigger the bug, but since the bug is being discussed upstream in the Go issue tracker, I will close this. (Noting, however, that some original evidence appears in this issue, so it should still be linked to.) |
Please answer these questions before submitting a bug report.
What version of OpenCensus are you using?
v0.22.2
What version of Go are you using?
What did you do?
I'm using a dependency that uses a dependency that apparently uses opencensus-go.
It seems that importing the package
[email protected]/stats/view
is enough to trigger the bug. I noticed that mymain()
is never called, the CPU is spinning at 100%, and memory usage seems stable.By running
pkill -ABRT main
during the spin, I can get a stack trace:What did you expect to see?
The CPU to not spin at
init()
time, and the program to run.What did you see instead?
Sometimes, the CPU spins at 100% and my
main()
is never called. The program is stuck in the goroutine described by the stack trace above.Additional context
I cannot reliably reproduce it, just have to run the program a bunch of times, but it faithfully appears often enough.
I can't explain why this is happening, but I can only draw conclusions from the stack trace I pasted above. (They all look like that.)
When I eliminate the dependency that depends on this package, I cannot reproduce the spin anymore after many dozens of trials. (Usually it appears after about 10-20.
Related to #1074.
Edit: I have just noticed that the only goroutine that is
running
at the time of the dump isgoroutine 1
, according to the stack trace. Unfortunately it has no information as to what is being executed there... but when I no longer depend on opencensus-go, the problem disappears...init()
function, is there any package-level variable initializations (var ... = ...) that could possibly cause the CPU to spin?The text was updated successfully, but these errors were encountered: