Various crashes/segfaults on one host #730
Yikes. Given the completely random crash locations, I'm also leaning towards blaming hardware. Time for a memtest run. 😬
So I went on a memtest expedition and, indeed, found one (1) bad bit in my RAM. I intend to just mask it out in software, but for now I took out the half of the RAM that contained the culprit bit (and two of the three weak bits I also found, but those are fine at normal temperature). I also booted a regular (non-rt) kernel to take out that variable too. node_exporter still promptly crashed, though, in yet another way. Sorry, but I still think it's most likely a software issue (or a really nasty very-specific-software-triggers-very-specific-hardware problem, think CPU bug); node_exporter isn't special enough to just randomly get treated to the single bad bit I had in 32GB of RAM every time it started previously ;-)
Eesh, a crash in the golang GC? Time to figure out which joker hid a gamma radiation source in your data center.
Sorry, that sounded off-putting. However, the stacktrace you've posted still points at hardware issues.
I know, it smells like a hardware problem, but node_exporter is the only software with this kind of issue on an otherwise rather active workstation with quite a heterogeneous workload, which suggests otherwise. Also, I actually own a geiger counter, and I'm getting a slightly elevated (for my location) but entirely within normal background range reading of 0.12µSv/h. So that's out too :-)
The node_exporter reads a lot of data from /proc; one of the stacks you posted showed garbage output from your kernel, which could be related to why it's crashing. But almost all of that is filtered by things like Go's
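(Illustrative aside: the filtering mentioned above presumably refers to ordinary parsing of /proc fields; the use of strconv below is my assumption, not a quote from the node_exporter source. The point is that garbage bytes from the kernel fail to parse and surface as errors, not as memory corruption.)

```go
// Hedged sketch: assumes the "filtering" referred to above is ordinary
// numeric parsing of /proc fields. Garbage bytes simply fail to parse.
package main

import (
	"fmt"
	"strconv"
)

func main() {
	for _, field := range []string{"12345", "garbage\x00bytes"} {
		v, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			fmt.Println("rejected:", err) // bad input becomes an error, not a crash
			continue
		}
		fmt.Println("parsed:", v)
	}
}
```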
Hmm. One random thought: I've heard random rumors of problems with golang in ebuilds before, but nothing definite. Can you try building the binary manually with
I tried the binary from this repo too (it's the one used for the last two crashes). I could try a manual build too, if you think that would help any?
Ah, nevermind then. I'm out of ideas.
Yet another completely distinct crash. This is starting to get amusing.
I'm attaching strace to see if I can catch anything "interesting" in what it was doing next time it crashes.
Also, do you think it might help if I try to bisect the collector list and see if I can narrow it down to a specific one?
Yes, I was trying to see if I could spot a specific collector in any of the traces, but it all seems to be crashes in the prometheus client library, which could point to the actual problem being there rather than in the node_exporter part.
Another way to narrow things down would be to test the previous 0.14.0 release binaries.
OK, I brought up a bunch of parallel instances with a single collector each, plus one control, plus the straced system instance, all being scraped by prom. I'll leave them running overnight and see which ones die, then test an older release.
Well, the control just died as I was about to get some sleep, so instead I brought up all of 0.{11,12,13,14,15}.0 and the just-released 0.15.1. None of the single-collector instances or the straced main one have died yet. We'll see what happens overnight.
Interesting. The only version that survived is 0.11.0; all later ones died (0.12.0, 0.13.0, 0.14.0, 0.15.0, 0.15.1). The straced one is still alive, which suggests the problem might be a race condition or similar, if running it under strace masks it. As for the single-collector instances, only one died: the one running the

I found these two golang issues which sound like they might be related. I also spawned an instance with GOMAXPROCS=1 to see if that also works around the problem.

Addendum: worth noting that my laptop has EDAC support compiled into the kernel, but no compatible hardware, so the globs in edac_linux.go don't match anything. Probably a red herring then.
0.15.1 with
I just compiled and ran this reproducer for golang issue 20427 and it reliably crashes on my host within 10 seconds (go1.9.1 linux/amd64). I think I'm onto something here.
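(For context, a minimal sketch of the general shape of that kind of os/exec stress reproducer. This is not the exact program attached to golang/go#20427, just an assumption of what such a test looks like: several goroutines fork+exec a trivial command in a loop until something breaks.)

```go
// Hedged sketch of an os/exec stress loop; it runs until interrupted, or
// until the runtime crashes on an affected kernel/Go combination.
package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ { // several workers forking in parallel
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if err := exec.Command("/bin/true").Run(); err != nil {
					log.Fatal(err)
				}
			}
		}()
	}
	wg.Wait()
}
```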
The reproducer uses

Thanks for all the reproduction testing; it would be fun to find a Golang bug. :-)
I'm not sure the problem is in os/exec, though. That might just be an easy way to reproduce it. FWIW, if this is indeed a golang bug, this is the second time for me. I had some fun three years ago debugging and fixing a year-old crash in the runtime related to cgo. That one was "fun"...
Of course, I just wanted to make sure it was clear that it's not something we use a lot in our code, and we're actively wanting to remove the
Pivoting to that Go reproducer and assuming it's the same root cause: I've managed to repro it in a VM on three different Intel hosts with different CPU generations, and on an AMD host (see that bug). It seems related to the kernel (but I have three kernel builds that trigger it, so it isn't a one-off bad kernel compile). This is getting fun.
Indeed it is. Is it only on realtime kernels?
Nope, I've been testing mostly 4.13.9-gentoo now. It seems the GCC version used to compile the kernel matters. I've repro'd with the same kernel version and config and patches built on two hosts with the same GCC/ld versions, but not with the same kernel/config/patches built on a third host with an older GCC/ld. Moving kernels around, the reproducibility follows the kernel, not the host I run it on. I swear, if this winds up being a GCC bug subtly breaking the kernel subtly breaking Go subtly breaking node_exporter...
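(For readers wondering how a kernel/GCC build detail reaches a Go program at all: as I understand the upstream issue, on linux/amd64 the Go runtime calls vDSO functions such as clock_gettime directly, e.g. from time.Now, with only a small amount of stack reserved for the call, so extra stack use inside the vDSO, like GCC-inserted stack probes, can overflow it. A trivial sketch that hammers that code path from several goroutines:)

```go
// Sketch: time.Now on linux/amd64 normally goes through the vDSO
// clock_gettime fast path, so this loop exercises the path discussed above.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000000; j++ {
				_ = time.Now()
			}
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```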
Looks like the upstream Go issue, golang/go#20427, has been fixed. It will take some time to get this into a Go release, and then into a node_exporter release.
I haven't confirmed that that patch indeed fixes this bug (it was a conjecture); I'll do so this weekend, though I expect it will.
A bit late, but running the rebuilt node_exporter now. If it's fine in 24h I'll call it fixed.
Nice, hopefully the patch makes it into a Golang release soon.
Well... it died, but for a completely different reason (#738), after gigabytes of logs and pegged CPU usage. I just added
No crashes, looks good! I think this is fixed with that go fix. Feel free to close or leave this issue open until the fix trickles into a Go release and subsequent node_exporter release.
Ha, now hitting the crashes on some unrelated Gentoo infra... that just got updated to GCC 6.4.0 (now stable). Those kernels have CONFIG_OPTIMIZE_INLINING=y, but apparently still end up getting stack probes in the vDSO due to some other combination of config options (I suspect kvmclock, since these are VMs and that definitely adds code to the vDSO). I think for the time being I'll just run with GOMAXPROCS=1 until this trickles down to a Go release.
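(Side note on that workaround: GOMAXPROCS can be set from the environment when starting the binary, so no rebuild is needed. A tiny sketch that merely reports the effective value:)

```go
// Minimal sketch: runtime.GOMAXPROCS(0) queries the current setting without
// changing it; the GOMAXPROCS environment variable overrides the default
// (number of CPUs) at process start.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```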
It looks like the upstream golang fix is in Go 1.9.3. The next release will be built with this version.
What a fascinating read!
@SuperQ I think this can be closed now, right?
Yes, this is now fixed in 0.16 releases, as it's built with Go 1.10.x |
Host operating system: output of `uname -a`:
Linux raider 4.13.7-rt-rt1 #1 SMP PREEMPT RT Mon Nov 6 00:37:13 JST 2017 x86_64 Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz GenuineIntel GNU/Linux
node_exporter version: output of `node_exporter --version`:
Tried both the official binary release:
And the same version, built from source via Gentoo package (from logs):
node_exporter command line flags
node_exporter has been crashing on one host (my laptop) after running for hours (being scraped by prometheus running on another host). The failure messages vary, but seem to suggest some kind of memory corruption.
Crash 1 (self-built): https://mrcn.st/p/tMtz7sQF
Crash 2 (self-built): https://mrcn.st/p/qmZw6trr
Crash 3 (self-built): https://mrcn.st/p/qLYEaOg1
Crash 4 (official release binary): https://mrcn.st/p/x4NGGxF7
I realize this sounds like bad hardware, but this is my daily workstation and it's otherwise reasonably stable (as stable as one can expect a Gentoo ~arch box with a lot of desktop apps, graphics drivers, etc. to be, anyway). I don't have reason to suspect the hardware, and this machine gets plenty of stress testing (it's Gentoo, so lots of compiling). My initial guess is that a wild pointer somewhere is causing breakage which manifests itself in various ways. Any idea how to track this down?