Various crashes/segfaults on one host #730
Yikes. Given the completely random crash locations, I'm also leaning towards blaming hardware. Time for a memtest run. 😬
So I went on a memtest expedition and, indeed, found one (1) bad bit in my RAM. I intend to just mask it out in software, but for now I took out the half of the RAM that contained the culprit bit (and two of the three weak bits I also found, but those are fine at normal temperature). I also booted a regular (non-rt) kernel to take out that variable too. node_exporter still promptly crashed, though, in yet another way. Sorry, but I still think it's most likely a software issue (or a really nasty very-specific-software-triggers-very-specific-hardware problem, think CPU bug); node_exporter isn't special enough to just randomly get treated to the single bad bit I had in 32GB of RAM every time it started previously ;-)
Eesh, a crash in the golang GC? Time to figure out which joker hid a gamma radiation source in your data center.
Sorry, that sounded off-putting. However, the stacktrace you've posted still points at hardware issues.
I know, it smells like a hardware problem, but node_exporter is the only software with this kind of issue on an otherwise rather active workstation with quite a heterogeneous workload, which suggests otherwise. Also, I actually own a geiger counter, and I'm getting a slightly elevated (for my location) but entirely within normal background range reading of 0.12µSv/h. So that's out too :-)
The node_exporter reads a lot of data from /proc; one of the stacks you posted showed garbage output from your kernel, which could be related to why it's crashing. But almost all of that is filtered by things like Go's
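(Illustrative aside: the filtering mentioned above presumably refers to ordinary parsing of /proc fields; the use of strconv below is my assumption, not a quote from the node_exporter source. The point is that garbage bytes from the kernel fail to parse and surface as errors, not as memory corruption.)

```go
// Hedged sketch: assumes the "filtering" referred to above is ordinary
// numeric parsing of /proc fields. Garbage bytes simply fail to parse.
package main

import (
	"fmt"
	"strconv"
)

func main() {
	for _, field := range []string{"12345", "garbage\x00bytes"} {
		v, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			fmt.Println("rejected:", err) // bad input becomes an error, not a crash
			continue
		}
		fmt.Println("parsed:", v)
	}
}
```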
Hmm. One random thought: I've heard random rumors of problems with golang in ebuilds before, but nothing definite. Can you try building the binary manually with
I tried the binary from this repo too (it's the one used for the last two crashes). I could try a manual build too, if you think that would help any?
Ah, nevermind then. I'm out of ideas.
Yet another completely distinct crash. This is starting to get amusing.
I'm attaching strace to see if I can catch anything "interesting" in what it was doing next time it crashes.
Also, do you think it might help if I try to bisect the collector list and see if I can narrow it down to a specific one?
Yes, I was trying to see if I could spot a specific collector in any of the traces, but it all seems to be crashes in the prometheus client library, which could point to the actual problem being there rather than in the node_exporter part.
Another way to narrow things down would be to test the previous 0.14.0 release binaries.
OK, I brought up a bunch of parallel instances with a single collector each, plus one control, plus the straced system instance, all being scraped by prom. I'll leave them running overnight and see which ones die, then test an older release.
Well, the control just died as I was about to get some sleep, so instead I brought up all of 0.{11,12,13,14,15}.0 and the just-released 0.15.1. None of the single-collector instances or the straced main one have died yet. We'll see what happens overnight.
Interesting. The only version that survived is 0.11.0; all later ones died (0.12.0, 0.13.0, 0.14.0, 0.15.0, 0.15.1). The straced one is still alive, which suggests the problem might be a race condition or similar, if running it under strace masks it. As for the single-collector instances, only one died: the one running the

I found these two golang issues which sound like they might be related. I also spawned an instance with GOMAXPROCS=1 to see if that also works around the problem.

Addendum: worth noting that my laptop has EDAC support compiled into the kernel, but no compatible hardware, so the globs in edac_linux.go don't match anything. Probably a red herring then.
0.15.1 with
I just compiled and ran this reproducer for golang issue 20427 and it reliably crashes on my host within 10 seconds (go1.9.1 linux/amd64). I think I'm onto something here.
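(For context, a minimal sketch of the general shape of that kind of os/exec stress reproducer. This is not the exact program attached to golang/go#20427, just an assumption of what such a test looks like: several goroutines fork+exec a trivial command in a loop until something breaks.)

```go
// Hedged sketch of an os/exec stress loop; it runs until interrupted, or
// until the runtime crashes on an affected kernel/Go combination.
package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ { // several workers forking in parallel
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if err := exec.Command("/bin/true").Run(); err != nil {
					log.Fatal(err)
				}
			}
		}()
	}
	wg.Wait()
}
```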
The reproducer uses

Thanks for all the reproduction testing; it would be fun to find a Golang bug. :-)
I'm not sure the problem is in os/exec, though. That might just be an easy way to reproduce it. FWIW, if this is indeed a golang bug, this is the second time for me. I had some fun three years ago debugging and fixing a year-old crash in the runtime related to cgo. That one was "fun"...
Of course, I just wanted to make sure it was clear that it's not something we use a lot in our code, and we're actively wanting to remove the
Pivoting to that Go reproducer and assuming it's the same root cause: I've managed to repro it in a VM on three different Intel hosts with different CPU generations, and on an AMD host (see that bug). It seems related to the kernel (but I have three kernel builds that trigger it, so it isn't a one-off bad kernel compile). This is getting fun.
Indeed it is. Is it only on realtime kernels?
Nope, I've been testing mostly 4.13.9-gentoo now. It seems the GCC version used to compile the kernel matters. I've repro'd with the same kernel version and config and patches built on two hosts with the same GCC/ld versions, but not with the same kernel/config/patches built on a third host with an older GCC/ld. Moving kernels around, the reproducibility follows the kernel, not the host I run it on. I swear, if this winds up being a GCC bug subtly breaking the kernel subtly breaking Go subtly breaking node_exporter...
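(For readers wondering how a kernel/GCC build detail reaches a Go program at all: as I understand the upstream issue, on linux/amd64 the Go runtime calls vDSO functions such as clock_gettime directly, e.g. from time.Now, with only a small amount of stack reserved for the call, so extra stack use inside the vDSO, like GCC-inserted stack probes, can overflow it. A trivial sketch that hammers that code path from several goroutines:)

```go
// Sketch: time.Now on linux/amd64 normally goes through the vDSO
// clock_gettime fast path, so this loop exercises the path discussed above.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000000; j++ {
				_ = time.Now()
			}
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```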
Looks like the upstream Go issue, golang/go#20427, has been fixed. It will take some time to get this into a Go release, and then into a node_exporter release.
I haven't confirmed that that patch indeed fixes this bug (it was a conjecture); I'll do so this weekend, though I expect it will.
A bit late, but running the rebuilt node_exporter now. If it's fine in 24h I'll call it fixed.
Nice, hopefully the patch makes it into a Golang release soon.
Well... it died, but for a completely different reason (#738), after gigabytes of logs and pegged CPU usage. I just added
No crashes, looks good! I think this is fixed with that go fix. Feel free to close or leave this issue open until the fix trickles into a Go release and subsequent node_exporter release.
Ha, now hitting the crashes on some unrelated Gentoo infra... that just got updated to GCC 6.4.0 (now stable). Those kernels have CONFIG_OPTIMIZE_INLINING=y, but apparently still end up getting stack probes in the vDSO due to some other combination of config options (I suspect kvmclock, since these are VMs and that definitely adds code to the vDSO). I think for the time being I'll just run with GOMAXPROCS=1 until this trickles down to a Go release.
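(Side note on that workaround: GOMAXPROCS can be set from the environment when starting the binary, so no rebuild is needed. A tiny sketch that merely reports the effective value:)

```go
// Minimal sketch: runtime.GOMAXPROCS(0) queries the current setting without
// changing it; the GOMAXPROCS environment variable overrides the default
// (number of CPUs) at process start.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```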
It looks like the upstream golang fix is in Go 1.9.3. The next release will be built with this version.
What a fascinating read!
@SuperQ I think this can be closed now, right?
Yes, this is now fixed in 0.16 releases, as it's built with Go 1.10.x |
Host operating system: output of `uname -a`:
Linux raider 4.13.7-rt-rt1 #1 SMP PREEMPT RT Mon Nov 6 00:37:13 JST 2017 x86_64 Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz GenuineIntel GNU/Linux
node_exporter version: output of `node_exporter --version`:
Tried both the official binary release:
And the same version, built from source via Gentoo package (from logs):
node_exporter command line flags
node_exporter has been crashing on one host (my laptop) after running for hours (being scraped by prometheus running on another host). The failure messages vary, but seem to suggest some kind of memory corruption.
Crash 1 (self-built): https://mrcn.st/p/tMtz7sQF
Crash 2 (self-built): https://mrcn.st/p/qmZw6trr
Crash 3 (self-built): https://mrcn.st/p/qLYEaOg1
Crash 4 (official release binary): https://mrcn.st/p/x4NGGxF7
I realize this sounds like bad hardware, but this is my daily workstation and it's otherwise reasonably stable (as stable as one can expect a Gentoo ~arch box with a lot of desktop apps, graphics drivers, etc. to be, anyway). I don't have reason to suspect the hardware, and this machine gets plenty of stress testing (it's Gentoo, so lots of compiling). My initial guess is that a wild pointer somewhere is causing breakage which manifests itself in various ways. Any idea how to track this down?