nomad process dies with panic: counter cannot decrease in value #15861

Closed
rbastiaans-tc opened this issue Jan 24, 2023 · 15 comments · Fixed by #18835

@rbastiaans-tc commented Jan 24, 2023

Nomad version

Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56)

Operating system and Environment details

Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster

Issue

The Nomad agent process died with a message: panic: counter cannot decrease in value

See full panic output below.

Reproduction steps

Unknown

Expected Result

Nomad stays running

Actual Result

Nomad process exits

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Sun 2023-01-22 16:57:52 UTC nomadclient013 nomad[3403]: ==> Newer Nomad version available: 1.4.3 (currently running: 1.3.1)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: panic: counter cannot decrease in value
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: goroutine 7598137 [running]:
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*counter).Add(...)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/[email protected]/prometheus/counter.go:109
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc000280080, 0xc00235bf60)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/[email protected]/prometheus/go_collector_go117.go:147 +0x5ec
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/[email protected]/prometheus/registry.go:446 +0x102
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/[email protected]/prometheus/registry.go:457 +0x4e8
Sun 2023-01-22 16:58:41 UTC nomadclient013 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sun 2023-01-22 16:58:41 UTC nomadclient013 systemd[1]: nomad.service: Failed with result 'exit-code'.
Sun 2023-01-22 16:58:43 UTC nomadclient013 systemd[1]: nomad.service: Service RestartSec=2s expired, scheduling restart.
Sun 2023-01-22 16:58:43 UTC nomadclient013 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 1.

Nomad config

data_dir = "/var/lib/nomad"

client {
    enabled = true
    options {
        docker.auth.config = "/etc/docker/dockercfg.json"
    }
    meta {
        az = "az-1"
    }
}

vault {
    enabled = true
    tls_skip_verify = true
    address = "https://xxxx"
    namespace = "xxxx"
}

telemetry {
    publish_allocation_metrics = true
    publish_node_metrics = true
    prometheus_metrics = true
}

consul {
    tags = ["no-http"]
}

@lgfa29 (Contributor) commented Jan 24, 2023

Hi @rbastiaans-tc 👋

Thanks for the report. Unfortunately we will need more information in order to understand what's going wrong.

Looking at the code, this panic is raised by the Prometheus client library when a counter is asked to add a negative value.

This library is used by go-metrics in the IncrCounter and IncrCounterWithLabels methods, but all uses of these methods in Nomad pass either 1 or the len(...) of something, which is always positive.
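
As a minimal illustration (this is not Nomad code, and the metric name is made up), the Prometheus client enforces this at the counter itself:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Counters are monotonic by definition, so Add rejects negative deltas
	// with exactly the panic message seen above.
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_events_total", // hypothetical metric, for illustration only
		Help: "Example counter.",
	})
	c.Add(1)  // fine: counters may only increase
	c.Add(-1) // panic: counter cannot decrease in value
}
```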

So in order to understand what's going on with your cluster I built custom binaries that output more information when the panic occurs. You can find them at the bottom of this page: https://github.com/hashicorp/nomad/actions/runs/4000378647

One very important thing: these binaries are for testing only and should not be run in production. So if you could, please copy the data_dir of the agent presenting this problem somewhere else and run the custom binary pointed at the new directory.

The changes included in these binaries can be viewed here:

Another thing that could be relevant, what CPU architecture are you using?

@rbastiaans-tc (Author) commented

> Hi @rbastiaans-tc 👋

@lgfa29 I'm not sure how soon, or even whether, we can run that in the short term. As far as I know, we have unfortunately only had this issue occur once so far, in our production cluster.

Upgrading the Nomad version, even in our dev environment, is not that easy, because we are suffering from another Nomad bug where Nomad agent restarts cause jobs using CSI volumes to end up with lost allocations (#13028). Therefore we cannot do in-place upgrades that easily.

Even if we could put your modified version in place easily, it might take a long time for this bug to re-occur.

I was planning to upgrade to 1.3.8 first to hopefully get rid of some of these bugs associated with CSI volumes.

Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with telemetry or monitoring metrics alone, especially in production environments.

Wouldn't catching that panic be a relatively easy fix? I will see about running that modified version, but I'm not sure if I can do that anytime soon.

> Another thing that could be relevant, what CPU architecture are you using?

We are running the x86_64 ("amd64") version of Debian.

$ uname -a
Linux xxx 4.19.0-20-cloud-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux

$ file nomad
nomad: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8aa832ffba2495157c039cc24548a9738520f98e, for GNU/Linux 3.2.0, not stripped

@lgfa29 (Contributor) commented Jan 26, 2023

Ah, no worries. I thought this was something that happened every time your agent started, so it could be tested quickly.

> Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with telemetry or monitoring metrics alone, especially in production environments.
>
> Wouldn't catching that panic be a relatively easy fix?

I definitely agree with this, but it's not so simple to fix this from the Nomad side. The telemetry library we use is called in several places and, as far as I can tell, we always send values that are supposed to be positive.

I will have to discuss with the rest of the team how to best handle this. It would probably require changes to the go-metrics module.

@rbastiaans-tc (Author) commented

> Ah, no worries. I thought this was something that happened every time your agent started, so it could be tested quickly.

Ah, perhaps that wasn't clear from my report. It happened after running Nomad for a long time, and only on one machine in the cluster so far.

Thanks so much for looking into this @lgfa29

@lgfa29 (Contributor) commented Jan 27, 2023

Got it, thanks! Yeah, we've been using this library for a while and haven't seen any reports about this before. I also haven't heard anything from other teams that use it. So it seems like it was an unfortunate, but very rare, situation.

lgfa29 removed their assignment Feb 1, 2023

@rbastiaans-tc (Author) commented

This happened again last night @lgfa29

Different machine, same Nomad version.

Mar  2 23:35:58 nomadclient002 nomad[8008]: panic: counter cannot decrease in value
Mar  2 23:35:58 nomadclient002 nomad[8008]: goroutine 8369928 [running]:
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*counter).Add(...)
Mar  2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/[email protected]/prometheus/counter.go:109
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc000058180, 0xc001c11f60)
Mar  2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/[email protected]/prometheus/go_collector_go117.go:147 +0x5ec
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Mar  2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/[email protected]/prometheus/registry.go:446 +0x102
Mar  2 23:35:58 nomadclient002 nomad[8008]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Mar  2 23:35:58 nomadclient002 nomad[8008]: #011github.com/prometheus/[email protected]/prometheus/registry.go:538 +0xb4d
Mar  2 23:35:58 nomadclient002 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar  2 23:35:58 nomadclient002 systemd[1]: nomad.service: Failed with result 'exit-code'.

@tgross (Member) commented Mar 3, 2023

It occurs to me that any stack trace isn't going to be that useful because the collector is getting messages over a channel, so this is just the Prometheus goroutine. I wonder if we're starting from the wrong position here: if we know we aren't trying to decrement a counter, maybe something else is? The Prometheus client gets metrics from the Go runtime. I don't see anything obvious in go_collector.go or in the golang issues for MemStats, but I think it'd be worth some investigation here.
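
For context, here is a minimal sketch (assuming the client_golang collectors package; not Nomad code) of the collection path in that stack trace: the runtime metrics come from the registered Go collector, not from anything Nomad increments via go-metrics.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

func main() {
	reg := prometheus.NewRegistry()

	// The Go collector converts runtime statistics (MemStats, GC counters, etc.)
	// into Prometheus metrics, independently of application counters.
	reg.MustRegister(collectors.NewGoCollector())

	// Gather is the call at the top of the goroutine stack above; it invokes
	// Collect on every registered collector, including the Go collector.
	families, err := reg.Gather()
	if err != nil {
		fmt.Println("gather error:", err)
		return
	}
	fmt.Println("gathered", len(families), "metric families")
}
```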

@lgfa29 (Contributor) commented Mar 10, 2023

Ah that's right, we emit metrics for the runtime environment, and probably other things I'm not remembering right now 😅

I think the best option we have is to move forward with hashicorp/go-metrics#146, which would prevent crashes and also give us more information about which metric is behaving unexpectedly.
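
As a rough illustration only (a hypothetical sketch, not the actual change in that go-metrics PR), the general idea would be to trap the panic at the point where the sink updates a counter and log the offending metric rather than crashing the process:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// safeAdd is a hypothetical guard: if the underlying counter rejects the value
// (for example a negative delta), recover from the panic, log which metric
// misbehaved, and keep the process alive.
func safeAdd(c prometheus.Counter, name string, val float64) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("metrics: dropping update for %q (val=%v): %v", name, val, r)
		}
	}()
	c.Add(val)
}

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_events_total", // hypothetical metric
		Help: "Example counter.",
	})
	safeAdd(c, "example_events_total", -1) // logged and dropped instead of fatal
}
```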

@rbastiaans-tc (Author) commented

@tgross @lgfa29 Can we please get that metrics PR merged and into a Nomad release?

This is still happening for me on occasion, and in production environments that's not great, especially in combination with that CSI volume bug: whenever Nomad dies from this panic, all of our jobs on that node that use CSI volumes also get killed and rescheduled.

So if we could get a fix for at least this panic, that would already help us, regardless of the CSI volume bug, which seems more difficult to tackle. This metrics PR seems like an easier win.

@tgross (Member) commented Oct 19, 2023

@lhossack has reported seeing the logging we added to detect this case in #18804, so that's a good start on next steps for this issue.

tgross added a commit that referenced this issue Oct 23, 2023
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter increment. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Fixes: #15861
Fixes: #18804
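
A minimal sketch of the zero-floor approach described above (the function and values here are assumptions for illustration, not the actual Nomad host stats code):

```go
package main

import "fmt"

// busyDelta returns how much the cumulative "busy" time advanced since the
// previous sample, clamped at zero: because the iowait counter in /proc/stat
// can move backwards, a naive subtraction could otherwise go negative and end
// up as a negative counter increment.
func busyDelta(prevBusy, currBusy float64) float64 {
	delta := currBusy - prevBusy
	if delta < 0 {
		return 0
	}
	return delta
}

func main() {
	// iowait dipped between samples, so the naive delta would be -10.
	fmt.Println(busyDelta(1500.0, 1490.0)) // prints 0
}
```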

@tgross (Member) commented Oct 23, 2023

Fix in #18835. See my comment here #18804 (comment) for a detailed breakdown of the problem.

tgross added this to the 1.7.0 milestone Oct 23, 2023
tgross added a commit that referenced this issue Oct 24, 2023, with the same commit message as above (Fixes: #15861, #18804).

tgross added four more commits that referenced this issue Oct 24, 2023, each with the same commit message as above (Backport-of: #18835).

@chenk008 commented Apr 2, 2024

I think it is related to prometheus/client_golang#969

@lgfa29 (Contributor) commented Apr 4, 2024

Hi @chenk008 👋

Have you encountered this issue after #18835 was released in Nomad 1.6.3?

@chenk008 commented Apr 5, 2024

@lgfa29 Sorry to disturb you. I have the same error in my own project with client_golang 1.12.0. While searching, this issue was the only result I found, so I left a comment. The goroutine stack is inside client_golang; it is a bug in client_golang 1.12.0.

@lgfa29 (Contributor) commented Apr 5, 2024

No worries @chenk008, I'm mostly trying to understand the exact issue you've experienced.

Would you be able to open a new issue and post the exact error from log output and Nomad version you're using?

Thanks!
