nomad process dies with panic: counter cannot decrease in value
#15861
Comments
Hi @rbastiaans-tc 👋 Thanks for the report. Unfortunately we will need more information in order to understand what's going wrong. Looking at the code, this panic is raised by the Prometheus client library when a counter attempts to add a negative value. This library is used by go-metrics, which Nomad uses for telemetry.

So in order to understand what's going on with your cluster I built custom binaries that output more information when the panic occurs. You can find them at the bottom of this page: https://github.com/hashicorp/nomad/actions/runs/4000378647

One very important thing is that these binaries are for test only and should not be run in production, so if you could, please copy the output here when the panic occurs. The changes included in these binaries can be viewed here:

Another thing that could be relevant: what CPU architecture are you using?
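For readers unfamiliar with the failure mode, here is a minimal sketch of the panic in client_golang itself (not Nomad code): a Prometheus `Counter` panics if it is asked to add a negative value. The metric name below is invented for illustration.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Counters in client_golang are monotonically increasing by contract,
	// so Add panics if it is handed a negative value.
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total", // hypothetical metric name, for illustration only
		Help: "Illustrates the counter panic discussed in this issue.",
	})

	c.Add(1)    // fine
	c.Add(-0.5) // panics: "counter cannot decrease in value"
}
```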
@lgfa29 I'm not sure how soon, or if, we can run that in the short term. As far as I know, this issue has unfortunately only occurred once in our production cluster so far. Upgrading the Nomad version, even in our dev environment, is not that easy, because we are suffering from another Nomad bug where Nomad agent restarts cause jobs using CSI volumes to get lost allocations (#13028). Therefore we cannot do in-place upgrades that easily. Even if we could put your modified version in place easily, it might take a long time for this bug to re-occur. I was planning to upgrade to 1.3.8 first, to hopefully get rid of some of the bugs associated with CSI volumes.

Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with telemetry or monitoring metrics alone, especially in production environments. Wouldn't catching that panic be a relatively easy fix?

I will see about running that modified version, but I'm not sure if I can do that anytime soon.
We are running x86_64 or "amd64" version of Debian.
Ah no worries, I thought this was something that was always happening when your agent started, so it could be tested quickly.
I definitely agree with this, but it's not so simple to fix from the Nomad side. The telemetry library we use is called in several places and, as far as I can tell, we always send values that are supposed to be positive. I will have to discuss with the rest of the team how to best handle this. It would probably require changes to the go-metrics library.
Ah perhaps that wasn't clear from my report. It happened after running Nomad for a long time, on only 1 machine in the cluster so far. Thanks so much for looking into this @lgfa29
Got it, thanks! Yeah, we've been using this library for a while and haven't seen any reports about this before. I also haven't heard from other teams that use it. So it seems like it was an unfortunate, but very rare, situation.
This happened again last night, @lgfa29. Different machine, same Nomad version.
It occurs to me that any stack trace isn't going to be that useful because it's getting messages over a channel, so this is just the prom goroutine. I wonder if we're starting from the wrong position here: if we know we aren't trying to decrement a counter, maybe something else is? The Prometheus client gets metrics from the Go runtime. I don't see anything obvious there, though.
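As an aside, the Go runtime metrics mentioned above typically enter client_golang through its built-in collectors. The sketch below is a generic registration example, not Nomad's actual wiring, just to show that some metric sources live outside application code entirely.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Register the Go runtime and process collectors; these produce
	// metrics (GC, memory, file descriptors, ...) that application
	// code never touches directly.
	reg := prometheus.NewRegistry()
	reg.MustRegister(
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
	)

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```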
Ah that's right, we emit metrics for the runtime environment, and probably other things I'm not remembering right now 😅 I think the best option we have is to move forward with hashicorp/go-metrics#146, which would prevent crashes and also give us more information about which metric is behaving unexpectedly.
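To make that idea concrete, here is a hypothetical guard of the same flavor (not the actual go-metrics change and not its API): a negative counter increment is dropped and reported, instead of reaching the Prometheus sink and panicking. The helper name and metric keys are invented for illustration.

```go
package main

import "log"

// emitCounter is an invented helper sketching the behavior discussed above:
// a negative increment is logged and dropped rather than forwarded to a sink
// that would panic on it.
func emitCounter(sink func(key string, val float64), key string, val float64) {
	if val < 0 {
		log.Printf("metrics: dropping negative increment %v for counter %q", val, key)
		return
	}
	sink(key, val)
}

func main() {
	sink := func(key string, val float64) { log.Printf("counter %s += %v", key, val) }

	emitCounter(sink, "example.cpu.user", 1.5)  // forwarded as usual
	emitCounter(sink, "example.cpu.iowait", -3) // dropped and reported, no panic
}
```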
@tgross @lgfa29 Can we please get that metrics PR merged and into a Nomad release? This is still happening for me on occasion, and in production environments it's not great, especially in combination with that CSI volume bug: whenever Nomad dies from this panic, all our jobs running there that use CSI volumes also get killed and rescheduled. So a fix for just this panic would already help us, regardless of the CSI volume bug, which seems more difficult to tackle. The metrics PR seems like an easier win.
The iowait metric obtained from `/proc/stat` can under some circumstances decrease. The relevant condition is when an interrupt arrives on a different core than the one that gets woken up for the IO, and a particular counter in the kernel for that core gets interrupted. This is documented in the man page for the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except idle) and compare it to the previous busy time to get the counter increment. If the iowait counter decreases and the idle counter increases more than the increase in the total busy time, we can get a negative total. This previously caused a panic in our metrics collection (see #15861), but that is being prevented by reporting an error message instead.

Fix the bug by putting a zero floor on the values we return from the host CPU stats calculator.

Fixes: #15861
Fixes: #18804
Fix in #18835. See my comment here #18804 (comment) for a detailed breakdown of the problem.
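For clarity, here is a simplified sketch of the calculation the commit message describes; it is not Nomad's actual host stats code, just the shape of the zero-floor fix: busy time is everything except idle, and the per-interval delta is clamped at zero so a backwards-moving iowait counter can never yield a negative value.

```go
package main

import "fmt"

// cpuTicks holds a few of the per-CPU counters read from /proc/stat.
// Field names are illustrative; the real file has more columns.
type cpuTicks struct {
	user, system, iowait, idle float64
}

// busy sums everything except idle time.
func busy(t cpuTicks) float64 {
	return t.user + t.system + t.iowait
}

// busyDelta returns the increase in busy time between two samples, floored
// at zero so that a decreasing iowait counter cannot produce a negative
// value that would later be fed to a monotonic counter.
func busyDelta(prev, cur cpuTicks) float64 {
	d := busy(cur) - busy(prev)
	if d < 0 {
		return 0
	}
	return d
}

func main() {
	prev := cpuTicks{user: 100, system: 50, iowait: 10, idle: 500}
	cur := cpuTicks{user: 101, system: 50, iowait: 7, idle: 510} // iowait went backwards
	fmt.Println(busyDelta(prev, cur))                            // 0, not -2
}
```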
I think it is related to prometheus/client_golang#969
@lgfa29 Sorry to disturb you, I have the same error in my project with client_golang 1.12.0. After searching, this was the only result I found, so I left a comment. The goroutine stack is in client_golang; it looks like a bug in client_golang 1.12.0.
No worries @chenk008, I'm mostly trying to understand the exact issue you've experienced. Would you be able to open a new issue and post the exact error from the log output and the Nomad version you're using? Thanks!
Nomad version
Operating system and Environment details
Issue
The Nomad agent process died with a message:
panic: counter cannot decrease in value
See full panic output below.
Reproduction steps
Unknown
Expected Result
Nomad stays running
Actual Result
Nomad process exits
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Nomad config