Negative metrics in 0.9.0 for service tasks. #5570
Comments
@the-maldridge Can you provide a job spec file that triggers this behavior, along with the output of /v1/metrics on the client where you are seeing it? I tried an example redis job on a test cluster and currently don't see negative values. This could be an edge case triggered by specific resource stanza requirements, so having the job spec would help us debug further. What I see when running a redis service job on a node:
I am seeing this as well for unallocated CPU, memory, etc. Clients with no batch jobs running on them do not emit negative telemetry values. Clients with a mix of batch and service jobs do emit negative values. Nomad v0.9.0. CentOS 7. Dogstatsd telemetry handler. The green trace here is the host with no batch jobs:
@stevenscg - Any other info you can provide? What drivers are these tasks using? Based on my investigation thus far, this looks pretty driver specific; I don't see it so far with docker/raw_exec running a shell script.
All of our jobs use the docker driver. Client telemetry config looks like this:
Server telemetry config looks like this:
Also, the green trace shown above shows a positive value of around 1900, which is what I would expect for this host and cluster.
@stevenscg thanks. Can you also post the JSON response from curl on "/v1/metrics" from one of the negative value nodes (red or purple above)?
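For anyone else collecting the same data, here is a rough sketch of how the /v1/metrics response could be scanned for negative gauges. It assumes the response is JSON with a top-level "Gauges" list whose entries carry "Name", "Value" and "Labels" fields, and that the agent listens on the default HTTP address http://127.0.0.1:4646. The helper names (`negative_gauges`, `fetch_metrics`) and the "unallocated" name filter are illustrative only and not part of Nomad.

```python
import json
import urllib.request


def negative_gauges(summary, name_filter="unallocated"):
    """Return (name, value, labels) for each gauge below zero.

    `summary` is the parsed JSON body of /v1/metrics. The "Gauges",
    "Name", "Value" and "Labels" keys are assumptions about its shape.
    Only gauges whose name contains `name_filter` are considered.
    """
    hits = []
    for gauge in summary.get("Gauges") or []:
        name = gauge.get("Name", "")
        value = gauge.get("Value", 0)
        if name_filter in name and value < 0:
            hits.append((name, value, gauge.get("Labels") or {}))
    return hits


def fetch_metrics(addr="http://127.0.0.1:4646"):
    """Fetch and parse /v1/metrics from a Nomad agent (address is an assumption)."""
    with urllib.request.urlopen(addr + "/v1/metrics") as resp:
        return json.load(resp)
```

Running `negative_gauges(fetch_metrics())` on an affected client should list the gauges in question together with their labels, which is easier to paste into a comment than the full response.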
@preetapan Emailed the metrics response to nomad-oss-debug.
@the-maldridge, @stevenscg: A fix for this has been merged to master and should be part of the upcoming 0.9.2. A Linux build with this fix is attached if you are interested in testing this out.
@cgbaker I'd be happy to test. If I pull from master at that commit, are most other things stable?
I'm still seeing negative metrics emitted to statsd on 0.9.5. The server only runs one job (I've removed the templates with configuration as it contains some sensitive info).
job "traefik" {
  datacenters = ["dc1"]
  type = "service"
  constraint {
    attribute = "${node.unique.name}"
    value = "gateway"
  }
  group "server" {
    count = 1
    ephemeral_disk {
      size = 3000
    }
    task "traefik" {
      driver = "docker"
      vault {
        policies = ["cert"]
      }
      config {
        image = "traefik:1.7.12"
        volumes = [
          "local/traefik.toml:/etc/traefik/traefik.toml",
          "secrets/certs:/certs"
        ]
        port_map {
          https = 443
          dashboard = 8080
        }
      }
      resources {
        network {
          port "https" {
            static = 443
          }
          port "dashboard" {
            static = 8080
          }
        }
        memory = 3500
      }
      service {
        name = "traefik"
        port = "https"
        check {
          name = "Traefik TCP Alive"
          type = "tcp"
          interval = "10s"
          timeout = "2s"
        }
      }
    }
  }
}
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this. Thanks!
Hey @Stale, I'm still seeing the issue.
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this. Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.9.0
Operating system and Environment details
Alpine Linux AMD64
Issue
Unallocated client CPU appears to be affected by service/batch tasks and is reporting negative values for these tasks.
Reproduction steps
Install Nomad 0.9.0, enable telemetry, submit a service job, observe bug.