When there are blocked evaluations, nomad.nomad.blocked_evals.[cpu,memory] are always 0 #13759
Comments
Hi Dan, sorry for the frustration. I'm afraid we use the word "blocked" at least 3 different ways in metrics (if you're curious about our attempts to fix up at least one use: #6480). From the example you posted it does seem like you've found the right one. Note that this metric will only be emitted by the leader server in Nomad clusters, as that is where blocked evaluations are tracked. That number is intended to be useful for autoscaling, so I think you're looking at the right "blocked."

Note that, as recently discussed in #13740, if you're using quotas (Enterprise only), then jobs that are blocked because they would exceed their quota limit are not counted in this metric. The idea is that you don't want your autoscaler to increase cluster capacity when it's the quota that's blocking placement!

Since the […]. Perhaps you were thinking of the […].

I'm going to close this for now since I think you have what you need, but please do not hesitate to reopen it if you have further questions or issues!
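To make the quota caveat above concrete, here is a minimal sketch of a quota specification (Enterprise only); the name and limits are illustrative. A job whose placement would exceed these limits is blocked, but is not counted in blocked_evals:

```hcl
# quota.hcl, applied with `nomad quota apply quota.hcl` (Enterprise only).
# Name, description, and limits are illustrative.
name        = "dev-quota"
description = "Cap resource usage for the dev namespace"

limit {
  region = "global"

  region_limit {
    cpu    = 2500 # MHz
    memory = 2000 # MB
  }
}
```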
@schmichael thanks for the info and the quick response! I may not have been clear enough about what I'm asking. While I can get things working with […], hashicorp/nomad-autoscaler#584 is effectively what I'm after here: the ability to clue the autoscaler in to the fact that there are blocked evals, either split out by datacenter or node_class. For now I have a workaround using […].

I tested this by creating a job that could never be scheduled given the memory available in the cluster, and was surprised that the `nomad.nomad.blocked_evals.[cpu,memory]` metrics stayed at 0. My sample job for reference:

```hcl
job "example" {
  datacenters = ["workers"]

  group "cache" {
    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        ports = ["db"]
      }

      resources {
        cpu = 500
        # This exceeds the free memory in the cluster and sets the
        # metrics I outlined in the issue.
        memory = 5000
      }
    }
  }
}
```
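To make that concrete, here is a rough sketch of the kind of cluster scaling policy I have in mind, assuming the autoscaler's Prometheus APM plugin and AWS ASG target plugin; the policy name, query, bounds, and ASG details are all illustrative:

```hcl
scaling "cluster_workers" {
  enabled = true
  min     = 1
  max     = 10

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "memory_blocked_evals" {
      source = "prometheus"
      # Prometheus rendering of nomad.nomad.blocked_evals.memory.
      query = "nomad_nomad_blocked_evals_memory"

      # Add one node whenever at least one eval is blocked on memory.
      strategy "threshold" {
        lower_bound = 1
        delta       = 1
      }
    }

    target "aws-asg" {
      aws_asg_name        = "nomad-workers"
      node_class          = "workers"
      node_drain_deadline = "5m"
    }
  }
}
```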
Ah! I confused […] with […]. Assuming the datacenter exists and that there is not enough memory available to satisfy the request, I get: […]

Note only the per-job metrics are emitted. However, looking at the evaluation I see it evaluated a node: […]

That node appears to have no node class set, which matters due to the code here: https://github.com/hashicorp/nomad/blob/v1.3.2/nomad/blocked_evals_stats.go#L102-L108

Not sure where things are falling apart, so reopening this for investigation.
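In the meantime, one possible workaround sketch, assuming the problem is specific to nodes with an empty class (as the fix below suggests): give clients an explicit node_class in the agent configuration. The class name here is illustrative:

```hcl
# Nomad client agent configuration (e.g. /etc/nomad.d/client.hcl).
client {
  enabled = true

  # An explicit class means blocked evaluations carry a non-empty
  # class, so the per dc:class stats have a key to be recorded under.
  node_class = "workers"
}
```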
This PR fixes a bug where blocked evaluations with no class set would not have metrics exported at the dc:class scope. Fixes #13759
Nomad version
Nomad v1.3.0
Operating system and Environment details
Nomad running on Debian on AWS
Issue
I'm trying to use `nomad.nomad.blocked_evals.memory` as information to pass to the Nomad autoscaler to scale out the cluster, but the metric is always unset.
It looks like the per-job stats are being populated, but the per-node stats are not. Is there something I'm missing? Per-node stats are enabled in the `telemetry` block, so I'd expect these values to be non-zero when there are blocked allocations.
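For reference, a minimal sketch of the kind of telemetry stanza involved (assuming Prometheus is scraping the agents; the collection interval is illustrative):

```hcl
# Nomad agent configuration (servers and clients).
telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```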
Reproduction steps
Try to deploy a Nomad job with a memory resource request that exceeds what is available in the cluster.
Expected Result
Non-zero metrics
Actual Result
Empty metrics