Better metric name for nomad.broker.total_blocked #6480

Closed
preetapan opened this issue Oct 11, 2019 · 1 comment · Fixed by #15835

Comments

@preetapan
Contributor

preetapan commented Oct 11, 2019

The metric broker.total_blocked is poorly named. It actually counts the pending evals in Nomad's eval broker that are held behind another eval for the same job.

Proposing renaming this (better name TBD) with a deprecation path. Plan is to log the metric with two names in a point release. The old name will be removed in a subsequent major release with deprecation notes.
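A minimal sketch of the dual-emission step in that plan, in Go. Everything here is illustrative: `GaugeSink`, `mapSink`, and `emitDuringDeprecation` are hypothetical names, not Nomad's code or any metrics library's API, and `replacementName` is a placeholder because the new name is still TBD.

```go
package main

import "fmt"

// GaugeSink is a hypothetical stand-in for whatever metrics client is in use.
type GaugeSink interface {
	SetGauge(name string, value float64)
}

// mapSink records the last value written for each metric name (demo only).
type mapSink map[string]float64

func (m mapSink) SetGauge(name string, value float64) { m[name] = value }

const (
	// legacyName is the existing metric discussed in this issue.
	legacyName = "nomad.broker.total_blocked"
	// replacementName is a placeholder; the real new name has not been chosen.
	replacementName = "nomad.broker.total_RENAMED_TBD"
)

// emitDuringDeprecation writes the same value under both names, so existing
// dashboards keep working while new ones move to the replacement name.
func emitDuringDeprecation(sink GaugeSink, value float64) {
	sink.SetGauge(legacyName, value)
	sink.SetGauge(replacementName, value)
}

func main() {
	sink := mapSink{}
	emitDuringDeprecation(sink, 3)
	fmt.Println(sink[legacyName], sink[replacementName])
}
```

Once the deprecation window ends, the removal is a single deleted `SetGauge` call for `legacyName`.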

@preetapan preetapan added this to the near-term milestone Oct 11, 2019
@preetapan preetapan modified the milestones: near-term, 0.10.2 Oct 30, 2019
@schmichael schmichael modified the milestones: 0.10.2, 0.10.3 Nov 19, 2019
@preetapan preetapan modified the milestones: 0.10.3, 0.12.0 Jan 29, 2020
@schmichael schmichael removed this from the 0.12.0 milestone Jun 26, 2020
@tgross tgross self-assigned this Jun 20, 2022
@tgross
Member

tgross commented Jun 21, 2022

This came up again recently while debugging #13407. Having a metric named nomad.broker.total_blocked added significant delay, because the metric is easily confused with the evaluation status "blocked" from structs.go#L10668-L10674:

const (
	EvalStatusBlocked   = "blocked"
	EvalStatusPending   = "pending"
	EvalStatusComplete  = "complete"
	EvalStatusFailed    = "failed"
	EvalStatusCancelled = "canceled"
)

The eval broker's internal state is never reflected in the Evaluation.Status field, so a metric name this close to a status value is the source of the confusion, and the rename should avoid repeating it. Arguably "blocked" is a great word here and it's EvalStatusBlocked that would be better to change, but that's written in countless raft stores so it's not terribly practical to rename. 😬 I'd like to push this forward, so let's bikeshed the new name a bit...

The current stats are (ref eval_broker.go#L836-L879 and the Metrics Reference docs):

| metric | description | type |
| --- | --- | --- |
| nomad.nomad.broker.total_blocked | Evaluations that are blocked until an existing evaluation for the same job completes | count |
| nomad.nomad.broker.total_ready | Number of evaluations ready to be processed | count |
| nomad.nomad.broker.total_unacked | Evaluations dispatched for processing but incomplete | count |
| nomad.nomad.broker.total_waiting | Count of evals waiting to be enqueued | count |
| nomad.nomad.broker.<type>_unacked | Count of unacknowledged system evals | count |
| nomad.nomad.broker.<type>_ready | Count of evals in the ready state | count |
| nomad.nomad.broker.eval_waiting | Time elapsed with evaluation waiting to be enqueued | time |

ready is the term used for the evals that have been enqueued, whereas blocked (here) means "waiting on a previous eval for this same job before it can be enqueued". But we've already used "waiting" for those evals that are on a delay (the metrics reference doc is worded ambiguously here, so I'll fix that up as well). So calling these "delayed" would swap the meaning and the name. Ideally we could change both metrics, but repurposing a metric name doesn't have a graceful upgrade path!
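To make the three meanings concrete, here is a toy model in Go. It is only a sketch of the definitions given above, not the logic in eval_broker.go: `toyEval`, `toyStats`, and `classify` are hypothetical names, and the delay is modeled as a plain integer timestamp for simplicity.

```go
package main

import "fmt"

// toyEval is a hypothetical, simplified stand-in for an evaluation.
type toyEval struct {
	ID           string
	JobID        string
	DelayedUntil int64 // illustrative timestamp; 0 means no delay
}

// toyStats mirrors three of the broker counters by meaning, not by code.
type toyStats struct {
	TotalReady   int // enqueued and available for processing
	TotalBlocked int // held behind an earlier eval for the same job
	TotalWaiting int // on a delay, not yet eligible to be enqueued
}

// classify walks the evals in arrival order and buckets each one using the
// definitions from the comment above. This is an assumption-laden toy model.
func classify(evals []toyEval, now int64) toyStats {
	var s toyStats
	jobEnqueued := map[string]bool{}
	for _, e := range evals {
		switch {
		case e.DelayedUntil > now:
			s.TotalWaiting++
		case jobEnqueued[e.JobID]:
			s.TotalBlocked++
		default:
			jobEnqueued[e.JobID] = true
			s.TotalReady++
		}
	}
	return s
}

func main() {
	evals := []toyEval{
		{ID: "a1", JobID: "web"},
		{ID: "a2", JobID: "web"},
		{ID: "a3", JobID: "batch", DelayedUntil: 2000},
	}
	fmt.Printf("%+v\n", classify(evals, 1000))
}
```

In this model the second "web" eval is counted under TotalBlocked, and nothing here touches an evaluation status field, which is the distinction the current metric name obscures.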

Some options:

  • nomad.nomad.broker.total_not_ready: mirror of "ready"
  • nomad.nomad.broker.total_unready: another mirror of "ready"
  • nomad.nomad.broker.total_unenqueued: awkwardly named but better than "pending" which is another eval status
  • nomad.nomad.broker.total_congested: @tgross rummages around in the thesaurus for antonyms of "enqueued"

3 participants