eval_broker: track enqueue and dequeue times #20329
The leadership transition is probably the key point missing. Not sure if there's a "right" answer, so feel free to explore different approaches and trade-offs.
@@ -724,6 +754,38 @@ func (b *EvalBroker) ResumeNackTimeout(evalID, token string) error {
	return nil
}

func (b *EvalBroker) handleAckNackLocked(eval *structs.Evaluation) {
There's some potential weirdness here to consider. `Ack` and `Nack` are called with a string ID, but the eval may not exist in the `unack` queue, so I wonder if there's a possibility where `enqueueTime` and `dequeueTime` have an eval ID that is not in `unack`, causing them to never get cleaned up.
At some point I thought of passing `evalID` and `eval` here to handle the cleanup, but I was being lazy and dropped it 😅
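To make the cleanup concern concrete, here is a minimal sketch of clearing timings by ID rather than through the eval found in `unack`. The `timingBroker` type and its methods are hypothetical stand-ins, not the actual `EvalBroker` code, and locking and token checks are omitted on purpose.

```go
package main

import "time"

// timingBroker is a hypothetical, stripped-down stand-in for the eval broker.
// Only the three maps discussed in the comment are modeled.
type timingBroker struct {
	unack       map[string]struct{}  // evals dequeued but not yet acked/nacked
	enqueueTime map[string]time.Time // when each eval was last enqueued
	dequeueTime map[string]time.Time // when each eval was last dequeued
}

func newTimingBroker() *timingBroker {
	return &timingBroker{
		unack:       make(map[string]struct{}),
		enqueueTime: make(map[string]time.Time),
		dequeueTime: make(map[string]time.Time),
	}
}

// enqueue records the enqueue time for an eval ID.
func (b *timingBroker) enqueue(evalID string) {
	b.enqueueTime[evalID] = time.Now()
}

// dequeue marks the eval as outstanding and records the dequeue time.
func (b *timingBroker) dequeue(evalID string) {
	b.unack[evalID] = struct{}{}
	b.dequeueTime[evalID] = time.Now()
}

// ack clears the timing entries by ID, whether or not the eval is still
// in unack, so a missing unack entry cannot leave timings behind.
// It reports whether the eval was actually outstanding.
func (b *timingBroker) ack(evalID string) bool {
	_, outstanding := b.unack[evalID]
	delete(b.unack, evalID)
	delete(b.enqueueTime, evalID)
	delete(b.dequeueTime, evalID)
	return outstanding
}
```

Keying the cleanup on the ID means an `ack` for an eval that already left `unack` still drops its timings. Whether that is the right behavior for a mismatched ID or token is a separate question this sketch does not try to answer.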
| `nomad.nomad.broker.wait_time` | Time elapsed while the evaluation was ready to be processed and waiting to be dequeued | ns / Evaluation Wait | Timer |
| `nomad.nomad.broker.process_time` | Time elapsed while the evaluation was dequeued and finished processing | ns / Evaluation Process | Timer |
| `nomad.nomad.broker.response_time` | Time elapsed from when the evaluation was last enqueued and finished processing | ns / Evaluation Response | Timer |
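Read together, the three rows above suggest that the response time is the wait time plus the process time for a single enqueue/dequeue cycle. A rough sketch of that relationship, using hypothetical names (`evalTimings`, `computeTimings`) that are not part of the broker:

```go
package main

import "time"

// evalTimings holds the three durations described in the table rows.
type evalTimings struct {
	Wait     time.Duration // ready and waiting to be dequeued
	Process  time.Duration // dequeued until processing finished
	Response time.Duration // last enqueued until processing finished
}

// computeTimings derives the three durations from the timestamps at which
// an eval was last enqueued, dequeued, and finished (acked or nacked).
func computeTimings(enqueued, dequeued, finished time.Time) evalTimings {
	return evalTimings{
		Wait:     dequeued.Sub(enqueued),
		Process:  finished.Sub(dequeued),
		Response: finished.Sub(enqueued),
	}
}

// exampleTimings uses fixed timestamps: dequeued 2s after enqueue,
// finished 5s after dequeue.
func exampleTimings() evalTimings {
	t := time.Unix(0, 0)
	return computeTimings(t, t.Add(2*time.Second), t.Add(7*time.Second))
}
```

With these fixed timestamps the wait is 2s, the process time is 5s, and the response time is their sum, 7s.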
It may be worth expanding https://developer.hashicorp.com/nomad/docs/operations/monitoring-nomad#scheduling to mention how to use these metrics as well.
I think `response_time` is going to be the important one to track, and then `wait_time` and `process_time` can help nail down the problem. An increase in `wait_time` probably means too much load. An increase in `process_time` points more towards a server performance issue (CPU, disk, network, etc.).
We likely won't know exactly what to write until we advance in our explorations, but it's good to keep in mind.
@lgfa29 @tgross I have a suggestion on how to handle leadership changes in a48e127. I implemented something Luiz suggested in one of his comments, basically the leader calls an
Let me know what you think about this.
This will resolve the issue of misleading metrics by introducing a gap across leader elections. That seems fine so long as we document the expected behavior in the metrics reference. This doesn't yet resolve the issue of clearing the maps of timings across leader elections. We should clear the maps between terms (either on step-up or step-down, doesn't really matter so whatever is convenient for the broker API).
Hmm, I thought clearing the map in the
🤦 yes, you're right that'll do it. Somehow I missed that.
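As an illustration of clearing the timing maps between terms, here is one possible shape, assuming the broker is enabled on step-up and disabled on step-down. The `leaderBroker` type and every method on it are hypothetical and only model the timing maps, not the real broker API.

```go
package main

import (
	"sync"
	"time"
)

// leaderBroker is a hypothetical stand-in that only models timing state.
type leaderBroker struct {
	mu          sync.Mutex
	enabled     bool
	enqueueTime map[string]time.Time
	dequeueTime map[string]time.Time
}

func newLeaderBroker() *leaderBroker {
	return &leaderBroker{
		enqueueTime: make(map[string]time.Time),
		dequeueTime: make(map[string]time.Time),
	}
}

// setEnabled toggles the broker. On the enabled-to-disabled transition
// (assumed here to correspond to step-down) the timing maps are replaced,
// so no entries survive into the next term.
func (b *leaderBroker) setEnabled(enabled bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	wasEnabled := b.enabled
	b.enabled = enabled
	if wasEnabled && !enabled {
		b.enqueueTime = make(map[string]time.Time)
		b.dequeueTime = make(map[string]time.Time)
	}
}

// trackEnqueue records an enqueue time only while the broker is enabled.
func (b *leaderBroker) trackEnqueue(evalID string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.enabled {
		return
	}
	b.enqueueTime[evalID] = time.Now()
}

// tracked returns how many evals currently have timing entries.
func (b *leaderBroker) tracked() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.enqueueTime) + len(b.dequeueTime)
}
```

Replacing the maps rather than deleting keys one by one keeps the reset cheap and makes it obvious that nothing carries over between terms.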
Adds new metrics to the eval broker that track evaluation enqueue and dequeue times.