Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eval_broker: track enqueue and dequeue times #20329

Merged
merged 15 commits into from
Apr 15, 2024
Merged

Conversation

pkazmierczak
Copy link
Contributor

@pkazmierczak pkazmierczak commented Apr 9, 2024

Adds new metrics to the eval broker that track times of evaluations enqueueing and dequeueing.

@pkazmierczak pkazmierczak self-assigned this Apr 11, 2024
@pkazmierczak pkazmierczak requested review from lgfa29 and tgross and removed request for lgfa29 April 11, 2024 08:45
@pkazmierczak pkazmierczak marked this pull request as ready for review April 11, 2024 08:46
@pkazmierczak pkazmierczak requested a review from lgfa29 April 11, 2024 08:46
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leadership transition is probably the key point missing. Not sure if there's a "right" answer, so feel free to explore different approaches and trade-offs.

nomad/eval_broker.go Outdated Show resolved Hide resolved
nomad/eval_broker.go Outdated Show resolved Hide resolved
@@ -724,6 +754,38 @@ func (b *EvalBroker) ResumeNackTimeout(evalID, token string) error {
return nil
}

func (b *EvalBroker) handleAckNackLocked(eval *structs.Evaluation) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some potential weirdness here to consider. Ack and Nack are called with a string ID, but the eval may not exists in the unack queue, so I wonder if there's a possibility where enqueueTime and dequeueTime have an eval ID that is not in unack, causing them to never get cleaned up.

At some point I thought of passing evalID and eval here to handle the clean up but I was being lazy and dropped it 😅

nomad/eval_broker.go Show resolved Hide resolved
Comment on lines 103 to 105
| `nomad.nomad.broker.wait_time` | Time elapsed while the evaluation was ready to be processed and waiting to be dequeued | ns / Evaluation Wait | Timer |
| `nomad.nomad.broker.process_time` | Time elapsed while the evaluation was dequeued and finished processing | ns / Evaluation Process | Timer |
| `nomad.nomad.broker.response_time` | Time elapsed from when the evaluation was last enqueued and finished processing | ns / Evaluation Response | Timer |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth expanding https://developer.hashicorp.com/nomad/docs/operations/monitoring-nomad#scheduling to mention how to use these metrics as well.

I think tracking response_time is going to be something important to track, and then wait_time and process_time can help nail down the problem. An increase in wait_time probably means too much load. An increase in process_time points more towards a server performance issue (CPU, disk, network etc.)

We likely won't know exactly what to write until we advance in our explorations, but good to keep in mind.

@pkazmierczak
Copy link
Contributor Author

@lgfa29 @tgross I have a suggestion on how to handle leadership changes in a48e127. I implemented something Luiz suggested in one of his comments, basically the leader calls an evalBroker.Restore method now in restoreEvals() instead of the usual evalBroker.Enqueue. This isn't very elegant, because we have to pass around the information about whether to track {en,de}queue times inside eval broker methods, but then again public methods remain the same so the interface doesn't change.

Let me know what you think about this.

@tgross
Copy link
Member

tgross commented Apr 12, 2024

I have a suggestion on how to handle leadership changes in a48e127. I implemented something Luiz suggested in one of his comments, basically the leader calls an evalBroker.Restore method now in restoreEvals() instead of the usual evalBroker.Enqueue. This isn't very elegant, because we have to pass around the information about whether to track {en,de}queue times inside eval broker methods, but then again public methods remain the same so the interface doesn't change.

This will resolve the issue of misleading metrics by introducing a gap across leader elections. That seems fine so long as we document the expected behavior in the metrics reference.

This doesn't yet resolve the issue of clearing the maps of timings across leader elections. We should clear the maps between terms (either on step-up or step-down, doesn't really matter so whatever is convenient for the broker API).

@pkazmierczak
Copy link
Contributor Author

This doesn't yet resolve the issue of clearing the maps of timings across leader elections. We should clear the maps between terms (either on step-up or step-down, doesn't really matter so whatever is convenient for the broker API).

Hmm, I thought clearing the map in the flush method as we do here a48e127#diff-727104c8049d79aa6b8bcd481b366a231033732bd9ad314e408933d6ac25f891R848-R849 would solve this problem, because the revokeLeadership method on leader https://github.com/hashicorp/nomad/blob/main/nomad/leader.go#L1488 calls it with enabled set to false? I'll keep looking, perhaps I misunderstand how revoking leadership works.

@jrasell jrasell self-requested a review April 15, 2024 09:18
@tgross
Copy link
Member

tgross commented Apr 15, 2024

Hmm, I thought clearing the map in the flush method as we do here a48e127#diff-727104c8049d79aa6b8bcd481b366a231033732bd9ad314e408933d6ac25f891R848-R849 would solve this problem

🤦 yes, you're right that'll do it. Somehow I missed that.

@pkazmierczak pkazmierczak merged commit 0d14dd9 into main Apr 15, 2024
21 checks passed
@pkazmierczak pkazmierczak deleted the f-enqueue-dequeue-metrics branch April 15, 2024 14:16
philrenaud pushed a commit that referenced this pull request Apr 18, 2024
Adds new metrics to the eval broker that track times of evaluations enqueueing
and dequeueing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants