eval_broker: track enqueue and dequeue times #20329

pkazmierczak · 2024-04-09T09:09:21Z

Adds new metrics to the eval broker that track when evaluations are enqueued and dequeued.

nomad/eval_broker.go

website/content/docs/operations/metrics-reference.mdx

lgfa29

The leadership transition is probably the key point missing. Not sure if there's a "right" answer, so feel free to explore different approaches and trade-offs.

lgfa29 · 2024-04-11T16:31:43Z

nomad/eval_broker.go

@@ -724,6 +754,38 @@ func (b *EvalBroker) ResumeNackTimeout(evalID, token string) error {
 	return nil
 }

+func (b *EvalBroker) handleAckNackLocked(eval *structs.Evaluation) {


There's some potential weirdness here to consider. `Ack` and `Nack` are called with a string ID, but the eval may not exist in the unack queue, so I wonder if it's possible for `enqueueTime` and `dequeueTime` to hold an eval ID that is not in `unack`, in which case those entries would never get cleaned up.

At some point I thought of passing both `evalID` and `eval` here to handle the cleanup, but I was being lazy and dropped it 😅
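
A minimal sketch of the cleanup concern raised above, assuming a stripped-down broker. The type `timingBroker`, its helper methods, and `errNotOutstanding` are hypothetical names for illustration and are not the code in this PR. The idea is to delete the timing entries by ID before looking the eval up in `unack`, so an ID that is missing from `unack` cannot leave entries behind.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errNotOutstanding is a hypothetical error for IDs that are not in unack.
var errNotOutstanding = errors.New("evaluation is not outstanding")

// timingBroker is a hypothetical stand-in for the eval broker. It keeps only
// the fields needed to illustrate the cleanup.
type timingBroker struct {
	mu          sync.Mutex
	unack       map[string]struct{}
	enqueueTime map[string]time.Time
	dequeueTime map[string]time.Time
}

func newTimingBroker() *timingBroker {
	return &timingBroker{
		unack:       make(map[string]struct{}),
		enqueueTime: make(map[string]time.Time),
		dequeueTime: make(map[string]time.Time),
	}
}

// trackEnqueue records an enqueue time without touching unack.
func (b *timingBroker) trackEnqueue(evalID string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.enqueueTime[evalID] = time.Now()
}

// dequeue marks the eval as outstanding and records its dequeue time.
func (b *timingBroker) dequeue(evalID string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.unack[evalID] = struct{}{}
	b.dequeueTime[evalID] = time.Now()
}

// Ack drops the timing entries by ID first, then checks unack. An ID that is
// missing from unack still returns an error, but its timings are gone.
func (b *timingBroker) Ack(evalID string) error {
	b.mu.Lock()
	defer b.mu.Unlock()

	// A real broker would emit its timers here, before deleting the entries.
	delete(b.enqueueTime, evalID)
	delete(b.dequeueTime, evalID)

	if _, ok := b.unack[evalID]; !ok {
		return errNotOutstanding
	}
	delete(b.unack, evalID)
	return nil
}

// trackedEntries reports how many timing entries are still held.
func (b *timingBroker) trackedEntries() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.enqueueTime) + len(b.dequeueTime)
}
```

Keying the cleanup on the ID rather than on the looked-up eval is the design choice here: the error path and the success path both leave the maps without an entry for that ID.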

lgfa29 · 2024-04-11T16:52:37Z

website/content/docs/operations/metrics-reference.mdx

+| `nomad.nomad.broker.wait_time`               | Time elapsed while the evaluation was ready to be processed and waiting to be dequeued                                                                                                                            | ns / Evaluation Wait           | Timer   |
+| `nomad.nomad.broker.process_time`            | Time elapsed while the evaluation was dequeued and finished processing                                                                                                                                            | ns / Evaluation Process        | Timer   |
+| `nomad.nomad.broker.response_time`           | Time elapsed from when the evaluation was last enqueued and finished processing                                                                                                                                   | ns / Evaluation Response       | Timer   |


It may be worth expanding https://developer.hashicorp.com/nomad/docs/operations/monitoring-nomad#scheduling to mention how to use these metrics as well.

I think `response_time` is going to be the important metric to track, and then `wait_time` and `process_time` can help narrow down the problem. An increase in `wait_time` probably means too much load. An increase in `process_time` points more towards a server performance issue (CPU, disk, network, etc.).

We likely won't know exactly what to write until we advance in our explorations, but it's good to keep in mind.
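
To make the relationship between the three timers concrete, here is a small sketch that assumes each evaluation has a single enqueue, dequeue, and finish timestamp. The type `evalTimes`, its methods, and the sample values are made up for illustration. Under that assumption, the response time is the sum of the wait time and the process time.

```go
package main

import "time"

// evalTimes holds hypothetical timestamps for one evaluation.
type evalTimes struct {
	enqueued time.Time
	dequeued time.Time
	finished time.Time
}

// waitTime is the time spent ready in the queue before being dequeued.
func (t evalTimes) waitTime() time.Duration { return t.dequeued.Sub(t.enqueued) }

// processTime is the time between dequeue and the end of processing.
func (t evalTimes) processTime() time.Duration { return t.finished.Sub(t.dequeued) }

// responseTime is the time between enqueue and the end of processing.
func (t evalTimes) responseTime() time.Duration { return t.finished.Sub(t.enqueued) }

// sampleEvalTimes returns made-up timestamps: 2s waiting, then 5s processing.
func sampleEvalTimes() evalTimes {
	start := time.Unix(0, 0)
	return evalTimes{
		enqueued: start,
		dequeued: start.Add(2 * time.Second),
		finished: start.Add(7 * time.Second),
	}
}
```

With these sample values, a rising response time can be attributed to either component by comparing the two parts, which is the diagnostic split described above.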

pkazmierczak · 2024-04-12T15:49:01Z

@lgfa29 @tgross I have a suggestion on how to handle leadership changes in a48e127. I implemented something Luiz suggested in one of his comments: the leader now calls an `evalBroker.Restore` method in `restoreEvals()` instead of the usual `evalBroker.Enqueue`. This isn't very elegant, because we have to pass around the information about whether to track {en,de}queue times inside eval broker methods, but the public methods remain the same, so the interface doesn't change.

Let me know what you think about this.
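
The approach described in this comment could look roughly like the sketch below. It is not the code from a48e127: the type `restoreBroker`, the private `enqueue` helper, and the `track` flag are hypothetical names. Both public methods delegate to one private helper, and only `Enqueue` asks it to record a timestamp.

```go
package main

import (
	"sync"
	"time"
)

// restoreBroker is a hypothetical, stripped-down broker used to show how a
// tracking flag can stay internal while the public methods keep their shape.
type restoreBroker struct {
	mu          sync.Mutex
	ready       []string
	enqueueTime map[string]time.Time
}

func newRestoreBroker() *restoreBroker {
	return &restoreBroker{enqueueTime: make(map[string]time.Time)}
}

// Enqueue adds a new evaluation and records when it was enqueued.
func (b *restoreBroker) Enqueue(evalID string) { b.enqueue(evalID, true) }

// Restore re-adds an evaluation after a leadership change without recording
// an enqueue time, since the original enqueue time is not known here.
func (b *restoreBroker) Restore(evalID string) { b.enqueue(evalID, false) }

// enqueue is the shared helper; track decides whether timing is recorded.
func (b *restoreBroker) enqueue(evalID string, track bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.ready = append(b.ready, evalID)
	if track {
		b.enqueueTime[evalID] = time.Now()
	}
}

// hasEnqueueTime reports whether an enqueue time is recorded for evalID.
func (b *restoreBroker) hasEnqueueTime(evalID string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	_, ok := b.enqueueTime[evalID]
	return ok
}

// readyCount reports how many evaluations are waiting.
func (b *restoreBroker) readyCount() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.ready)
}
```

Restored evaluations are still queued for processing in this sketch; they simply produce no timing entry, which matches the trade-off of accepting a gap in the metrics across a leader election.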

tgross · 2024-04-12T17:45:41Z

> I have a suggestion on how to handle leadership changes in a48e127. I implemented something Luiz suggested in one of his comments: the leader now calls an `evalBroker.Restore` method in `restoreEvals()` instead of the usual `evalBroker.Enqueue`. This isn't very elegant, because we have to pass around the information about whether to track {en,de}queue times inside eval broker methods, but the public methods remain the same, so the interface doesn't change.

This will resolve the issue of misleading metrics by introducing a gap across leader elections. That seems fine so long as we document the expected behavior in the metrics reference.

This doesn't yet resolve the issue of clearing the maps of timings across leader elections. We should clear the maps between terms (either on step-up or step-down; it doesn't really matter which, so whatever is convenient for the broker API).

pkazmierczak · 2024-04-15T08:46:43Z

> This doesn't yet resolve the issue of clearing the maps of timings across leader elections. We should clear the maps between terms (either on step-up or step-down; it doesn't really matter which, so whatever is convenient for the broker API).

Hmm, I thought clearing the map in the flush method, as we do here a48e127#diff-727104c8049d79aa6b8bcd481b366a231033732bd9ad314e408933d6ac25f891R848-R849, would solve this problem, because the revokeLeadership method on the leader https://github.com/hashicorp/nomad/blob/main/nomad/leader.go#L1488 calls it with enabled set to false? I'll keep looking; perhaps I misunderstand how revoking leadership works.

tgross · 2024-04-15T12:54:28Z

> Hmm, I thought clearing the map in the flush method, as we do here a48e127#diff-727104c8049d79aa6b8bcd481b366a231033732bd9ad314e408933d6ac25f891R848-R849, would solve this problem

🤦 Yes, you're right, that'll do it. Somehow I missed that.
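
For readers following the thread, the behavior the two comments agree on can be sketched as below. This is an illustration under assumptions, not the broker's actual code: `leaderBroker`, `flushLocked`, and `trackedEntries` are hypothetical names. Disabling the broker, as happens when leadership is revoked, flushes its state, and the flush replaces both timing maps.

```go
package main

import (
	"sync"
	"time"
)

// leaderBroker is a hypothetical broker that only tracks timings while enabled.
type leaderBroker struct {
	mu          sync.Mutex
	enabled     bool
	enqueueTime map[string]time.Time
	dequeueTime map[string]time.Time
}

func newLeaderBroker() *leaderBroker {
	return &leaderBroker{
		enqueueTime: make(map[string]time.Time),
		dequeueTime: make(map[string]time.Time),
	}
}

// SetEnabled toggles the broker. Disabling it flushes all tracked state, so
// nothing from the previous leadership term survives into the next one.
func (b *leaderBroker) SetEnabled(enabled bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.enabled = enabled
	if !enabled {
		b.flushLocked()
	}
}

// flushLocked resets the timing maps. The caller must hold b.mu.
func (b *leaderBroker) flushLocked() {
	b.enqueueTime = make(map[string]time.Time)
	b.dequeueTime = make(map[string]time.Time)
}

// Enqueue records an enqueue time, but only while the broker is enabled.
func (b *leaderBroker) Enqueue(evalID string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.enabled {
		return
	}
	b.enqueueTime[evalID] = time.Now()
}

// trackedEntries reports how many timing entries are currently held.
func (b *leaderBroker) trackedEntries() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.enqueueTime) + len(b.dequeueTime)
}
```

Clearing on step-down rather than step-up means a follower never holds stale entries, at the cost of allocating fresh maps each time the broker is disabled.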

eval_broker: track enqueue and dequeue times

7ceae15

pkazmierczak marked this pull request as draft April 9, 2024 13:25

Luiz's draft

5c13991

store 10000 evals only

40c733c

pkazmierczak added 2 commits April 11, 2024 10:43

documentation

859a6a6

changelog

67128f5

pkazmierczak self-assigned this Apr 11, 2024

pkazmierczak added the theme/metrics label Apr 11, 2024

pkazmierczak requested review from lgfa29 and tgross and removed request for lgfa29 April 11, 2024 08:45

pkazmierczak marked this pull request as ready for review April 11, 2024 08:46

pkazmierczak requested a review from lgfa29 April 11, 2024 08:46

removed weird formatting changes

13a525f

pkazmierczak mentioned this pull request Apr 11, 2024

Benchmark Tooling hashicorp-forge/nomad-bench#99

Closed

14 tasks

tgross reviewed Apr 11, 2024

lgfa29 reviewed Apr 11, 2024

pkazmierczak added 6 commits April 12, 2024 10:01

Tim's comment on key metrics table

7e37065

pre-allocate maps

383b9ce

watch dequeued time map size before adding keys

268db7f

rename type label to eval_type

21a02c2

cleanup b.enqueuedTime if there's no dequeuedTime

95924e5

documentation

f39e114

evalBroker.Restore method for leadership transitioning

a48e127

jrasell self-requested a review April 15, 2024 09:18

small optimization: only pre-allocate the maps if we're the leader

4837eeb

tgross approved these changes Apr 15, 2024

control flow correction

82ecb8f

pkazmierczak merged commit 0d14dd9 into main Apr 15, 2024
21 checks passed

pkazmierczak deleted the f-enqueue-dequeue-metrics branch April 15, 2024 14:16

philrenaud pushed a commit that referenced this pull request Apr 18, 2024

eval_broker: track enqueue and dequeue times (#20329)

bc98aaa

Adds new metrics to the eval broker that track times of evaluations enqueueing and dequeueing.
