Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426

nikola-jokic · 2024-04-11T13:02:27Z

Currently, listener patches the ephemeral runner set when it receives a job complete message, or the last patch target count is different from the desired count.

When one job finishes, 2 things happen in parallel:

Server notifies the listener. This will trigger a patch request to the ephemeral runner set
Ephemeral runner controller starts setting its phase to finished.

However, if the point 1 happens before point 2, then the ephemeral runner set controller will try scaling down (total number > target number). The scale down favors pending pods. The assumption is that they had the least amount of time to start, so they will likely be the last ones to receive a job. However, with the new scaling model on patch ID, that assumption is wrong. There may be situation where the pending pod (that would eventually receive a job) gets deleted, and will only self-correct on job complete message, or on the next batch.

So this PR aims to:

Change the patching behavior. Patching is done whenever something actionable happens. That means, on every message, increase the patch ID to signal that the state should be checked again.
Empty messages are propagated as well. This is a cheap call. Long poll happens every ~50s, and if message is empty, that would effectively be ~1.2 req/minute.
When ephemeral runner set is busy, it does not remove pending/running runners. It assumes that some of them will update their status eventually, and they will be cleaned up. This is also not expensive, since ephemeral runner controller removes the pod and the secret right away, so only ephemeral runner object is going to be persisted during this time.
Put more significance to patch ID 0. PatchID 0 means that we should forcefully reach this state. This occurs in 3 situations now:
4.1. When listener starts. If something happens to the session, and the listener is restarted, it should force the state that is communicated by the actions service
4.2. When there is an empty batch. Empty batch allows us to self-correct in case something unexpected has happened. This will restart the patchID sequence
4.3. When draining mode is on. When the listener is removed, we should force 0 replicas and 0 patch ID, so we can drive this state to completion as quickly as we can.

Ephemeral runner tests are also split into chunks to avoid handling different scenarios in a single test.

Fixes #3420

…ners when cluster is busy

Include self correction on empty batch and avoid removing pending run…

6cd233b

…ners when cluster is busy

nikola-jokic requested review from mumoshu, toast-gear, rentziass and a team as code owners April 11, 2024 13:02

nikola-jokic added the gha-runner-scale-set Related to the gha-runner-scale-set mode label Apr 11, 2024

rentziass approved these changes Apr 16, 2024

View reviewed changes

nikola-jokic merged commit 963ae48 into master Apr 16, 2024
16 checks passed

nikola-jokic deleted the nikola-jokic/self-correct-on-empty-batch branch April 16, 2024 10:55

nikola-jokic added this to the gha-runner-scale-set-0.9.1 milestone Apr 16, 2024

nikola-jokic mentioned this pull request Apr 17, 2024

Prepare 0.9.1 release #3448

Merged

wasd171 mentioned this pull request Apr 17, 2024

0.9.1 does not work with on-demand provisioned nodes #3450

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426

Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426

nikola-jokic commented Apr 11, 2024 •

edited

Loading

Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426

Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426

Conversation

nikola-jokic commented Apr 11, 2024 • edited Loading

nikola-jokic commented Apr 11, 2024 •

edited

Loading