Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426

Merged
merged 1 commit into from
Apr 16, 2024

Conversation

nikola-jokic
Copy link
Collaborator

@nikola-jokic nikola-jokic commented Apr 11, 2024

Currently, listener patches the ephemeral runner set when it receives a job complete message, or the last patch target count is different from the desired count.

When one job finishes, 2 things happen in parallel:

  1. Server notifies the listener. This will trigger a patch request to the ephemeral runner set
  2. Ephemeral runner controller starts setting its phase to finished.

However, if the point 1 happens before point 2, then the ephemeral runner set controller will try scaling down (total number > target number). The scale down favors pending pods. The assumption is that they had the least amount of time to start, so they will likely be the last ones to receive a job. However, with the new scaling model on patch ID, that assumption is wrong. There may be situation where the pending pod (that would eventually receive a job) gets deleted, and will only self-correct on job complete message, or on the next batch.

So this PR aims to:

  1. Change the patching behavior. Patching is done whenever something actionable happens. That means, on every message, increase the patch ID to signal that the state should be checked again.
  2. Empty messages are propagated as well. This is a cheap call. Long poll happens every ~50s, and if message is empty, that would effectively be ~1.2 req/minute.
  3. When ephemeral runner set is busy, it does not remove pending/running runners. It assumes that some of them will update their status eventually, and they will be cleaned up. This is also not expensive, since ephemeral runner controller removes the pod and the secret right away, so only ephemeral runner object is going to be persisted during this time.
  4. Put more significance to patch ID 0. PatchID 0 means that we should forcefully reach this state. This occurs in 3 situations now:
    4.1. When listener starts. If something happens to the session, and the listener is restarted, it should force the state that is communicated by the actions service
    4.2. When there is an empty batch. Empty batch allows us to self-correct in case something unexpected has happened. This will restart the patchID sequence
    4.3. When draining mode is on. When the listener is removed, we should force 0 replicas and 0 patch ID, so we can drive this state to completion as quickly as we can.

Ephemeral runner tests are also split into chunks to avoid handling different scenarios in a single test.

Fixes #3420

@nikola-jokic nikola-jokic added the gha-runner-scale-set Related to the gha-runner-scale-set mode label Apr 11, 2024
@nikola-jokic nikola-jokic merged commit 963ae48 into master Apr 16, 2024
16 checks passed
@nikola-jokic nikola-jokic deleted the nikola-jokic/self-correct-on-empty-batch branch April 16, 2024 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pipeline gets stuck randomly with "Job is waiting for a runner from XXX to come online"
2 participants