Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426
Currently, the listener patches the ephemeral runner set when it receives a job-complete message, or when the last patched target count differs from the desired count.
When one job finishes, 2 things happen in parallel:
However, if point 1 happens before point 2, the ephemeral runner set controller will try to scale down (total number > target number). The scale-down favors pending pods, on the assumption that they have had the least time to start and so will likely be the last ones to receive a job. With the new scaling model based on patch ID, that assumption is wrong. A pending pod that would eventually receive a job may be deleted, and the state only self-corrects on a job-complete message or on the next batch.
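To make the problem concrete, here is a minimal sketch of a scale-down that favors pending runners. It is an illustration only: `Runner`, `Phase`, and `pickForScaleDown` are hypothetical names, not the controller's actual types or functions.

```go
package main

// Phase is a simplified, hypothetical runner state used only in this sketch.
type Phase int

const (
	Pending Phase = iota
	Running
)

// Runner is a hypothetical stand-in for an ephemeral runner resource.
type Runner struct {
	Name  string
	Phase Phase
}

// pickForScaleDown returns the names of the runners to delete so that at
// most target runners remain. Pending runners are chosen before running
// ones, mirroring the "favor pending pods" policy described above.
func pickForScaleDown(runners []Runner, target int) []string {
	excess := len(runners) - target
	if excess <= 0 {
		return nil // nothing to scale down
	}
	var victims []string
	for _, phase := range []Phase{Pending, Running} {
		for _, r := range runners {
			if len(victims) == excess {
				return victims
			}
			if r.Phase == phase {
				victims = append(victims, r.Name)
			}
		}
	}
	return victims
}
```

With runners `a` (running), `b` (pending), and `c` (running) and a target of 2, this sketch selects `b`. If `b` is the runner that would have picked up the next job, the count stays wrong until a later patch corrects it.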
So this PR aims to:
4.1. When the listener starts. If something happens to the session and the listener is restarted, it should force the state communicated by the Actions service.
4.2. When there is an empty batch. An empty batch lets us self-correct in case something unexpected has happened. This restarts the patch ID sequence.
4.3. When draining mode is on. When the listener is removed, we should force 0 replicas and patch ID 0, so that we can drive this state to completion as quickly as possible.
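The three cases above can be sketched roughly as follows. This is not the listener's actual code: `Listener`, `Patch`, and `NextPatch` are hypothetical names, and the sketch assumes that patch ID 0 is the value that marks a forced state in all three cases.

```go
package main

// Patch is a hypothetical representation of what the listener writes to
// the ephemeral runner set.
type Patch struct {
	Replicas int
	PatchID  int
}

// Listener holds the hypothetical state needed to number patches.
type Listener struct {
	lastPatchID int
	started     bool
	draining    bool
}

// NextPatch computes the patch for one batch. batchSize is the number of
// messages in the batch and desired is the runner count requested by the
// service. Assumption: patch ID 0 means "force this state".
func (l *Listener) NextPatch(batchSize, desired int) Patch {
	switch {
	case l.draining:
		// 4.3: draining forces 0 replicas and patch ID 0.
		l.lastPatchID = 0
		return Patch{Replicas: 0, PatchID: 0}
	case !l.started, batchSize == 0:
		// 4.1 and 4.2: first patch after start, or an empty batch,
		// forces the desired state and restarts the sequence.
		l.started = true
		l.lastPatchID = 0
		return Patch{Replicas: desired, PatchID: 0}
	default:
		// Regular batch: continue the patch ID sequence.
		l.lastPatchID++
		return Patch{Replicas: desired, PatchID: l.lastPatchID}
	}
}
```

In this sketch, a regular batch after a forced patch gets patch ID 1 again, which is one way to read "restart the patch ID sequence".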
Ephemeral runner tests are also split into chunks to avoid handling different scenarios in a single test.
Fixes #3420