
RFC#0192 - Ensure workers do not get unnecessarily killed #192

Open · wants to merge 1 commit into base: main
Conversation

JohanLorenzo

No description provided.

JohanLorenzo requested a review from a team as a code owner on June 20, 2024.
JohanLorenzo requested reviews from lotas, petemoore and matt-boris, and removed the request for a team, on June 20, 2024.
JohanLorenzo changed the title from "RFC#0192 - ensures workers do not get unnecessarily killed" to "RFC#0192 - Ensure workers do not get unnecessarily killed" on June 20, 2024.

### What if we deploy new worker images?

Long-lived workers will have to be killed if there's a change in their config, including
Contributor

This goes along with the #191 proposal quite well.
With the introduction of launch configurations, workers would be able to query worker-manager periodically to check whether their config is still active, and restart or shut down if it is not.

We could probably create a more generic endpoint that answers whether a worker should terminate:

- yes, if the config has changed
- no, if minCapacity would no longer be observed

However, I'm not totally sure how to solve the case where all workers ask at once whether they can shut down and all get "yes", only to leave zero running workers afterwards.
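
A minimal sketch of what that worker-side check might look like, assuming a hypothetical should-terminate endpoint on worker-manager (the URL shape, response fields, and `pollShouldTerminate` helper are illustrative only, not an existing API):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// terminationAdvice mirrors a hypothetical worker-manager response:
// Terminate is true when the launch configuration changed, and false when
// shutting down would break the minCapacity promise.
type terminationAdvice struct {
	Terminate bool   `json:"terminate"`
	Reason    string `json:"reason"`
}

// pollShouldTerminate periodically asks worker-manager whether this worker
// should shut down, and returns once the answer is "yes".
func pollShouldTerminate(baseURL, workerPoolID, workerID string, interval time.Duration) {
	url := fmt.Sprintf("%s/worker-pool/%s/worker/%s/should-terminate", baseURL, workerPoolID, workerID)
	for {
		resp, err := http.Get(url)
		if err != nil {
			// Transient network error: err on the side of keeping the worker alive.
			time.Sleep(interval)
			continue
		}
		var advice terminationAdvice
		decodeErr := json.NewDecoder(resp.Body).Decode(&advice)
		resp.Body.Close()
		if decodeErr == nil && advice.Terminate {
			return // caller drains the current task and shuts the worker down
		}
		time.Sleep(interval)
	}
}

func main() {
	// Illustrative values only; a real worker would take these from its config.
	pollShouldTerminate("https://worker-manager.example.com", "my-provisioner/my-worker-type", "worker-123", 5*time.Minute)
	fmt.Println("worker-manager says this worker should terminate")
}
```

On transient errors the worker keeps running and retries, which errs on the side of not killing workers, in line with the goal of this RFC.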

Author

> This goes along with the #191 proposal quite well.

Nice! 👍

> However, I'm not totally sure how to solve the case where all workers ask at once whether they can shut down and all get "yes", only to leave zero running workers afterwards.

I guess that's an okay behavior, at least for a first iteration. Maybe worker-manager could remember how many workers are being shut down so it doesn't tell more workers to turn off.
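
A rough sketch of the bookkeeping that idea could imply on the worker-manager side (the `shutdownTracker` type and its rule are assumptions for illustration, not actual worker-manager code):

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownTracker remembers how many workers in a pool have already been told
// to shut down, so that answering "yes" to everyone at once cannot drop the
// pool below minCapacity.
type shutdownTracker struct {
	mu           sync.Mutex
	running      int // workers currently running in the pool
	pendingStops int // workers already told to shut down but not yet gone
	minCapacity  int
}

// mayShutDown answers a single worker's "can I terminate?" question.
// It reserves a shutdown slot only if the pool stays at or above minCapacity
// once every pending shutdown completes.
func (t *shutdownTracker) mayShutDown() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.running-t.pendingStops-1 >= t.minCapacity {
		t.pendingStops++
		return true
	}
	return false
}

// workerGone is called when a worker actually terminates.
func (t *shutdownTracker) workerGone() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.running--
	if t.pendingStops > 0 {
		t.pendingStops--
	}
}

func main() {
	// Ten idle workers all ask at once; with minCapacity = 3, only seven get "yes".
	t := &shutdownTracker{running: 10, minCapacity: 3}
	granted := 0
	for i := 0; i < 10; i++ {
		if t.mayShutDown() {
			granted++
		}
	}
	fmt.Println("workers allowed to shut down:", granted) // prints 7
}
```

With a rule like this, even if every idle worker asks at the same moment, only as many get a "yes" as the pool can lose without dropping below minCapacity.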


## When `minCapacity` is not yet met

Here, `worker-manager` should increase `afterIdleSeconds` to a much higher value (e.g.:
Contributor

"tweaking" idle seconds could be a good solution for existing workers that are unlikely to get upgraded soon.

There should be some upper limit of what worker-manager could set for such workers to balance minCapacity promise and avoiding additional costs from running them for too long

Author

To me, if we set minCapacity then we expect to have that number of workers available at any time of the day, so I would expect these workers to keep running. If some worker types shouldn't run for too long, I wonder whether we should set minCapacity to 0 instead.

Contributor

That's true! Indeed, if someone sets a minimum value, they probably know what they're doing; no need to restrict it.

Contributor

Short term, I think this isn't a problem.

Long term, something probabilistic could work: p(no) = 0.5 if exactly at minCapacity, lower if above minCapacity.
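
For illustration, one way to compute that probability; the p(no) = 0.5 anchor at minCapacity comes from the suggestion above, while the decay curve and the always-"no" case below minCapacity are assumptions:

```go
package main

import (
	"fmt"
	"math/rand"
)

// probabilityOfNo returns the probability that worker-manager answers
// "no, keep running" when a worker asks whether it may shut down.
func probabilityOfNo(running, minCapacity int) float64 {
	if running < minCapacity {
		// Assumption: below the promised capacity, never encourage a shutdown.
		return 1.0
	}
	if running == minCapacity {
		return 0.5 // the anchor point suggested above
	}
	// Illustrative decay only: the further above minCapacity, the likelier a "yes".
	return 0.5 * float64(minCapacity) / float64(running)
}

func main() {
	minCapacity := 3
	for _, running := range []int{2, 3, 6, 12} {
		p := probabilityOfNo(running, minCapacity)
		keepRunning := rand.Float64() < p
		fmt.Printf("running=%d p(no)=%.2f sampled answer: keep running=%v\n", running, p, keepRunning)
	}
}
```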

Author

I wonder if we need a probabilistic model at this stage. To me, there are already several random factors impacting workers (e.g. spot instance termination, network issues), so having a deterministic model here may help diagnose what goes wrong with the implementation. I may be missing something in my understanding of the problem, though.

@JohanLorenzo
Author

@lotas 👋 Do you see anything more we should discuss? If not, is it okay to open up the discussion in the next community meeting?

@lotas
Contributor

lotas commented Jul 22, 2024

> @lotas 👋 Do you see anything more we should discuss? If not, is it okay to open up the discussion in the next community meeting?

Please bring it up during the next community meeting, that would be best! Thanks
