
RFC#0192 - Ensure workers do not get unnecessarily killed #192

Open · wants to merge 1 commit into base: main
Conversation

JohanLorenzo

No description provided.

JohanLorenzo requested a review from a team as a code owner on June 20, 2024.
JohanLorenzo requested reviews from lotas, petemoore and matt-boris, and removed the request for a team, on June 20, 2024.
JohanLorenzo changed the title from "RFC#0192 - ensures workers do not get unnecessarily killed" to "RFC#0192 - Ensure workers do not get unnecessarily killed" on June 20, 2024.

### What if we deploy new worker images?

Long-lived workers will have to be killed if there's a change in their config, including
Contributor

This goes along with the #191 proposal quite well.
With the introduction of launch configurations, workers would be able to query worker-manager periodically to check whether their config is still active, and restart or shut down if it is not.

We could probably create a more generic endpoint that answers whether a worker should terminate:

- yes, if the config has changed
- no, if minCapacity would no longer be observed

However, I'm not totally sure how to solve the case where all workers ask at once whether they can shut down and all get "yes", only to leave zero running workers afterwards.
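
A minimal sketch of what that worker-side check might look like, assuming a hypothetical should-terminate endpoint on worker-manager (the URL shape, response fields, and `pollShouldTerminate` helper are illustrative only, not an existing API):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// terminationAdvice mirrors a hypothetical worker-manager response:
// Terminate is true when the launch configuration changed, and false when
// shutting down would break the minCapacity promise.
type terminationAdvice struct {
	Terminate bool   `json:"terminate"`
	Reason    string `json:"reason"`
}

// pollShouldTerminate periodically asks worker-manager whether this worker
// should shut down, and returns once the answer is "yes".
func pollShouldTerminate(baseURL, workerPoolID, workerID string, interval time.Duration) {
	url := fmt.Sprintf("%s/worker-pool/%s/worker/%s/should-terminate", baseURL, workerPoolID, workerID)
	for {
		resp, err := http.Get(url)
		if err != nil {
			// Transient network error: err on the side of keeping the worker alive.
			time.Sleep(interval)
			continue
		}
		var advice terminationAdvice
		decodeErr := json.NewDecoder(resp.Body).Decode(&advice)
		resp.Body.Close()
		if decodeErr == nil && advice.Terminate {
			return // caller drains the current task and shuts the worker down
		}
		time.Sleep(interval)
	}
}

func main() {
	// Illustrative values only; a real worker would take these from its config.
	pollShouldTerminate("https://worker-manager.example.com", "my-provisioner/my-worker-type", "worker-123", 5*time.Minute)
	fmt.Println("worker-manager says this worker should terminate")
}
```

On transient errors the worker keeps running and retries, which errs on the side of not killing workers, in line with the goal of this RFC.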

Author

> This goes along with the #191 proposal quite well.

Nice! 👍

> However, I'm not totally sure how to solve the case where all workers ask at once whether they can shut down and all get "yes", only to leave zero running workers afterwards.

I guess that's an okay behavior, at least for a first iteration. Maybe worker-manager could remember how many workers are being shut down so it doesn't tell more workers to turn off.
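
A rough sketch of the bookkeeping that idea could imply on the worker-manager side (the `shutdownTracker` type and its rule are assumptions for illustration, not actual worker-manager code):

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownTracker remembers how many workers in a pool have already been told
// to shut down, so that answering "yes" to everyone at once cannot drop the
// pool below minCapacity.
type shutdownTracker struct {
	mu           sync.Mutex
	running      int // workers currently running in the pool
	pendingStops int // workers already told to shut down but not yet gone
	minCapacity  int
}

// mayShutDown answers a single worker's "can I terminate?" question.
// It reserves a shutdown slot only if the pool stays at or above minCapacity
// once every pending shutdown completes.
func (t *shutdownTracker) mayShutDown() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.running-t.pendingStops-1 >= t.minCapacity {
		t.pendingStops++
		return true
	}
	return false
}

// workerGone is called when a worker actually terminates.
func (t *shutdownTracker) workerGone() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.running--
	if t.pendingStops > 0 {
		t.pendingStops--
	}
}

func main() {
	// Ten idle workers all ask at once; with minCapacity = 3, only seven get "yes".
	t := &shutdownTracker{running: 10, minCapacity: 3}
	granted := 0
	for i := 0; i < 10; i++ {
		if t.mayShutDown() {
			granted++
		}
	}
	fmt.Println("workers allowed to shut down:", granted) // prints 7
}
```

With a rule like this, even if every idle worker asks at the same moment, only as many get a "yes" as the pool can lose without dropping below minCapacity.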


## When `minCapacity` is not yet met

Here, `worker-manager` should increase `afterIdleSeconds` to a much higher value (e.g.:
Contributor

"tweaking" idle seconds could be a good solution for existing workers that are unlikely to get upgraded soon.

There should be some upper limit of what worker-manager could set for such workers to balance minCapacity promise and avoiding additional costs from running them for too long

Author

To me, if we set minCapacity then we expect to have that number of workers available at any time of the day, so I would expect these workers to keep running. If some worker types shouldn't run for too long, I wonder whether we should set minCapacity to 0 instead.

Contributor

That's true! Indeed, if someone sets a minimum value, they probably know what they're doing; no need to restrict it.

Contributor

Short term, I think this isn't a problem.

Long term, something probabilistic could work: p(no) = 0.5 if exactly at minCapacity, lower if above minCapacity.
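
For illustration, one way to compute that probability; the p(no) = 0.5 anchor at minCapacity comes from the suggestion above, while the decay curve and the always-"no" case below minCapacity are assumptions:

```go
package main

import (
	"fmt"
	"math/rand"
)

// probabilityOfNo returns the probability that worker-manager answers
// "no, keep running" when a worker asks whether it may shut down.
func probabilityOfNo(running, minCapacity int) float64 {
	if running < minCapacity {
		// Assumption: below the promised capacity, never encourage a shutdown.
		return 1.0
	}
	if running == minCapacity {
		return 0.5 // the anchor point suggested above
	}
	// Illustrative decay only: the further above minCapacity, the likelier a "yes".
	return 0.5 * float64(minCapacity) / float64(running)
}

func main() {
	minCapacity := 3
	for _, running := range []int{2, 3, 6, 12} {
		p := probabilityOfNo(running, minCapacity)
		keepRunning := rand.Float64() < p
		fmt.Printf("running=%d p(no)=%.2f sampled answer: keep running=%v\n", running, p, keepRunning)
	}
}
```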

Author

I wonder if we need a probabilistic model at this stage. To me, there are already several random factors impacting workers (e.g. spot instance termination, network issues), so having a deterministic model here may help diagnose what goes wrong with the implementation. I may be missing something in my understanding of the problem, though.

@JohanLorenzo
Author

@lotas 👋 Do you see anything more we should discuss? If not, is it okay to open up the discussion in the next community meeting?

@lotas
Contributor

lotas commented Jul 22, 2024

> @lotas 👋 Do you see anything more we should discuss? If not, is it okay to open up the discussion in the next community meeting?

Please bring it up during the next community meeting, that would be best! Thanks
