
🎨Autoscaling: Add option to allow a new node to join a cluster directly active (🚨 ⚠️ DEVOPS) #6334

Merged


sanderegg
Member

@sanderegg sanderegg commented Sep 10, 2024

What do these changes do?

This PR adds an option for a new node to join the swarm directly as active, which seems to be more time-efficient.
Since this still needs to be tested under real conditions, the option is not enabled by default for now.

Adds the AUTOSCALING_DOCKER_JOIN_DRAINED ENV var, which makes it possible for a new node to join directly as an active node, potentially speeding up docker swarm mechanics (to be tested).
Adds the AUTOSCALING_WAIT_FOR_CLOUD_INIT_BEFORE_WARM_BUFFER_ACTIVATION ENV var, which makes it possible to skip waiting for warm buffer cloud initialization (in direct testing this brought a 20% speed-up).
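
As an illustration only, here is a minimal sketch of how two boolean ENV vars like these could be read. The helper names (`_env_bool`, `AutoscalingFlags`, `load_flags`), the accepted string values and both defaults are assumptions made for the example; they are not taken from the actual settings code in this PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical sets of accepted spellings (assumption, not from the PR).
_TRUTHY = {"1", "true", "yes", "on"}
_FALSY = {"0", "false", "no", "off"}


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean ENV var, falling back to `default` when unset or empty."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    raise ValueError(f"{name} must be a boolean-like value, got {raw!r}")


@dataclass(frozen=True)
class AutoscalingFlags:
    docker_join_drained: bool
    wait_for_cloud_init_before_warm_buffer_activation: bool


def load_flags(env: Optional[Mapping[str, str]] = None) -> AutoscalingFlags:
    """Read both flags from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env
    return AutoscalingFlags(
        # Assumed default: keep joining as drained unless explicitly disabled.
        docker_join_drained=_env_bool(env, "AUTOSCALING_DOCKER_JOIN_DRAINED", True),
        # Assumed default: do not wait for cloud-init unless explicitly enabled.
        wait_for_cloud_init_before_warm_buffer_activation=_env_bool(
            env,
            "AUTOSCALING_WAIT_FOR_CLOUD_INIT_BEFORE_WARM_BUFFER_ACTIVATION",
            False,
        ),
    )
```

Passing a plain dict instead of the process environment, e.g. `load_flags({"AUTOSCALING_DOCKER_JOIN_DRAINED": "false"})`, makes the behaviour easy to check in a test.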

Related issue/s

How to test

Dev-ops checklist

@sanderegg sanderegg added a:infra+ops maintenance of infrastructure or operations (discussed in retro) a:autoscaling autoscaling service in simcore's stack labels Sep 10, 2024
@sanderegg sanderegg added this to the Eisbock milestone Sep 10, 2024
@sanderegg sanderegg self-assigned this Sep 10, 2024

codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 63.63636% with 8 lines in your changes missing coverage. Please review.

Project coverage is 88.1%. Comparing base (cafbf96) to head (815b6a7).
Report is 521 commits behind head on master.

Files with missing lines Patch % Lines
...e_service_autoscaling/modules/auto_scaling_core.py 33.3% 8 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff            @@
##           master   #6334      +/-   ##
=========================================
+ Coverage    84.5%   88.1%    +3.5%     
=========================================
  Files          10    1499    +1489     
  Lines         214   62062   +61848     
  Branches       25    2065    +2040     
=========================================
+ Hits          181   54707   +54526     
- Misses         23    7039    +7016     
- Partials       10     316     +306     
Flag Coverage Δ
integrationtests 63.4% <ø> (?)
unittests 86.2% <63.6%> (+1.6%) ⬆️


Files with missing lines Coverage Δ
...aling/src/simcore_service_autoscaling/constants.py 100.0% <100.0%> (ø)
...g/src/simcore_service_autoscaling/core/settings.py 100.0% <100.0%> (ø)
...e_autoscaling/modules/buffer_machines_pool_core.py 88.4% <100.0%> (ø)
...ore_service_autoscaling/utils/auto_scaling_core.py 93.6% <100.0%> (ø)
.../simcore_service_autoscaling/utils/utils_docker.py 100.0% <100.0%> (ø)
...e_service_autoscaling/modules/auto_scaling_core.py 91.3% <33.3%> (ø)

... and 1442 files with indirect coverage changes

@sanderegg sanderegg marked this pull request as ready for review September 10, 2024 09:30
Member

@pcrespov pcrespov left a comment


thx

Q: I understand that this PR introduces the ability to "reuse" nodes that are currently active. In cases where nodes have been previously utilized, is there a risk that residual data or processes from prior services could impact the new services? Alternatively, can you confirm that the machine will operate in a "clean" state, as if newly provisioned?

@sanderegg

No. When a machine is created, it runs docker swarm join --availability=drain to connect to the swarm.
This PR makes it possible to set the --availability flag to "active" or "drain", depending on whether AUTOSCALING_DOCKER_JOIN_DRAINED is False or True, respectively.
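
To make the mapping concrete, here is a minimal sketch of how the join command could be assembled from such a boolean. The function name `build_swarm_join_command` and its parameters are hypothetical and do not come from the PR; only the `docker swarm join` flags `--availability` and `--token` are real CLI options.

```python
from typing import List


def build_swarm_join_command(
    token: str, manager_addr: str, *, join_drained: bool
) -> List[str]:
    """Build the argv for `docker swarm join` (illustrative helper, not PR code).

    join_drained=True  -> the node joins with --availability=drain
    join_drained=False -> the node joins with --availability=active
    """
    availability = "drain" if join_drained else "active"
    return [
        "docker",
        "swarm",
        "join",
        f"--availability={availability}",
        "--token",
        token,
        manager_addr,
    ]
```

With `join_drained=False`, the resulting command contains `--availability=active`, so the node can accept tasks as soon as it has joined; with `join_drained=True` it has to be switched to active afterwards.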

Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment


👍

@sanderegg sanderegg force-pushed the autoscaling/allow-join-undrained branch from 361906e to b27a1de Compare September 10, 2024 10:33
@sanderegg sanderegg changed the title 🎨Autoscaling: Add option to allow a new node to join a cluster directly active 🎨Autoscaling: Add option to allow a new node to join a cluster directly active (🚨) Sep 10, 2024
@sanderegg sanderegg force-pushed the autoscaling/allow-join-undrained branch 2 times, most recently from c33d7de to 815b6a7 Compare September 10, 2024 12:50
@sanderegg sanderegg force-pushed the autoscaling/allow-join-undrained branch from 815b6a7 to 73c884a Compare September 10, 2024 13:42
@sanderegg sanderegg merged commit 1320b69 into ITISFoundation:master Sep 10, 2024
@sanderegg sanderegg deleted the autoscaling/allow-join-undrained branch September 10, 2024 13:42

@sanderegg sanderegg changed the title 🎨Autoscaling: Add option to allow a new node to join a cluster directly active (🚨) 🎨Autoscaling: Add option to allow a new node to join a cluster directly active (🚨 ⚠️ DEVOPS) Sep 17, 2024
Labels
a:autoscaling autoscaling service in simcore's stack a:infra+ops maintenance of infrastructure or operations (discussed in retro)

5 participants