
(⚠️ devops) ♻️✨Adding distributed locking to throttle concurrent saves on nodes #3160

Merged

merged 47 commits into ITISFoundation:master from pr-osparc-batch-save on Jul 12, 2022

Conversation

@GitHK GitHK (Contributor) commented Jul 1, 2022

⚠️ devops

Redis is now used by the director-v2. Please check that the following env vars are passed to the director-v2:

  • REDIS_HOST
  • REDIS_PORT
  • REDIS_USER
  • REDIS_PASSWORD

Only on AWS STAGING, set DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED=true

What do these changes do?

When a user runs many services on a low-resource machine, that machine can get overloaded when the study is closed.
Usually iowait grows very fast and hits 100%; at that point the machine starts having issues.

The idea of this PR is to lower the resource pressure when services are closed.

To achieve this (a sketch follows this list):

  • for each node, a pool of slots is assigned
  • when the director-v2 saves a service's state, it tries to take a slot from the pool
  • if it can take a slot, the slot is held for the entire save operation
    • if the process holding the locks dies, the locks expire after a set amount of time and can be reused
  • if the pool is full, the director-v2 will try to save the state again at the next observation cycle
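
A minimal sketch of this flow, assuming redis-py's asyncio client; the helper name acquire_node_slot, the key format, and the constants are illustrative, not the PR's actual code:

    # Illustrative sketch (not the PR's actual code) of the per-node slot pool,
    # assuming redis-py's asyncio client.
    from typing import Optional

    from redis.asyncio import Redis
    from redis.asyncio.lock import Lock

    CONCURRENT_SAVES = 2    # pool size per docker node (hypothetical constant)
    LOCK_TIMEOUT_S = 10.0   # slot auto-expires if the holder dies

    async def acquire_node_slot(redis: Redis, docker_node_id: str) -> Optional[Lock]:
        """Try each slot of the node's pool; return the acquired lock or None."""
        for slot in range(CONCURRENT_SAVES):
            lock = redis.lock(f"{docker_node_id}.slot_{slot}", timeout=LOCK_TIMEOUT_S)
            if await lock.acquire(blocking=False):
                return lock  # hold this for the entire save operation
        return None  # pool is full -> retry at the next observation cycle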

Added environment variables (see the settings sketch below):

  • DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED (default False): enables/disables the feature
  • DYNAMIC_SIDECAR_DOCKER_NODE_CONCURRENT_SAVES (default 2): maximum concurrent saves per node
  • DYNAMIC_SIDECAR_DOCKER_NODE_SAVES_LOCK_TIMEOUT_S (default 10.0): Redis lock extension timeout; when the timeout expires the lock is released
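
For illustration only, these variables could be grouped into a pydantic (v1-style) settings class like the sketch below; the class name is hypothetical, and the real definitions live in the director-v2 settings module:

    # Hedged sketch: pydantic settings mirroring the env vars above.
    from pydantic import BaseSettings, NonNegativeFloat, PositiveInt

    class NodeSavesSettings(BaseSettings):  # hypothetical class name
        DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED: bool = False
        DYNAMIC_SIDECAR_DOCKER_NODE_CONCURRENT_SAVES: PositiveInt = 2
        DYNAMIC_SIDECAR_DOCKER_NODE_SAVES_LOCK_TIMEOUT_S: NonNegativeFloat = 10.0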

Related issue/s

How to test

Checklist

  • Unit tests for the changes exist
  • Runs in the swarm

@GitHK GitHK self-assigned this Jul 1, 2022
@codecov

codecov bot commented Jul 1, 2022

Codecov Report

Merging #3160 (309e3ea) into master (75afd7f) will increase coverage by 0.0%.
The diff coverage is 90.9%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #3160    +/-   ##
=======================================
  Coverage    81.8%   81.8%            
=======================================
  Files         723     724     +1     
  Lines       31006   31148   +142     
  Branches     4013    4024    +11     
=======================================
+ Hits        25376   25493   +117     
- Misses       4813    4835    +22     
- Partials      817     820     +3     
Flag Coverage Δ
integrationtests 66.4% <88.1%> (+0.1%) ⬆️
unittests 78.3% <68.3%> (-0.1%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ules/dynamic_sidecar/docker_service_specs/proxy.py 100.0% <ø> (ø)
...tor_v2/modules/dynamic_sidecar/scheduler/events.py 90.4% <76.7%> (-2.0%) ⬇️
...simcore_service_director_v2/modules/node_rights.py 97.2% <97.2%> (ø)
...rc/simcore_service_director_v2/core/application.py 90.4% <100.0%> (+0.1%) ⬆️
...-v2/src/simcore_service_director_v2/core/errors.py 76.9% <100.0%> (+0.4%) ⬆️
...2/src/simcore_service_director_v2/core/settings.py 96.5% <100.0%> (+<0.1%) ⬆️
...or_v2/models/schemas/dynamic_services/scheduler.py 98.4% <100.0%> (+<0.1%) ⬆️
...ector_v2/modules/comp_scheduler/background_task.py 83.3% <0.0%> (-8.4%) ⬇️
...imcore_service_webserver/garbage_collector_core.py 66.2% <0.0%> (-3.1%) ⬇️
.../simcore_service_catalog/db/repositories/groups.py 72.9% <0.0%> (-2.8%) ⬇️
... and 7 more

@GitHK GitHK changed the title from "♻️✨Adding distributed locking to throttle concurrent saves on nodes" to "(⚠️ devops) ♻️✨Adding distributed locking to throttle concurrent saves on nodes" Jul 1, 2022
@GitHK GitHK marked this pull request as ready for review July 1, 2022 16:15
@GitHK GitHK requested review from sanderegg and pcrespov as code owners July 1, 2022 16:15
@pcrespov pcrespov added this to the Diolkos milestone Jul 3, 2022
@pcrespov pcrespov added the a:director-v2 issue related with the director-v2 service label Jul 3, 2022


@dataclass
class RedisLockManager:
Member

Please, let's review naming ...

Answer these questions:

  1. what is the purpose of this class?
  2. what elements of this class must be hidden to the user?

E.g.

  1. Creates some sort of resource allocation system for each cluster node based on "slots". As any resource, "slots" are limited and need to be allocated/acquired when used and de-allocated/released when not needed anymore.

  2. does the user need to know that redis is behind? Redis is the tool we use. Its interface is already wrapped in the redis attribute of this class. Is there a need to call the class Redis ... ?

Based on the drafted responses, I would rather call this class SlotsManager. Now, if this is still not descriptive enough, it is probably because Slots is still too vague as a term.

Contributor Author

Went with SlotsManager, even though it still lacks something.


async def _extend_task(self, lock: Lock) -> None:
    while True:
        await asyncio.sleep(self.lock_extend_interval)
Member

Sorry, I do not totally understand this tuning to solve deadlocks. Moreover, what values guarantee that a deadlock will not occur?

Contributor Author

It could happen that the director-v2 acquires all the locks and is then killed. At that point nothing will be able to release them anymore.
If the locks are not released, the system will no longer be able to save any data.

To allow the locks to be released, they must be acquired with a timeout, and a task that extends their timeout at regular intervals must be created (see the sketch below).
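
A hedged sketch of this pattern, assuming redis-py's asyncio Lock; the helper names are illustrative:

    # Hedged sketch of acquire-with-timeout plus a periodic extension task,
    # assuming redis-py's asyncio Lock. Helper names are illustrative.
    import asyncio
    import contextlib

    from redis.asyncio.lock import Lock

    LOCK_TIMEOUT_S = 10.0                   # lock expires if not extended in time
    EXTEND_INTERVAL_S = LOCK_TIMEOUT_S / 2  # extend well before the TTL runs out

    async def _extend_forever(lock: Lock) -> None:
        while True:
            await asyncio.sleep(EXTEND_INTERVAL_S)
            await lock.reacquire()  # resets the TTL back to the lock's timeout

    async def hold_while_saving(lock: Lock, save_operation) -> None:
        extend_task = asyncio.create_task(_extend_forever(lock))
        try:
            await save_operation()  # the long-running state save
        finally:
            extend_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await extend_task
            await lock.release()
        # if this process dies instead, the TTL expires and the slot is reusable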

await lock.release()
assert await lock.locked() is False
# NOTE: apparently it mirrors the first lock instance!
assert await second_lock.locked() is False
Member

These notes suggest that you are getting results you did not expect. Please check with @sanderegg how this library works. He has some experience with it.

@GitHK GitHK (Contributor Author) Jul 4, 2022

Yes, this was unexpected and caused a great deal of confusion.
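
A small sketch of the behaviour in question, assuming redis-py's asyncio client: locked() reports whether the lock key exists in Redis at all, so two Lock instances created with the same name mirror each other, while owned() is specific to the instance holding the token. The Redis URL and lock name below are placeholders.

    # Sketch of the mirroring behaviour, assuming redis-py's asyncio client.
    import asyncio

    from redis.asyncio import Redis

    async def main() -> None:
        client = Redis.from_url("redis://localhost:6379")  # placeholder URL
        first = client.lock("node-slot-0", timeout=10)
        second = client.lock("node-slot-0", timeout=10)

        await first.acquire()
        assert await second.locked()      # same key -> also reports locked
        assert not await second.owned()   # but this instance does not own it

        await first.release()
        assert await first.locked() is False
        assert await second.locked() is False  # mirrors the first lock instance

        await client.close()

    asyncio.run(main())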

@sanderegg sanderegg (Member) left a comment

So while reviewing I am wondering why the locking mechanism is in the director-v2. Since you are already implementing a long-running operations mode anyway, would it not make more sense to do the following:

Say we have node XX with 10 dy-sidecars:

  • they all know the hostname because it is defined directly using the docker swarm auto-filled IDs
  • they connect to redis on their own
  • we can discuss that

Then I have secondary questions:

  • Did you see the same effect on master where RClone is disabled?
  • Can you properly measure that this is going to effectively reduce the pressure? Or are we building an artificial bottleneck?
  • What about the download of data when the services are starting?

@@ -93,5 +93,9 @@ async def redis_locks_client(
async def wait_till_redis_responsive(redis_url: Union[URL, str]) -> None:
    client = from_url(f"{redis_url}", encoding="utf-8", decode_responses=True)

    if not await client.ping():
        raise ConnectionError(f"{redis_url=} not available")
    try:
Member

Why do you need to flushall here? it is just a ping...
would it not make more sense to separate all this, or even use the sync client in that case? just in a context manager...

Contributor Author

No need to flushall, you are correct. The issue is with the close_connection_pool argument, which is not supplied when using the context manager.

Member

So just use the synchronous client; we do not need async here.
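
A minimal sketch of what the reviewer suggests, assuming redis-py's synchronous client; names are illustrative:

    # Minimal sketch of the reviewer's suggestion: a synchronous responsiveness
    # check, assuming redis-py's sync client. Names are illustrative.
    import redis

    def wait_till_redis_responsive(redis_url: str) -> None:
        client = redis.from_url(redis_url, encoding="utf-8", decode_responses=True)
        try:
            if not client.ping():
                raise ConnectionError(f"{redis_url=} not available")
        finally:
            client.connection_pool.disconnect()  # no dangling connections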

logger.warning("%s", f"{err}")

logger.info("Ports data pushed by dynamic-sidecar")
await dynamic_sidecar_client.begin_service_destruction(
Member

Is this a long running operation?
I just realised that the lock is owned by the director-v2?? Why is that?
Why not the dynamic-sidecar directly? Especially since it knows on which hostname it lies, the locking would be much simpler, right?

Contributor Author

is this a long running operation?

Yes, there are two very long running operations and several shorter, but still lengthy, ones.

I just realised that the lock is owned by the director-v2?? why is that?

This is the correct place to lock since:

  • the entire block needs to sit under a lock
  • the dynamic scheduler knows best when to save the state. I would not push this responsibility onto each individual sidecar; the current design gives as little responsibility as possible to the dynamic-sidecar.
  • to correctly check, acquire and release the lock you need to use the same instance

Why not the dynamic-sidecar directly? Especially since it knows on which hostname it lies, the locking would be much simpler, right?

  • locking here is more complex to deal with, since the lock needs to be used across different APIs
  • you always need to use the initial instance of the lock to acquire/release it (it must be kept in the app's state); see the sketch below
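
For illustration, a hedged sketch of how the scheduler-side block could look; slots_manager.acquire_node_slot, sidecar_client.save_service_state, and the surrounding names are hypothetical:

    # Hedged sketch of the director-v2 side: the whole save block sits under the
    # node lock, and the scheduler retries on its next observation cycle if the
    # pool is full. All names here are hypothetical.
    async def attempt_state_save(slots_manager, docker_node_id: str, sidecar_client) -> bool:
        lock = await slots_manager.acquire_node_slot(docker_node_id)
        if lock is None:
            return False  # pool full -> retry at the next observation cycle
        try:
            # the same lock instance is kept for the entire block and released here
            await sidecar_client.save_service_state()
        finally:
            await lock.release()
        return True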

Member

I still think that locking in the dy-sidecar would be way simpler, and it would also prevent the director-v2 from needing so much logic.
Just look at real life: the director never knows exactly what the employee is doing; the employee knows best when to lock his drawers.
Well, anyway, this should be moved to the new design when and if it comes.

@sanderegg sanderegg (Member) left a comment

Ok, so we can disable that thing in case it goes south.

Nevertheless, I still think it is wrong that the director-v2 takes all these responsibilities. This should all be shifted to the sidecar (same with docker-compose up/down, etc...). It would simplify everything by a lot. Let's keep this in mind for later refactoring.


@pcrespov pcrespov (Member) left a comment

some doc

@GitHK GitHK requested a review from pcrespov July 12, 2022 09:19
@pcrespov pcrespov (Member) left a comment

Pair reviewed.
Suggest changing "slotsmanager" to something that suggests "acquire the rights to use a docker swarm node".

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 1 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.3%

@GitHK GitHK merged commit 44b5f4e into ITISFoundation:master Jul 12, 2022
@GitHK GitHK deleted the pr-osparc-batch-save branch July 12, 2022 13:26
@GitHK GitHK requested review from mrnicegyu11 and Surfict July 13, 2022 07:05
Labels
a:director-v2 issue related with the director-v2 service