
(⚠️ devops) ♻️✨Adding distributed locking to throttle concurrent saves on nodes #3160

Merged

merged 47 commits into ITISFoundation:master from pr-osparc-batch-save on Jul 12, 2022

Conversation

@GitHK GitHK (Contributor) commented Jul 1, 2022

⚠️ devops

Redis is now used by the director-v2. Please check that the following env vars are passed to the director-v2:

  • REDIS_HOST
  • REDIS_PORT
  • REDIS_USER
  • REDIS_PASSWORD

Only on AWS STAGING, set DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED=true

What do these changes do?

When a user runs many services on a low-resource machine, that machine can get overloaded when the study is closed.
Usually iowait grows very fast and hits 100%; at that point the machine starts having issues.

The idea of this PR is to lower the resource pressure when services are closed.

To achieve this (a sketch follows this list):

  • for each node, a pool of slots is assigned
  • when the director-v2 saves a service's state, it tries to take a slot from the pool
  • if it can take a slot, the slot is held for the entire save operation
    • if the process holding the locks dies, the locks expire after a set amount of time and can be reused
  • if the pool is full, the director-v2 will try to save the state again at the next observation cycle
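
A minimal sketch of this flow, assuming redis-py's asyncio client; the helper name acquire_node_slot, the key format, and the constants are illustrative, not the PR's actual code:

    # Illustrative sketch (not the PR's actual code) of the per-node slot pool,
    # assuming redis-py's asyncio client.
    from typing import Optional

    from redis.asyncio import Redis
    from redis.asyncio.lock import Lock

    CONCURRENT_SAVES = 2    # pool size per docker node (hypothetical constant)
    LOCK_TIMEOUT_S = 10.0   # slot auto-expires if the holder dies

    async def acquire_node_slot(redis: Redis, docker_node_id: str) -> Optional[Lock]:
        """Try each slot of the node's pool; return the acquired lock or None."""
        for slot in range(CONCURRENT_SAVES):
            lock = redis.lock(f"{docker_node_id}.slot_{slot}", timeout=LOCK_TIMEOUT_S)
            if await lock.acquire(blocking=False):
                return lock  # hold this for the entire save operation
        return None  # pool is full -> retry at the next observation cycle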

Added environment variables (see the settings sketch below):

  • DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED (default False): enables/disables the feature
  • DYNAMIC_SIDECAR_DOCKER_NODE_CONCURRENT_SAVES (default 2): maximum concurrent saves per node
  • DYNAMIC_SIDECAR_DOCKER_NODE_SAVES_LOCK_TIMEOUT_S (default 10.0): Redis lock extension timeout; when the timeout expires the lock is released
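
For illustration only, these variables could be grouped into a pydantic (v1-style) settings class like the sketch below; the class name is hypothetical, and the real definitions live in the director-v2 settings module:

    # Hedged sketch: pydantic settings mirroring the env vars above.
    from pydantic import BaseSettings, NonNegativeFloat, PositiveInt

    class NodeSavesSettings(BaseSettings):  # hypothetical class name
        DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED: bool = False
        DYNAMIC_SIDECAR_DOCKER_NODE_CONCURRENT_SAVES: PositiveInt = 2
        DYNAMIC_SIDECAR_DOCKER_NODE_SAVES_LOCK_TIMEOUT_S: NonNegativeFloat = 10.0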

Related issue/s

How to test

Checklist

  • Unit tests for the changes exist
  • Runs in the swarm

@GitHK GitHK self-assigned this Jul 1, 2022
@codecov

codecov bot commented Jul 1, 2022

Codecov Report

Merging #3160 (309e3ea) into master (75afd7f) will increase coverage by 0.0%.
The diff coverage is 90.9%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #3160    +/-   ##
=======================================
  Coverage    81.8%   81.8%            
=======================================
  Files         723     724     +1     
  Lines       31006   31148   +142     
  Branches     4013    4024    +11     
=======================================
+ Hits        25376   25493   +117     
- Misses       4813    4835    +22     
- Partials      817     820     +3     
Flag Coverage Δ
integrationtests 66.4% <88.1%> (+0.1%) ⬆️
unittests 78.3% <68.3%> (-0.1%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ules/dynamic_sidecar/docker_service_specs/proxy.py 100.0% <ø> (ø)
...tor_v2/modules/dynamic_sidecar/scheduler/events.py 90.4% <76.7%> (-2.0%) ⬇️
...simcore_service_director_v2/modules/node_rights.py 97.2% <97.2%> (ø)
...rc/simcore_service_director_v2/core/application.py 90.4% <100.0%> (+0.1%) ⬆️
...-v2/src/simcore_service_director_v2/core/errors.py 76.9% <100.0%> (+0.4%) ⬆️
...2/src/simcore_service_director_v2/core/settings.py 96.5% <100.0%> (+<0.1%) ⬆️
...or_v2/models/schemas/dynamic_services/scheduler.py 98.4% <100.0%> (+<0.1%) ⬆️
...ector_v2/modules/comp_scheduler/background_task.py 83.3% <0.0%> (-8.4%) ⬇️
...imcore_service_webserver/garbage_collector_core.py 66.2% <0.0%> (-3.1%) ⬇️
.../simcore_service_catalog/db/repositories/groups.py 72.9% <0.0%> (-2.8%) ⬇️
... and 7 more

@GitHK GitHK changed the title from "♻️✨Adding distributed locking to throttle concurrent saves on nodes" to "(⚠️ devops) ♻️✨Adding distributed locking to throttle concurrent saves on nodes" Jul 1, 2022
@GitHK GitHK marked this pull request as ready for review July 1, 2022 16:15
@GitHK GitHK requested review from sanderegg and pcrespov as code owners July 1, 2022 16:15
@pcrespov pcrespov added this to the Diolkos milestone Jul 3, 2022
@pcrespov pcrespov added the a:director-v2 issue related with the director-v2 service label Jul 3, 2022


@dataclass
class RedisLockManager:
Member

Please, let's review naming ...

Answer these questions:

  1. what is the purpose of this class?
  2. what elements of this class must be hidden to the user?

E.g.

  1. Creates some sort of resource allocation system for each cluster node based on "slots". As any resource, "slots" are limited and need to be allocated/acquired when used and de-allocated/released when not needed anymore.

  2. does the user need to know that redis is behind? Redis is the tool we use. Its interface is already wrapped in the redis attribute of this class. Is there a need to call the class Redis ... ?

Based on the drafted responses, I would rather call this class SlotsManager. Now, if this is still not descriptive enough, it is probably because Slots is still too vague as a term.

Contributor Author

Went with SlotsManager, even though it still lacks something.


async def _extend_task(self, lock: Lock) -> None:
    while True:
        await asyncio.sleep(self.lock_extend_interval)
Member

Sorry, I do not totally understand this tuning to solve deadlocks. Moreover, what values guarantee that a deadlock will not occur?

Contributor Author

It could happen that the director-v2 acquires all the locks and is then killed. At that point nothing will be able to release them anymore.
If the locks are not released, the system will no longer be able to save any data.

To allow the locks to be released, they must be acquired with a timeout, and a task that extends their timeout at regular intervals must be created (see the sketch below).
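
A hedged sketch of this pattern, assuming redis-py's asyncio Lock; the helper names are illustrative:

    # Hedged sketch of acquire-with-timeout plus a periodic extension task,
    # assuming redis-py's asyncio Lock. Helper names are illustrative.
    import asyncio
    import contextlib

    from redis.asyncio.lock import Lock

    LOCK_TIMEOUT_S = 10.0                   # lock expires if not extended in time
    EXTEND_INTERVAL_S = LOCK_TIMEOUT_S / 2  # extend well before the TTL runs out

    async def _extend_forever(lock: Lock) -> None:
        while True:
            await asyncio.sleep(EXTEND_INTERVAL_S)
            await lock.reacquire()  # resets the TTL back to the lock's timeout

    async def hold_while_saving(lock: Lock, save_operation) -> None:
        extend_task = asyncio.create_task(_extend_forever(lock))
        try:
            await save_operation()  # the long-running state save
        finally:
            extend_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await extend_task
            await lock.release()
        # if this process dies instead, the TTL expires and the slot is reusable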

await lock.release()
assert await lock.locked() is False
# NOTE: apparently it mirrors the first lock instance!
assert await second_lock.locked() is False
Member

These notes suggest that you are getting results you did not expect. Please check with @sanderegg how this library works. He has some experience with it.

@GitHK GitHK (Contributor Author) Jul 4, 2022

Yes, this was unexpected and caused a great deal of confusion.
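
A small sketch of the behaviour in question, assuming redis-py's asyncio client: locked() reports whether the lock key exists in Redis at all, so two Lock instances created with the same name mirror each other, while owned() is specific to the instance holding the token. The Redis URL and lock name below are placeholders.

    # Sketch of the mirroring behaviour, assuming redis-py's asyncio client.
    import asyncio

    from redis.asyncio import Redis

    async def main() -> None:
        client = Redis.from_url("redis://localhost:6379")  # placeholder URL
        first = client.lock("node-slot-0", timeout=10)
        second = client.lock("node-slot-0", timeout=10)

        await first.acquire()
        assert await second.locked()      # same key -> also reports locked
        assert not await second.owned()   # but this instance does not own it

        await first.release()
        assert await first.locked() is False
        assert await second.locked() is False  # mirrors the first lock instance

        await client.close()

    asyncio.run(main())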

@sanderegg sanderegg (Member) left a comment

So while reviewing I am wondering why the locking mechanism is in the director-v2. Since you are already implementing a long-running operations mode anyway, would it not make more sense to do the following:

Say we have node XX with 10 dy-sidecars:

  • they all know the hostname because it is defined directly using the docker swarm auto-filled IDs
  • they connect to redis on their own
  • we can discuss that

Then I have secondary questions:

  • Did you see the same effect on master where RClone is disabled?
  • Can you properly measure that this is going to effectively reduce the pressure? Or are we building an artificial bottleneck?
  • What about the download of data when the services are starting?

@@ -93,5 +93,9 @@ async def redis_locks_client(
async def wait_till_redis_responsive(redis_url: Union[URL, str]) -> None:
    client = from_url(f"{redis_url}", encoding="utf-8", decode_responses=True)

    if not await client.ping():
        raise ConnectionError(f"{redis_url=} not available")
    try:
Member

Why do you need to flushall here? it is just a ping...
would it not make more sense to separate all this, or even use the sync client in that case? just in a context manager...

Contributor Author

No need to flushall, you are correct. The issue is with the close_connection_pool argument, which is not supplied when using the context manager.

Member

So just use the synchronous client; we do not need async here.
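
A minimal sketch of what the reviewer suggests, assuming redis-py's synchronous client; names are illustrative:

    # Minimal sketch of the reviewer's suggestion: a synchronous responsiveness
    # check, assuming redis-py's sync client. Names are illustrative.
    import redis

    def wait_till_redis_responsive(redis_url: str) -> None:
        client = redis.from_url(redis_url, encoding="utf-8", decode_responses=True)
        try:
            if not client.ping():
                raise ConnectionError(f"{redis_url=} not available")
        finally:
            client.connection_pool.disconnect()  # no dangling connections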

logger.warning("%s", f"{err}")

logger.info("Ports data pushed by dynamic-sidecar")
await dynamic_sidecar_client.begin_service_destruction(
Member

Is this a long running operation?
I just realised that the lock is owned by the director-v2?? Why is that?
Why not the dynamic-sidecar directly? Especially since it knows on which hostname it lies, the locking would be much simpler, right?

Contributor Author

is this a long running operation?

Yes, there are two very long running operations and several shorter, but still lengthy, ones.

I just realised that the lock is owned by the director-v2?? why is that?

This is the correct place to lock since:

  • the entire block needs to sit under a lock
  • the dynamic scheduler knows best when to save the state. I would not push this responsibility onto each individual sidecar; the current design gives as little responsibility as possible to the dynamic-sidecar.
  • to correctly check, acquire and release the lock you need to use the same instance

Why not the dynamic-sidecar directly? Especially since it knows on which hostname it lies, the locking would be much simpler, right?

  • locking here is more complex to deal with, since the lock needs to be used across different APIs
  • you always need to use the initial instance of the lock to acquire/release it (it must be kept in the app's state); see the sketch below
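
For illustration, a hedged sketch of how the scheduler-side block could look; slots_manager.acquire_node_slot, sidecar_client.save_service_state, and the surrounding names are hypothetical:

    # Hedged sketch of the director-v2 side: the whole save block sits under the
    # node lock, and the scheduler retries on its next observation cycle if the
    # pool is full. All names here are hypothetical.
    async def attempt_state_save(slots_manager, docker_node_id: str, sidecar_client) -> bool:
        lock = await slots_manager.acquire_node_slot(docker_node_id)
        if lock is None:
            return False  # pool full -> retry at the next observation cycle
        try:
            # the same lock instance is kept for the entire block and released here
            await sidecar_client.save_service_state()
        finally:
            await lock.release()
        return True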

Member

I still think that locking in the dy-sidecar would be way simpler, and it would also prevent the director-v2 from needing so much logic.
Just look at real life: the director never knows exactly what the employee is doing; the employee knows best when to lock his drawers.
Well, anyway, this should be moved to the new design when and if it comes.

@sanderegg sanderegg (Member) left a comment

Ok, so we can disable that thing in case it goes south.

Nevertheless, I still think it is wrong that the director-v2 takes all these responsibilities. This should all be shifted to the sidecar (same with docker-compose up/down, etc...). It would simplify everything by a lot. Let's keep this in mind for later refactoring.


@pcrespov pcrespov (Member) left a comment

some doc

@GitHK GitHK requested a review from pcrespov July 12, 2022 09:19
@pcrespov pcrespov (Member) left a comment

Pair reviewed.
Suggest changing "slotsmanager" to something that suggests "acquire the rights to use a docker swarm node".

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 1 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.3%

@GitHK GitHK merged commit 44b5f4e into ITISFoundation:master Jul 12, 2022
@GitHK GitHK deleted the pr-osparc-batch-save branch July 12, 2022 13:26
@GitHK GitHK requested review from mrnicegyu11 and Surfict July 13, 2022 07:05
Labels
a:director-v2 issue related with the director-v2 service