(⚠️ devops) ♻️✨Adding distributed locking to throttle concurrent saves on nodes #3160
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           master    #3160    +/-   ##
=========================================
  Coverage    81.8%    81.8%
=========================================
  Files         723      724       +1
  Lines       31006    31148     +142
  Branches     4013     4024      +11
=========================================
+ Hits        25376    25493     +117
- Misses       4813     4835      +22
- Partials      817      820       +3
```
Flags with carried forward coverage won't be shown.
```python
@dataclass
class RedisLockManager:
```
Please, let's review naming ...
Answer these questions:
- what is the purpose of this class?
- what elements of this class must be hidden from the user?

E.g.
- It creates some sort of resource allocation system for each cluster node based on "slots". As with any resource, "slots" are limited and need to be allocated/acquired when used and de-allocated/released when not needed anymore.
- Does the user need to know that redis is behind it? Redis is the tool we use. Its interface is already wrapped in the redis attribute of this class. Is there a need to call the class Redis ... ?

Based on the drafted responses, I would rather call this class SlotsManager. Now, if this is still not descriptive enough, it is probably because Slots is still too vague a term.
Went with SlotsManager, even though it still lacks something.
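For illustration only, a minimal sketch of the slot idea discussed above, assuming the redis-py asyncio client; the key format, defaults and method names are hypothetical and not the code in this PR:

```python
import asyncio

from redis.asyncio import Redis
from redis.asyncio.lock import Lock


class SlotsManager:
    """Limits how many services may save data concurrently on a given node
    by modelling each node's "slots" as a fixed set of named Redis locks."""

    def __init__(self, redis: Redis, slots_per_node: int = 2) -> None:
        self.redis = redis
        self.slots_per_node = slots_per_node

    async def acquire_slot(self, node_id: str) -> Lock:
        # try every slot of the node until one can be acquired
        while True:
            for slot in range(self.slots_per_node):
                lock = self.redis.lock(f"{node_id}.slot_{slot}", timeout=10.0)
                if await lock.acquire(blocking=False):
                    return lock
            await asyncio.sleep(0.1)  # all slots busy, retry shortly

    @staticmethod
    async def release_slot(lock: Lock) -> None:
        await lock.release()
```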
```python
async def _extend_task(self, lock: Lock) -> None:
    while True:
        await asyncio.sleep(self.lock_extend_interval)
```
Sorry, I do not totally understand this tuning to solve deadlocks. Moreover, what values guarantee that a deadlock will not occur?
It could happen that the director-v2 acquires all the locks and is then killed. At that point nothing will be able to release them any longer.
If the locks are not released, the system will no longer be able to save any data.
To allow the locks to be released, they must be acquired with a timeout, and a task must be created which extends their timeout at regular intervals.
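A minimal sketch of that pattern, assuming the redis-py asyncio client; the key name and intervals are hypothetical, not the PR's actual values:

```python
import asyncio

from redis.asyncio import from_url

LOCK_TIMEOUT_S = 10.0     # the lock auto-expires if director-v2 dies and stops extending it
EXTEND_INTERVAL_S = 5.0   # must be shorter than LOCK_TIMEOUT_S


async def _keep_extending(lock) -> None:
    # runs for as long as the work is ongoing; resets the lock's TTL periodically
    while True:
        await asyncio.sleep(EXTEND_INTERVAL_S)
        await lock.extend(LOCK_TIMEOUT_S, replace_ttl=True)


async def save_state_with_lock(redis_url: str, node_id: str) -> None:
    client = from_url(redis_url, encoding="utf-8", decode_responses=True)
    lock = client.lock(f"node_save_lock:{node_id}", timeout=LOCK_TIMEOUT_S)
    await lock.acquire()
    extender = asyncio.create_task(_keep_extending(lock))
    try:
        ...  # long-running save of the service state
    finally:
        extender.cancel()
        await lock.release()
```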
```python
await lock.release()
assert await lock.locked() is False
# NOTE: apparently it mirrors the first lock instance!
assert await second_lock.locked() is False
```
These notes suggest that you are getting results you did not expect. Please check with @sanderegg how this library works. He has some experience with it.
Yes, this was unexpected and caused a great deal of confusion.
So while reviewing I am wondering why the locking mechanism is in the director-v2. Since you are anyway already implementing a long running operations mode, would it not make more sense to do the following:
Say we have node XX with 10 dy-sidecars:
- they all know the hostname because it is defined directly using the docker swarm auto-filled IDs
- they connect to redis on their own
- we can discuss that

Then I have secondary questions:
- Did you see the same effect on master where RClone is disabled?
- Can you properly measure that this is going to effectively reduce the pressure? Or are we building an artificial bottleneck?
- What about the download of data when the services are starting?
```
@@ -93,5 +93,9 @@ async def redis_locks_client(
async def wait_till_redis_responsive(redis_url: Union[URL, str]) -> None:
    client = from_url(f"{redis_url}", encoding="utf-8", decode_responses=True)

    if not await client.ping():
        raise ConnectionError(f"{redis_url=} not available")
    try:
```
Why do you need to flushall here? It is just a ping...
Would it not make more sense to separate all this, or even use the sync client in that case? Just in a context manager...
No need to flushall, you are correct. The issue is with the close_connection_pool argument, which is not supplied when using the context manager.
So just use the synchronous client. We do not need async here.
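A rough sketch of that suggestion, assuming the synchronous redis-py client (hypothetical, not the PR code):

```python
from contextlib import closing

from redis import Redis


def wait_till_redis_responsive(redis_url: str) -> None:
    # the context manager guarantees the connection is closed after the ping
    with closing(Redis.from_url(f"{redis_url}", decode_responses=True)) as client:
        if not client.ping():
            raise ConnectionError(f"{redis_url=} not available")
```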
```python
logger.warning("%s", f"{err}")

logger.info("Ports data pushed by dynamic-sidecar")
await dynamic_sidecar_client.begin_service_destruction(
```
Is this a long running operation?
I just realised that the lock is owned by the director-v2?? Why is that?
Why not the dynamic-sidecar directly? Especially since it knows on which hostname it lies, the locking would be much simpler, right?
> Is this a long running operation?

Yes, there are two very long running operations and several shorter but still lengthy ones.

> I just realised that the lock is owned by the director-v2?? Why is that?

This is the correct place to lock since:
- the entire block needs to sit under a lock.
- the dynamic scheduler knows best when to save the state. I would not push this responsibility onto the sidecar, nor leave it to each individual sidecar. The current design gives as little responsibility as possible to the dynamic-sidecar.
- to correctly check, acquire and release the lock you need to use the same instance.

> Why not the dynamic-sidecar directly? Especially since it knows on which hostname it lies, the locking would be much simpler, right?

- locking here is more complex to deal with, since the lock needs to be used across different APIs
- you always need to use the initial instance of the lock to acquire/release it (it must be kept in the app's state)
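As a rough illustration of that last point (hypothetical names, not the PR implementation), the acquired Lock instance is kept on the FastAPI application state so that the very same instance can later be used to release it:

```python
from fastapi import FastAPI
from redis.asyncio import from_url


def setup_redis(app: FastAPI, redis_url: str) -> None:
    app.state.redis_client = from_url(redis_url)
    app.state.node_locks = {}  # node hostname -> Lock instance


async def acquire_node_lock(app: FastAPI, node_hostname: str) -> None:
    lock = app.state.redis_client.lock(f"lock:{node_hostname}", timeout=10.0)
    await lock.acquire()
    # keep the instance: releasing requires the same Lock object (it holds the token)
    app.state.node_locks[node_hostname] = lock


async def release_node_lock(app: FastAPI, node_hostname: str) -> None:
    lock = app.state.node_locks.pop(node_hostname)
    await lock.release()
```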
I still think that locking in the dy-sidecar would be way simpler, and it would also prevent the director-v2 from having to hold so much logic.
Just look at real life: the director never knows what the employee is doing exactly; the employee knows best when to lock his drawers.
Well, anyway, this should be moved to the new design when and if it comes.
refactor usage
Ok, so we can disable that thing in case it goes south.
Nevertheless I still think it is wrong that the director-v2 takes all these responsibilities. This should all be shifted to the sidecar (same with docker-compose up/down, etc...). It would simplify everything by a lot. Let's keep this in mind for later refactoring.
some doc
Pair reviewed.
Suggest changing "slotsmanager" to something that suggests "acquire the rights to use a docker swarm node".
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
Redis is now used by the director-v2. Please check that the following env vars are passed to the director-v2:
- REDIS_HOST
- REDIS_PORT
- REDIS_USER
- REDIS_PASSWORD

Only on AWS STAGING set DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED=true
What do these changes do?
When a user runs lots of services on a low-resource machine, that machine can get overloaded when the study is closed.
Usually iowait grows very fast and hits 100%; at that point the machine starts having issues.
The idea of this PR is to lower the pressure on resource usage when services are closed.
To achieve this, each docker node is assigned a limited number of slots, limiting how many services can save their data concurrently on that node.

Added environment variables:
- DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED (default False): enables/disables the feature
- DYNAMIC_SIDECAR_DOCKER_NODE_CONCURRENT_SAVES (default 2): max concurrent saves for the feature
- DYNAMIC_SIDECAR_DOCKER_NODE_SAVES_LOCK_TIMEOUT_S (default 10.0): redis lock extension timeout; when the timeout expires the lock is released
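For reference, the defaults above map naturally onto a pydantic settings class along these lines (the class name and wiring are hypothetical, not necessarily how the PR does it):

```python
from pydantic import BaseSettings


class DockerNodeSavesSettings(BaseSettings):
    # values are read from environment variables of the same name
    DYNAMIC_SIDECAR_DOCKER_NODE_RESOURCE_LIMITS_ENABLED: bool = False
    DYNAMIC_SIDECAR_DOCKER_NODE_CONCURRENT_SAVES: int = 2
    DYNAMIC_SIDECAR_DOCKER_NODE_SAVES_LOCK_TIMEOUT_S: float = 10.0
```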
Related issue/s
How to test
Checklist