Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

βœ¨πŸ› making dy-sidecar resilient to restarts #3239

Merged
merged 73 commits into from
Aug 19, 2022

Conversation

GitHK
Copy link
Contributor

@GitHK GitHK commented Aug 9, 2022

What do these changes do?

In short

  • Allows the dynamic-sidecar to restart and recover the previous status it had.
  • If the dynamic-sidecar is restarted it will reattach itself to its previous volumes and will be able to save the service's data.
  • This also means the service can be restarted from Portainer and it will still behave as expected.

New volume handling

Changes how volumes, attached to the dynamic-sidecar, are handled.

  • Volumes, associated to the runtime of the dy-sidecar, are no longer automatically removed by the docker swarm when the dynamic-sidecar is container is removed.
  • These volumes are removed by the director-v2 when after after the dy-sdidecar service is removed. If there was an issue with the dy-sidecar (killed, restarted or manually removed from Portainer) the volumes remain on the node.
  • Created volumes will be unique, between runs. It implies that when the same service (from inside a project) is started, it will receive different persistent volumes. This minimises the likelihood of any data loss. In the worst case scenario the volumes will remain on the node (manual cleanup will be required).

What will I see on the node?

When starting a service( for example: based on jupyter-math), with node_uuid=4e600b06-bc61-4c6a-9ff8-c6df9dbd27a3, the following volumes will be created by the dy-sidecar:

dyv_5ef3b2ef-e7d1-421a-945c-18d8ed63d19f_home_jovyan_work_inputs_4e600b06-bc61-4c6a-9ff8-c6df9dbd27a3
dyv_5ef3b2ef-e7d1-421a-945c-18d8ed63d19f_home_jovyan_work_outputs_4e600b06-bc61-4c6a-9ff8-c6df9dbd27a3
dyv_5ef3b2ef-e7d1-421a-945c-18d8ed63d19f_home_jovyan_work_workspace_4e600b06-bc61-4c6a-9ff8-c6df9dbd27a3
dyv_5ef3b2ef-e7d1-421a-945c-18d8ed63d19f_shared-store_4e600b06-bc61-4c6a-9ff8-c6df9dbd27a3

The naming of the volumes follows the following rule dyv_{RUN_ID}{NORMLIZED_FOLDER_PATH}_{NODE_UUID} where:

  • dyv a simple identifiable prefix
  • RUN_ID is a UUID generate by the director-v2 when starting the dynamic-sidecar. It is used to uniquely identify different runs of the service. This allows the service (between restarts) to not attach itself to older data (maybe out of date).
  • NORMLIZED_FOLDER_PATH obtained by replacing the / with _ in the folder path
  • NODE_UUID the service's unique identifier inside the project

Changes

  • refactored docker_api inside director-v2's dynamic-sidecar module (500+ lines of code)
  • use uniquely named docker volumes instead of just named volumes.
  • director-v2 can remove volumes from target node
  • add local volume for sidecar where to keep the compose spec so it can recover after boot and know which containers it is suppose to monitor
    • add new volume when creating sidecar & put it in the volume cleanup
    • refactor task and events to share the handler for removing volumes
  • finally properly addressed flakiness with the state_exclude inside director-v2 tests
  • added cleanup task to handle dy-volrm_{NODE_UUID} services which could remain pending in the system in case of error.

Checks

  • if sidecar is killed (docker kill) , it is stil able to save the state of the service
  • restarting service always works, even if volumes are pending on the node
  • when node leaves the swarm volumes remain on the node
    • test on master when node gets separated from the cluster with @mrnicegyu11

Related issue/s

How to test

Checklist

  • Unit tests for the changes exist
  • Runs in the swarm

@codecov
Copy link

codecov bot commented Aug 9, 2022

Codecov Report

Merging #3239 (600ebe3) into master (465f49d) will decrease coverage by 0.7%.
The diff coverage is 84.3%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #3239     +/-   ##
========================================
- Coverage    81.6%   80.9%   -0.8%     
========================================
  Files         753     759      +6     
  Lines       32205   32392    +187     
  Branches     4115    4128     +13     
========================================
- Hits        26309   26214     -95     
- Misses       5040    5295    +255     
- Partials      856     883     +27     
Flag Coverage Ξ”
integrationtests 66.5% <81.5%> (+<0.1%) ⬆️
unittests 78.6% <93.6%> (+0.6%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Ξ”
...ector_v2/modules/dynamic_sidecar/scheduler/task.py 72.4% <61.5%> (-10.0%) ⬇️
...r_v2/modules/dynamic_sidecar/docker_api/_volume.py 66.1% <66.1%> (ΓΈ)
...ice_director_v2/modules/dynamic_sidecar/volumes.py 71.4% <85.7%> (ΓΈ)
...ore_service_dynamic_sidecar/models/shared_store.py 88.8% <86.6%> (-11.2%) ⬇️
...2/modules/dynamic_sidecar/scheduler/_task_utils.py 90.4% <90.4%> (ΓΈ)
...tor_v2/modules/dynamic_sidecar/docker_api/_core.py 80.5% <92.3%> (ΓΈ)
...or_v2/modules/dynamic_sidecar/docker_api/_utils.py 93.3% <93.3%> (ΓΈ)
...mic_sidecar/docker_service_specs/volume_remover.py 93.3% <93.3%> (ΓΈ)
...ages/models-library/src/models_library/services.py 93.0% <100.0%> (+0.1%) ⬆️
...ges/service-library/src/servicelib/docker_utils.py 100.0% <100.0%> (ΓΈ)
... and 75 more

@GitHK GitHK self-assigned this Aug 9, 2022
@GitHK GitHK added this to the Brutalism milestone Aug 9, 2022
@GitHK GitHK added the a:director-v2 issue related with the director-v2 service label Aug 9, 2022
@GitHK GitHK changed the title βœ¨πŸ› removal of dy-sidecar volumes from nodes βœ¨πŸ› making dy-sidecar more resilient to restarts Aug 11, 2022
@GitHK GitHK changed the title βœ¨πŸ› making dy-sidecar more resilient to restarts βœ¨πŸ› making dy-sidecar resilient to restarts Aug 11, 2022
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!Β  Β  Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
0.3% 0.3% Duplication

@GitHK GitHK merged commit 25e7ca9 into ITISFoundation:master Aug 19, 2022
@GitHK GitHK deleted the pr-osparc-named-volumes branch August 19, 2022 17:34
pcrespov added a commit to pcrespov/osparc-simcore that referenced this pull request Aug 22, 2022
pcrespov added a commit to pcrespov/osparc-simcore that referenced this pull request Aug 22, 2022
pcrespov added a commit that referenced this pull request Aug 22, 2022
* Revert "βœ¨πŸ› making dy-sidecar resilient to restarts (#3239)"

This reverts commit 25e7ca9.

* fixes bad merge
GitHK added a commit that referenced this pull request Sep 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:director-v2 issue related with the director-v2 service
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dy-sidecar reboot volume/data issues
3 participants