
🐛 fixing randomly failing [sys] deploy simcore #2430

Closed
GitHK wants to merge 15 commits into master from fix-boot-order

Conversation

@GitHK (Contributor) commented Jul 9, 2021

What do these changes do?

Issue 1

When the [sys] deploy simcore CI job runs, it deploys the stack and checks that all the docker services have started.
Because docker swarm has no built-in way to tell whether an application inside a container is ready, wait-for is used to ensure all application components are booted before other components start.
Most importantly, PostgreSQL must be running and all migrations must be applied.

Changes:

  • the migration service will start an HTTP server once it's ready (see the sketch after this list)
  • the following services will wait for migration to be ready before continuing to boot:
    • director-v2
    • director-v0
    • webserver
    • api-server
    • catalog
    • sidecar
    • storage
  • the wait-for script was added to the repo, together with a test ensuring it is present
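The migration service's readiness endpoint itself is not shown in this conversation; the following is only a minimal sketch of what such a server could look like, assuming FastAPI/uvicorn (which the migration Dockerfile diff further down installs) and a hypothetical run_migrations() step:

# Hypothetical sketch, not the actual simcore implementation.
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
async def ready() -> dict:
    # this endpoint only exists once the server is up, i.e. after the
    # migrations have completed, so waiting for http://migration:8000
    # implies the database schema is up to date
    return {"status": "migrations applied"}

if __name__ == "__main__":
    # run_migrations()  # placeholder for the real migration step (e.g. alembic)
    uvicorn.run(app, host="0.0.0.0", port=8000)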

Issue 2

The test_core_service_running test never took task slots into consideration. Each replica of a service is assigned a task slot, and the task list may contain multiple entries for the same slot (see the sketch below).
Addressing this issue fixed the randomly failing CI runs 🎉
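For illustration only (not the exact test code), a sketch of how the raw task list can be reduced to distinct slots before comparing against the expected replica count, using the docker Python SDK as the test does; the service name is hypothetical:

import docker
from typing import Any, Dict, List, Set

client = docker.from_env()
service = client.services.get("simcore_sidecar")  # hypothetical service name

# service.tasks() returns every task entry, including failed attempts that
# share a slot with the currently running task
tasks: List[Dict[str, Any]] = service.tasks()
slots: Set[int] = {t["Slot"] for t in tasks}

expected_replicas = service.attrs["Spec"]["Mode"]["Replicated"]["Replicas"]
assert len(slots) == expected_replicas, (
    f"expected {expected_replicas} slot(s), found {len(slots)}"
)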

Trivia: this PR will mainly help unlucky developers; some of us could have lived without it. I personally think that restarting the CI 18 times is not ideal.

Related issue/s

How to test

Checklist

codecov bot commented Jul 9, 2021

Codecov Report

Merging #2430 (55f5f10) into master (2e4c7f1) will decrease coverage by 4.2%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #2430     +/-   ##
========================================
- Coverage    76.0%   71.7%   -4.3%     
========================================
  Files         574     574             
  Lines       21629   21629             
  Branches     2077    2077             
========================================
- Hits        16445   15517    -928     
- Misses       4634    5613    +979     
+ Partials      550     499     -51     
Flag               Coverage Δ
integrationtests   51.9% <ø> (-13.8%) ⬇️
unittests          69.9% <ø> (-0.1%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...ackages/simcore-sdk/src/simcore_sdk/models/base.py 0.0% <0.0%> (-100.0%) ⬇️
...ges/simcore-sdk/src/simcore_sdk/models/__init__.py 0.0% <0.0%> (-100.0%) ⬇️
...core-sdk/src/simcore_sdk/models/pipeline_models.py 0.0% <0.0%> (-64.6%) ⬇️
...imcore_service_webserver/exporter/export_import.py 34.1% <0.0%> (-61.0%) ⬇️
..._director_v2/modules/db/repositories/comp_tasks.py 37.7% <0.0%> (-58.2%) ⬇️
...re-sdk/src/simcore_sdk/node_ports/serialization.py 24.6% <0.0%> (-57.0%) ⬇️
...c/simcore_service_director_v2/utils/async_utils.py 32.7% <0.0%> (-56.9%) ⬇️
...vice_webserver/exporter/formatters/formatter_v2.py 31.9% <0.0%> (-56.4%) ⬇️
...ore_service_director_v2/api/routes/computations.py 35.9% <0.0%> (-55.3%) ⬇️
...vice_webserver/exporter/formatters/formatter_v1.py 24.5% <0.0%> (-54.2%) ⬇️
... and 50 more

@GitHK added the bug (buggy, it does not work as expected) and t:maintenance (Some planned maintenance work) labels Jul 9, 2021
@GitHK added this to the Wombat milestone Jul 9, 2021
@GitHK self-assigned this Jul 9, 2021
@GitHK changed the title from "🐛 fixing failing [sys] deploy simcore" to "🐛 fixing randomly failing [sys] deploy simcore" Jul 21, 2021
@GitHK marked this pull request as ready for review July 21, 2021 08:29
@sanderegg (Member) left a comment
cool! this will help. I still need to check the changes in the test, but a few things:
Please check the Dockerfiles: you mostly don't need to install wget in both the dev and prod stages, only in the base stage (a sketch is below).
Also, the sidecar most probably should not be part of the default network, on the principle of the fewest services in a network.
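As an illustration of the first point, a hedged multi-stage Dockerfile sketch (the stage names and base image below are assumptions, not the repo's actual ones) where wget is installed once in the base stage and inherited by both dev and prod:

# Sketch only; stage names and base image are assumptions.
FROM python:3.8-slim as base

# install wget (used by the wait-for check) once, so the development and
# production stages below inherit it without repeating the apt-get steps
RUN apt-get update \
    && apt-get install -y --no-install-recommends wget \
    && rm -rf /var/lib/apt/lists/*

FROM base as development
# ... development-only tooling ...

FROM base as production
# ... production image contents ...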

gosu nobody true
apt-get update; \
apt-get install -y gosu; \
rm -rf /var/lib/apt/lists/*; \
@sanderegg (Member):
if wget is needed in both dev/prod stages why don't you install it here?


ENTRYPOINT [ "/bin/sh", "services/api-server/docker/entrypoint.sh" ]
CMD ["/bin/sh", "services/api-server/docker/boot.sh"]
CMD ["/bin/sh", "-c", "wait-for http://migration:8000 -- services/api-server/docker/boot.sh"]
@sanderegg (Member):
We are going to have an issue in the deployments here:
all the apps are renamed using ${SWARM_STACK_NAME}_migration. Please double-check and correct, as this will fail for sure (a possible parameterization is sketched below).
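A hedged sketch of such a parameterization (MIGRATION_HOST also appears in the compose diff further down; the default values here are assumptions):

# Sketch only: run through /bin/sh -c so the variable is expanded at container
# start; MIGRATION_HOST would default to "migration" for the plain stack and
# be set to "${SWARM_STACK_NAME}_migration" where services are renamed.
CMD ["/bin/sh", "-c", "wait-for http://${MIGRATION_HOST:-migration}:8000 -- services/api-server/docker/boot.sh"]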

apt-get update; \
apt-get install -y gosu; \
rm -rf /var/lib/apt/lists/*; \
# verify that the binary works
@sanderegg (Member):
same as above, and for all the subsequent Dockerfiles


ENTRYPOINT [ "/bin/sh", "services/catalog/docker/entrypoint.sh" ]
CMD ["/bin/sh", "services/catalog/docker/boot.sh"]
CMD ["/bin/sh", "-c", "wait-for http://migration:8000 -- services/catalog/docker/boot.sh"]
@sanderegg (Member):
never too sure whether the -c is needed or not. Also, please use the verbose option (for all the Dockerfiles); one possible reading of the -c question is sketched below.
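A hedged sketch of the two CMD forms (whether the shell-less variant works depends on how wait-for is installed and made executable in the image, which is not shown here):

# /bin/sh -c parses the whole string as a command line, so wait-for, the --
# separator and the boot script each become their own argument. Without -c,
# /bin/sh would try to interpret the string as the path of a script.
CMD ["/bin/sh", "-c", "wait-for http://migration:8000 -- services/catalog/docker/boot.sh"]

# An exec-form alternative without a shell would need wait-for on PATH and
# each token as its own array element:
# CMD ["wait-for", "http://migration:8000", "--", "/bin/sh", "services/catalog/docker/boot.sh"]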

@@ -158,6 +158,7 @@ services:
- SWARM_STACK_NAME=${SWARM_STACK_NAME:-simcore}
- SIDECAR_HOST_HOSTNAME_PATH=${SIDECAR_HOST_HOSTNAME_PATH:-/home/scu/hostname}
networks:
- default
@sanderegg (Member):
if you need to put that service in the whole (default) network because of the wait-for, then we should find another solution. We should prevent services from being in the default network if possible. Can't we wait for another service, or find a different network construction? We can check the network/swarm diagrams.

@sanderegg (Member):
after double-checking: is this also needed for the "normal" sidecar? If not, then I'm ok with this.

@@ -215,6 +216,7 @@ services:
- SIDECAR_HOST_HOSTNAME_PATH=${SIDECAR_HOST_HOSTNAME_PATH:-/home/scu/hostname}
- TARGET_MPI_NODE_CPU_COUNT=${DEV_PC_CPU_COUNT:-0} # development computer CPU count, if env var is missing put to 0 to disable
networks:
- default
@sanderegg (Member):
same here

@@ -153,6 +153,8 @@ services:
- DIRECTOR_V2_PORT=${DIRECTOR_V2_PORT:-8000}
- STORAGE_HOST=${STORAGE_HOST:-storage}
- STORAGE_PORT=${STORAGE_PORT:-8080}
- MIGRATION_HOST=${MIGRATION_HOST:-migration}
@sanderegg (Member):
you are not using that ENV, are you?

@@ -227,6 +229,7 @@ services:
START_AS_MODE_CPU: ${SIDECAR_FORCE_CPU_NODE:-0}

networks:
- default
@sanderegg (Member):
ok, so there is no reason to have the sidecar be part of the default network. Also, it does not need to access postgres before the director sends computation tasks, so you probably do not need that change (and also no wait-for from the sidecar).
Maybe you can even fine-tune these things a bit?


@@ -52,6 +52,7 @@ WORKDIR /build
# install only base 3rd party dependencies
COPY --chown=scu:scu packages/postgres-database/ .
RUN pip --no-cache-dir --quiet install .[migration]
RUN pip --no-cache-dir install fastapi uvicorn
@sanderegg (Member):
why not use the usual requirements? Please keep it the same for all services; a possible shape is sketched below.
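A hedged sketch of what that could look like (the requirements file path below is hypothetical; the repo's real layout and pinning workflow may differ):

# Sketch only: pin the readiness-server dependencies in a requirements file
# like the other services do, instead of an ad-hoc pip install.
COPY --chown=scu:scu packages/postgres-database/ .
RUN pip --no-cache-dir --quiet install .[migration]
COPY --chown=scu:scu packages/postgres-database/docker/requirements.txt .  # hypothetical path
RUN pip --no-cache-dir install -r requirements.txt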

@sanderegg (Member) left a comment
reviewed the changes in the test.
I think the original idea of that test is that when we start the stack, all services shall start without erroring (even once); this test shall detect that problem.
I think the way you changed it will no longer detect this, right?

Also, please be very strict with the terminology: Docker service -> tasks = slots = containers.

@@ -210,21 +244,26 @@ def get_tasks_summary(tasks):
return msg


def get_task_logs(task, service_name, docker_client):
@sanderegg (Member):
Suggested change:
-def get_task_logs(task, service_name, docker_client):
+def _get_task_logs(task, service_name, docker_client):

task = service.tasks()[i]
# assert number of slots equals number of expected services
tasks = service.tasks()
slots: Set[int] = {t["Slot"] for t in tasks}
@sanderegg (Member):
can you double-check:
I think this test is meant to check that the services start in one trial; they should not be restarting several times until they can run.
And can you explain precisely what a task slot is? As far as I know, the service spawns a task that runs a container; if that container goes down/fails/completes, the service starts another task. So what is a slot? If a slot == task, I would not change the terminology, because this will cause confusion.

@sanderegg (Member):
basically, if we want to have 8 sidecars there should be 8 tasks, nothing else. If that is not the case, then we have something broken in the sidecar that needs fixing.

@GitHK (Contributor, author):
First of all, let me address the tasks and slots difference.

Each time a new service is started, a new task is created for each replica. Each task has a unique slot assigned to it; the slot is an integer (you usually see it in the task name).
What the documentation never says (and it is not obvious at all) is that if a task fails (regardless of the reason), there will be multiple entries in the task list with the same slot id.

To better explain, see the example below. Start by defining a service with an image which does not exist.

docker service create --replicas 1 --name aservice missingimage:latest2

To understand what is happening you actually need the --no-trunc option, as below. Wait a few seconds before running it:

docker service ps --no-trunc aservice

The output will be similar to the following:

ID                          NAME             IMAGE                  NODE         DESIRED STATE   CURRENT STATE             ERROR                                   PORTS
44oiom4ba6rceyydxclrvaf9h   aservice.1       missingimage:latest2   neagu-wkst   Ready           Rejected 2 seconds ago    "No such image: missingimage:latest2"
lh15jmwbgmdj1k9xs2zugc8mj    \_ aservice.1   missingimage:latest2   neagu-wkst   Shutdown        Rejected 7 seconds ago    "No such image: missingimage:latest2"
lo0e312k39se79ljdi8d9m1v5    \_ aservice.1   missingimage:latest2   neagu-wkst   Shutdown        Rejected 12 seconds ago   "No such image: missingimage:latest2"
qzfnxfskeqq0swqvrte57ikxm    \_ aservice.1   missingimage:latest2   neagu-wkst   Shutdown        Rejected 17 seconds ago   "No such image: missingimage:latest2"
bt7z3ykrn3r9y08t37qrwgtwi    \_ aservice.1   missingimage:latest2   neagu-wkst   Shutdown        Rejected 22 seconds ago   "No such image: missingimage:latest2"

The information is not all that clear, but it goes as follows:

  • the second line actually refers to the service itself and states in what state the service is desired to be
  • lines 3-6 all contain a task named \_ aservice.1, but they all have different IDs; each of them complains that the image is missing (which is to be expected)
  • so lines 3-6 actually refer to the same task and have the same slot number, but are in different states
  • when the tasks are retrieved with tasks = service.tasks(), all of the above entries are returned (the rejected ones as well as the current one)
  • currently, in the CI, when Postgres is unreachable there are 2 (or more) entries for the same task and slot id: one or more are in "Failed" and one is in "Running". Because the order in which these tasks are listed is not guaranteed, you get randomly flaky CI tests and get to restart the CI countless times.

In conclusion (a sketch of the resulting check is below):

  • you can have more than one task listed in tasks = service.tasks() for the same task/slot
  • if there are 8 replicas, there might be more than 8 tasks listed
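For illustration only (names here are not the exact test code), a sketch of the per-slot check this leads to: group the task entries by slot and require at least one running task per slot, which is robust against leftover failed entries:

from collections import defaultdict
from typing import Any, Dict, List

def one_running_task_per_slot(tasks: List[Dict[str, Any]]) -> bool:
    # a slot may hold several task entries (earlier failed attempts plus the
    # live one); the slot is fine as long as one entry is currently running
    tasks_by_slot: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for task in tasks:
        tasks_by_slot[task["Slot"]].append(task)
    return all(
        any(t["Status"]["State"] == "running" for t in slot_tasks)
        for slot_tasks in tasks_by_slot.values()
    )

The actual test additionally compares the number of distinct slots against the expected replica count, as the asserts quoted further down show.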

@GitHK (Contributor, author):
I need to rename some of the variables inside the function for clarity; some of them are not very well named.

@sanderegg (Member):
ok, so I think there is a small confusion there.
I did exactly what you said, but with replicas=4 instead of replicas=1,

and I obtained this (by the way, the --no-trunc param only changes the ID from short to long, so it makes no difference):

docker service ps aservice
ID             NAME             IMAGE                 NODE                   DESIRED STATE   CURRENT STATE             ERROR                              PORTS
y3zjnendm6ip   aservice.1       missingimage:latest   sanderegg-All-Series   Ready           Rejected 2 seconds ago    "No such image: missingimage:l…"   
tm4enmgl81za    \_ aservice.1   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 6 seconds ago    "No such image: missingimage:l…"   
ie904nrd8gll    \_ aservice.1   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 12 seconds ago   "No such image: missingimage:l…"   
gmfelojsodpq    \_ aservice.1   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 14 seconds ago   "No such image: missingimage:l…"   
w6ngk0fjtclo   aservice.2       missingimage:latest   sanderegg-All-Series   Ready           Rejected 2 seconds ago    "No such image: missingimage:l…"   
u5vxw9lnkldv    \_ aservice.2   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 6 seconds ago    "No such image: missingimage:l…"   
pilyh86flkgj    \_ aservice.2   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 12 seconds ago   "No such image: missingimage:l…"   
fg8kfoekjkwy    \_ aservice.2   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 14 seconds ago   "No such image: missingimage:l…"   
z68h4c0d4e3p   aservice.3       missingimage:latest   sanderegg-All-Series   Ready           Rejected 2 seconds ago    "No such image: missingimage:l…"   
qv8ucjr9edji    \_ aservice.3   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 7 seconds ago    "No such image: missingimage:l…"   
yvl4u4y8z41h    \_ aservice.3   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 12 seconds ago   "No such image: missingimage:l…"   
yto0eghef6un    \_ aservice.3   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 14 seconds ago   "No such image: missingimage:l…"   
4m8obbbjbob3   aservice.4       missingimage:latest   sanderegg-All-Series   Ready           Rejected 2 seconds ago    "No such image: missingimage:l…"   
cjamc81eot17    \_ aservice.4   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 6 seconds ago    "No such image: missingimage:l…"   
h6xeuyaj13ww    \_ aservice.4   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 12 seconds ago   "No such image: missingimage:l…"   
dn45522r2ysx    \_ aservice.4   missingimage:latest   sanderegg-All-Series   Shutdown        Rejected 14 seconds ago   "No such image: missingimage:l…"

Doing so effectively creates a service with 4 tasks: that is what we see as aservice.1, aservice.2, aservice.3 and aservice.4, on the lines without a \_ sign.
The lines with the \_ sign are the history and are always shutdown. This is the same as what you see in Portainer, actually.

So now, to come back to that test: we normally have 8 replicas of the sidecar, so there should be exactly 8 tasks, not more, not less. If there are more, it means the first ones failed to start, and this is something we would like to avoid since we want a controlled startup.


assert slot_count == expected_replicas, (
f"Expected to have {expected_replicas} slot(s), "
f"instead {slot_count} slot(s) were found."
@sanderegg (Member):
same here. Please check this article: it says a slot is "analogous" to a task, and it is nowhere visible when issuing docker service ps.

]
return len(running_tasks) > 0

# check at least one service in per slot is in running mode else raise error
@sanderegg (Member):
this comment is very confusing. Please check the terminology.
A service -> creates a bunch of tasks (that are analogous to slots), which ultimately are containers, not the other way around.

return len(running_tasks) > 0

# check at least one service in per slot is in running mode else raise error
tasks_by_slot: Dict[int, Deque[Dict[str, Any]]] = {}
@sanderegg (Member):
what does tasks_by_slot mean, since a task and a slot are supposed to be the same (see the article above)?

@GitHK deleted the fix-boot-order branch July 28, 2021 14:58
@GitHK restored the fix-boot-order branch July 28, 2021 14:58