🐛 fixing randomly failing [sys] deploy simcore #2430
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #2430     +/-  ##
=========================================
- Coverage    76.0%    71.7%    -4.3%
=========================================
  Files         574      574
  Lines       21629    21629
  Branches     2077     2077
=========================================
- Hits        16445    15517     -928
- Misses       4634     5613     +979
+ Partials      550      499      -51
Flags with carried forward coverage won't be shown.
cool! this will help. I still need to check the changes in the test. But some things:
Please check the Dockerfiles: you mostly don't need to install wget in both dev and prod stages but in the base stage.
Also, the sidecar most probably should not be part of the default network; principle of least services in the network.
gosu nobody true
apt-get update; \
apt-get install -y gosu; \
rm -rf /var/lib/apt/lists/*; \
if wget is needed in both dev/prod stages, why don't you install it here?
ENTRYPOINT [ "/bin/sh", "services/api-server/docker/entrypoint.sh" ]
CMD ["/bin/sh", "services/api-server/docker/boot.sh"]
CMD ["/bin/sh", "-c", "wait-for http://migration:8000 -- services/api-server/docker/boot.sh"]
We are going to have an issue in the deployments here: all the apps are renamed using $SWARM_STACK_NAME_migration. Please double-check and correct, as this will fail for sure.
apt-get update; \
apt-get install -y gosu; \
rm -rf /var/lib/apt/lists/*; \
# verify that the binary works
same as above, and for all the subsequent Dockerfiles
ENTRYPOINT [ "/bin/sh", "services/catalog/docker/entrypoint.sh" ]
CMD ["/bin/sh", "services/catalog/docker/boot.sh"]
CMD ["/bin/sh", "-c", "wait-for http://migration:8000 -- services/catalog/docker/boot.sh"]
never too sure whether the -c is needed or not. And please also use the verbose option (for all the Dockerfiles).
@@ -158,6 +158,7 @@ services:
      - SWARM_STACK_NAME=${SWARM_STACK_NAME:-simcore}
      - SIDECAR_HOST_HOSTNAME_PATH=${SIDECAR_HOST_HOSTNAME_PATH:-/home/scu/hostname}
    networks:
      - default
if you need to put that service in the whole (default) network because of the wait-for, then we should find another solution. We should prevent services from being in the default network if possible. Can't we wait-for another service, or find a different network construction? We can check the network/swarm diagrams.
after double-checking: is this also needed for the "normal" sidecar? If not, then I'm ok with this.
@@ -215,6 +216,7 @@ services:
      - SIDECAR_HOST_HOSTNAME_PATH=${SIDECAR_HOST_HOSTNAME_PATH:-/home/scu/hostname}
      - TARGET_MPI_NODE_CPU_COUNT=${DEV_PC_CPU_COUNT:-0} # development computer CPU count, if env var is missing put to 0 to disable
    networks:
      - default
same here
@@ -153,6 +153,8 @@ services:
      - DIRECTOR_V2_PORT=${DIRECTOR_V2_PORT:-8000}
      - STORAGE_HOST=${STORAGE_HOST:-storage}
      - STORAGE_PORT=${STORAGE_PORT:-8080}
      - MIGRATION_HOST=${MIGRATION_HOST:-migration}
you are not using that ENV, no?
@@ -227,6 +229,7 @@ services:
      START_AS_MODE_CPU: ${SIDECAR_FORCE_CPU_NODE:-0}

    networks:
      - default
ok, so there is no reason to have the sidecar as part of the default network. Also, it does not need to access postgres before the director sends computation tasks, so you probably do not need that change (and also no wait-for from the sidecar).
Maybe you can even fine-tune these things a bit?
@@ -52,6 +52,7 @@ WORKDIR /build
# install only base 3rd party dependencies
COPY --chown=scu:scu packages/postgres-database/ .
RUN pip --no-cache-dir --quiet install .[migration]
RUN pip --no-cache-dir install fastapi uvicorn
why not use the usual requirements? Please keep it the same for all services.
reviewed the changes in the test.
I think the original idea of that test is that when we start the stack, all services shall start without erroring (even once); this test shall detect that problem. I think the way you changed it will not detect this anymore, right?
Also, please be very strict with the terminology: Docker service -> tasks = slots = containers.
@@ -210,21 +244,26 @@ def get_tasks_summary(tasks):
    return msg


def get_task_logs(task, service_name, docker_client):
- def get_task_logs(task, service_name, docker_client):
+ def _get_task_logs(task, service_name, docker_client):
task = service.tasks()[i]
# assert number of slots equals number of expected services
tasks = service.tasks()
slots: Set[int] = {t["Slot"] for t in tasks}
can you double-check: I think this test is meant to check that the services start in one trial; they should not be restarting several times until they can run.
And can you explain what precisely a task slot is? As far as I know, the service spawns a task that runs a container; if that container goes down/fails/completes, then the service will start another task. So what is a slot? If a slot == task, then I would not change the terminology, because this will cause confusion.
basically, if we want to have 8 sidecars, there should be 8 tasks, not anything else. If that is not the case, then we have something broken in the sidecar that needs fixing.
First of all, let me address the tasks and slots difference.
Each time a new service is started, a new task is created for each replica. Each task has a unique slot assigned to it. This slot is an integer (usually you see it in the task name).
What they never say in the documentation (and it is not obvious at all): if a task fails (regardless of the reason), there will be multiple entries for tasks with the same slot id.
To better explain, see the example below. Start by defining a service with an image which does not exist.
docker service create --replicas 1 --name aservice missingimage:latest2
To understand what is happening you actually need to use the --no-trunc option, like below. Wait a few seconds before running this command:
docker service ps --no-trunc aservice
The output will be similar to the following:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
44oiom4ba6rceyydxclrvaf9h aservice.1 missingimage:latest2 neagu-wkst Ready Rejected 2 seconds ago "No such image: missingimage:latest2"
lh15jmwbgmdj1k9xs2zugc8mj \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 7 seconds ago "No such image: missingimage:latest2"
lo0e312k39se79ljdi8d9m1v5 \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 12 seconds ago "No such image: missingimage:latest2"
qzfnxfskeqq0swqvrte57ikxm \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 17 seconds ago "No such image: missingimage:latest2"
bt7z3ykrn3r9y08t37qrwgtwi \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 22 seconds ago "No such image: missingimage:latest2"
The information is not all that clear, but it goes as follows:
- the second line actually refers to the service itself and states in what state the service is desired to be
- lines 3-6 all contain a task named `\_ aservice.1`, but all have different IDs; each of them complains that the image is missing (which is to be expected), so lines 3-6 actually refer to the same task and have the same slot number but are in different states
- when the tasks are recovered with `tasks = service.tasks()`, all the above will be returned; in this case, 4 tasks will be returned
- currently, in the CI, when Postgres is unreachable there are 2 (or more) entries for the same task and slot id: one or more are in "Failed" and one is in "Running". Because the order in which these tasks are listed is not guaranteed, you get random flaky CI tests and get to restart the CI countless times.
In conclusion (see the sketch after this list):
- you can have more than one task listed by `tasks = service.tasks()` for the same task/slot
- if there are 8 replicas, there might be more than 8 tasks listed
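A minimal sketch of the grouping this implies, using the docker Python SDK; the service name and the exact assertion are illustrative assumptions, not the PR's actual test code:

```python
from collections import defaultdict
from typing import Any, Dict, List

import docker

client = docker.from_env()
service = client.services.get("aservice")  # hypothetical service name

# group every task entry by its slot id: one slot may carry several
# entries (the failed history plus the currently running one)
tasks_by_slot: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
for task in service.tasks():
    tasks_by_slot[task["Slot"]].append(task)

# order-independent check: at least one entry per slot must be running,
# no matter how many failed entries precede it in the list
for slot, entries in tasks_by_slot.items():
    assert any(
        e["Status"]["State"] == "running" for e in entries
    ), f"no running task for slot {slot}"
```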
I need to rename some of the variables inside the function for clarity; some of them are not very well named.
ok, so I think there is a small confusion there.
I did exactly what you said but replaced replicas=1 with replicas=4, and I obtained this (by the way, the no-trunc param only changes the ID from short to long, so it makes no difference):
docker service ps aservice
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
y3zjnendm6ip aservice.1 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
tm4enmgl81za \_ aservice.1 missingimage:latest sanderegg-All-Series Shutdown Rejected 6 seconds ago "No such image: missingimage:l…"
ie904nrd8gll \_ aservice.1 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
gmfelojsodpq \_ aservice.1 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
w6ngk0fjtclo aservice.2 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
u5vxw9lnkldv \_ aservice.2 missingimage:latest sanderegg-All-Series Shutdown Rejected 6 seconds ago "No such image: missingimage:l…"
pilyh86flkgj \_ aservice.2 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
fg8kfoekjkwy \_ aservice.2 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
z68h4c0d4e3p aservice.3 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
qv8ucjr9edji \_ aservice.3 missingimage:latest sanderegg-All-Series Shutdown Rejected 7 seconds ago "No such image: missingimage:l…"
yvl4u4y8z41h \_ aservice.3 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
yto0eghef6un \_ aservice.3 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
4m8obbbjbob3 aservice.4 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
cjamc81eot17 \_ aservice.4 missingimage:latest sanderegg-All-Series Shutdown Rejected 6 seconds ago "No such image: missingimage:l…"
h6xeuyaj13ww \_ aservice.4 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
dn45522r2ysx \_ aservice.4 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
Doing so effectively creates a service with 4 tasks; that is what we see as aservice.1, aservice.2, aservice.3 and aservice.4, respectively on the lines where there is no \_ sign.
The lines with the \_ sign are the history and are shutdown. Always. This is the same as what you see in Portainer, actually.
So now to come back to that test: we normally have 8 replicas of the sidecar, so there should be exactly 8 tasks, not more, not less. If there are more, this means the first ones failed to start, and this is something we would like to avoid since we want a controlled startup.
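A sketch of the stricter check this suggests, again with the docker Python SDK; the service name is hypothetical:

```python
import docker

client = docker.from_env()
service = client.services.get("sidecar")  # hypothetical service name

# declared replica count from the service spec
expected_replicas = service.attrs["Spec"]["Mode"]["Replicated"]["Replicas"]

# a controlled startup means no failed history entries at all:
# the task list holds exactly one entry per replica
assert len(service.tasks()) == expected_replicas
```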
assert slot_count == expected_replicas, (
    f"Expected to have {expected_replicas} slot(s), "
    f"instead {slot_count} slot(s) were found."
same here. Please check this article: it says a slot is "analogous" to a task, and it is nowhere visible when issuing docker service ps.
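For reference, the slot is exposed by the Docker API even though the CLI only embeds it in the task name (aservice.1, aservice.2, ...); a quick way to inspect it with the docker Python SDK, using a hypothetical service name:

```python
import docker

client = docker.from_env()
for task in client.services.get("aservice").tasks():
    # "Slot" comes straight from the API; `docker service ps`
    # never prints it as a separate column
    print(task["ID"], task.get("Slot"), task["Status"]["State"])
```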
]
return len(running_tasks) > 0


# check at least one service in per slot is in running mode else raise error
this comment is very confusing. Please check the terminology:
a service creates a bunch of tasks (that are analogous to slots), which in the end are containers, not the contrary.
return len(running_tasks) > 0


# check at least one service in per slot is in running mode else raise error
tasks_by_slot: Dict[int, Deque[Dict[str, Any]]] = {}
what does tasks_by_slot mean, since a slot is supposed to be the same as a task (see article above)?
What do these changes do?
Issue 1

When the `[sys] deploy simcore` job runs, it starts the CI and checks that all the docker services are started. Because docker-swarm has no way to tell if the application is ready, `wait-for` is used to ensure all application components are booted before starting other components. Most importantly, PostgreSQL is required to be running and all migrations must be applied.
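For illustration, a rough Python equivalent of what such a wait-for step does (the script actually added to the repo may differ; the URL and timeouts here are assumptions):

```python
import time
import urllib.request


def wait_for(url: str, timeout: float = 60.0, interval: float = 1.0) -> None:
    """Poll url until it responds; raise if the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            urllib.request.urlopen(url, timeout=5)
            return
        except OSError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"{url} still unreachable after {timeout}s")
            time.sleep(interval)


# block booting until the migration service answers
wait_for("http://migration:8000")
```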
Changes:
- the `migration` service will start an HTTP server once it's ready (sketched after this list)
- services wait for the `migration` service to be ready before continuing to boot
- the `wait-for` script was used and added to the repo; a test to ensure it's present was also added
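A minimal sketch of such a readiness endpoint, consistent with the `fastapi`/`uvicorn` install shown in the Dockerfile diff above (the route and response body are assumptions):

```python
from fastapi import FastAPI
import uvicorn

app = FastAPI()


@app.get("/")
def ready() -> dict:
    # this server is only started after all migrations are applied,
    # so any answer on port 8000 signals readiness to wait-for
    return {"status": "ready"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```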
Issue 2

The `test_core_service_running` test never took task slots into consideration. Each replica of a service is assigned a task slot, and multiple tasks may be present in the tasks list for each slot. Addressing this issue fixed the randomly failing CI 🎉
Trivia: this PR will help mainly unlucky developers; some of us could have lived without it. I personally think that restarting the CI 18 times is not ideal.
Related issue/s
How to test
Checklist