🐛 fixing randomly failing [sys] deploy simcore #2430
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #2430     +/-  ##
=========================================
- Coverage    76.0%    71.7%    -4.3%
=========================================
  Files         574      574
  Lines       21629    21629
  Branches     2077     2077
=========================================
- Hits        16445    15517     -928
- Misses       4634     5613     +979
+ Partials      550      499      -51
Flags with carried forward coverage won't be shown.
cool! this will help. I still need to check the changes in the test. But some things:
Please check the Dockerfiles: you mostly don't need to install wget in both dev and prod stages but in the base stage.
Also, the sidecar most probably should not be part of the default network; principle of least services in the network.
gosu nobody true
apt-get update; \
apt-get install -y gosu; \
rm -rf /var/lib/apt/lists/*; \
if wget is needed in both dev/prod stages, why don't you install it here?
ENTRYPOINT [ "/bin/sh", "services/api-server/docker/entrypoint.sh" ]
CMD ["/bin/sh", "services/api-server/docker/boot.sh"]
CMD ["/bin/sh", "-c", "wait-for http://migration:8000 -- services/api-server/docker/boot.sh"]
We are going to have an issue in the deployments here: all the apps are renamed using $SWARM_STACK_NAME_migration. Please double-check and correct, as this will fail for sure.
apt-get update; \
apt-get install -y gosu; \
rm -rf /var/lib/apt/lists/*; \
# verify that the binary works
same as above, and for all the subsequent Dockerfiles
ENTRYPOINT [ "/bin/sh", "services/catalog/docker/entrypoint.sh" ]
CMD ["/bin/sh", "services/catalog/docker/boot.sh"]
CMD ["/bin/sh", "-c", "wait-for http://migration:8000 -- services/catalog/docker/boot.sh"]
never too sure whether the -c is needed or not. And please also use the verbose option (for all the Dockerfiles).
@@ -158,6 +158,7 @@ services:
      - SWARM_STACK_NAME=${SWARM_STACK_NAME:-simcore}
      - SIDECAR_HOST_HOSTNAME_PATH=${SIDECAR_HOST_HOSTNAME_PATH:-/home/scu/hostname}
    networks:
      - default
if you need to put that service in the whole (default) network because of the wait-for, then we should find another solution. We should prevent services from being in the default network if possible. Can't we wait-for another service, or find a different network construction? We can check the network/swarm diagrams.
after double-checking: is this also needed for the "normal" sidecar? If not, then I'm ok with this.
@@ -215,6 +216,7 @@ services:
      - SIDECAR_HOST_HOSTNAME_PATH=${SIDECAR_HOST_HOSTNAME_PATH:-/home/scu/hostname}
      - TARGET_MPI_NODE_CPU_COUNT=${DEV_PC_CPU_COUNT:-0} # development computer CPU count, if env var is missing put to 0 to disable
    networks:
      - default
same here
@@ -153,6 +153,8 @@ services:
      - DIRECTOR_V2_PORT=${DIRECTOR_V2_PORT:-8000}
      - STORAGE_HOST=${STORAGE_HOST:-storage}
      - STORAGE_PORT=${STORAGE_PORT:-8080}
      - MIGRATION_HOST=${MIGRATION_HOST:-migration}
you are not using that ENV, no?
@@ -227,6 +229,7 @@ services:
      START_AS_MODE_CPU: ${SIDECAR_FORCE_CPU_NODE:-0}

    networks:
      - default
ok, so there is no reason to have the sidecar as part of the default network. Also, it does not need to access postgres before the director sends computation tasks, so you probably do not need that change (and also no wait-for from the sidecar).
Maybe you can even fine-tune these things a bit?
@@ -52,6 +52,7 @@ WORKDIR /build
# install only base 3rd party dependencies
COPY --chown=scu:scu packages/postgres-database/ .
RUN pip --no-cache-dir --quiet install .[migration]
RUN pip --no-cache-dir install fastapi uvicorn
why not use the usual requirements? Please keep it the same for all services.
reviewed the changes in the test.
I think the original idea of that test is that when we start the stack, all services shall start without erroring (even once); this test shall detect that problem. I think the way you changed it will not detect this anymore, right?
Also, please be very strict with the terminology: Docker service -> tasks = slots = containers.
@@ -210,21 +244,26 @@ def get_tasks_summary(tasks):
    return msg


def get_task_logs(task, service_name, docker_client):
- def get_task_logs(task, service_name, docker_client):
+ def _get_task_logs(task, service_name, docker_client):
task = service.tasks()[i]
# assert number of slots equals number of expected services
tasks = service.tasks()
slots: Set[int] = {t["Slot"] for t in tasks}
can you double-check: I think this test is meant to check that the services start in one trial; they should not be restarting several times until they can run.
And can you explain what precisely a task slot is? As far as I know, the service spawns a task that runs a container; if that container goes down/fails/completes, then the service will start another task. So what is a slot? If a slot == task, then I would not change the terminology, because this will cause confusion.
basically, if we want to have 8 sidecars, there should be 8 tasks, not anything else. If that is not the case, then we have something broken in the sidecar that needs fixing.
First of all, let me address the tasks and slots difference.
Each time a new service is started, a new task is created for each replica. Each task has a unique slot assigned to it. This slot is an integer (usually you see it in the task name).
What they never say in the documentation (and it is not obvious at all): if a task fails (regardless of the reason), there will be multiple entries for tasks with the same slot id.
To better explain, see the example below. Start by defining a service with an image which does not exist.
docker service create --replicas 1 --name aservice missingimage:latest2
To understand what is happening you actually need to use the --no-trunc option, like below. Wait a few seconds before running this command:
docker service ps --no-trunc aservice
The output will be similar to the following:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
44oiom4ba6rceyydxclrvaf9h aservice.1 missingimage:latest2 neagu-wkst Ready Rejected 2 seconds ago "No such image: missingimage:latest2"
lh15jmwbgmdj1k9xs2zugc8mj \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 7 seconds ago "No such image: missingimage:latest2"
lo0e312k39se79ljdi8d9m1v5 \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 12 seconds ago "No such image: missingimage:latest2"
qzfnxfskeqq0swqvrte57ikxm \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 17 seconds ago "No such image: missingimage:latest2"
bt7z3ykrn3r9y08t37qrwgtwi \_ aservice.1 missingimage:latest2 neagu-wkst Shutdown Rejected 22 seconds ago "No such image: missingimage:latest2"
The information is not all that clear, but it goes as follows:
- the second line actually refers to the service itself and states in what state the service is desired to be
- lines 3-6 all contain a task named `\_ aservice.1`, but all have different IDs; each of them complains that the image is missing (which is to be expected), so lines 3-6 actually refer to the same task and have the same slot number but are in different states
- when the tasks are recovered with `tasks = service.tasks()`, all the above will be returned; in this case, 4 tasks will be returned
- currently, in the CI, when Postgres is unreachable there are 2 (or more) entries for the same task and slot id: one or more are in "Failed" and one is in "Running". Because the order in which these tasks are listed is not guaranteed, you get random flaky CI tests and get to restart the CI countless times.
In conclusion (see the sketch after this list):
- you can have more than one task listed by `tasks = service.tasks()` for the same task/slot
- if there are 8 replicas, there might be more than 8 tasks listed
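A minimal sketch of the grouping this implies, using the docker Python SDK; the service name and the exact assertion are illustrative assumptions, not the PR's actual test code:

```python
from collections import defaultdict
from typing import Any, Dict, List

import docker

client = docker.from_env()
service = client.services.get("aservice")  # hypothetical service name

# group every task entry by its slot id: one slot may carry several
# entries (the failed history plus the currently running one)
tasks_by_slot: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
for task in service.tasks():
    tasks_by_slot[task["Slot"]].append(task)

# order-independent check: at least one entry per slot must be running,
# no matter how many failed entries precede it in the list
for slot, entries in tasks_by_slot.items():
    assert any(
        e["Status"]["State"] == "running" for e in entries
    ), f"no running task for slot {slot}"
```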
I need to rename some of the variables inside the function for clarity; some of them are not very well named.
ok, so I think there is a small confusion there.
I did exactly what you said but replaced replicas=1 with replicas=4, and I obtained this (by the way, the no-trunc param only changes the ID from short to long, so it makes no difference):
docker service ps aservice
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
y3zjnendm6ip aservice.1 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
tm4enmgl81za \_ aservice.1 missingimage:latest sanderegg-All-Series Shutdown Rejected 6 seconds ago "No such image: missingimage:l…"
ie904nrd8gll \_ aservice.1 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
gmfelojsodpq \_ aservice.1 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
w6ngk0fjtclo aservice.2 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
u5vxw9lnkldv \_ aservice.2 missingimage:latest sanderegg-All-Series Shutdown Rejected 6 seconds ago "No such image: missingimage:l…"
pilyh86flkgj \_ aservice.2 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
fg8kfoekjkwy \_ aservice.2 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
z68h4c0d4e3p aservice.3 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
qv8ucjr9edji \_ aservice.3 missingimage:latest sanderegg-All-Series Shutdown Rejected 7 seconds ago "No such image: missingimage:l…"
yvl4u4y8z41h \_ aservice.3 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
yto0eghef6un \_ aservice.3 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
4m8obbbjbob3 aservice.4 missingimage:latest sanderegg-All-Series Ready Rejected 2 seconds ago "No such image: missingimage:l…"
cjamc81eot17 \_ aservice.4 missingimage:latest sanderegg-All-Series Shutdown Rejected 6 seconds ago "No such image: missingimage:l…"
h6xeuyaj13ww \_ aservice.4 missingimage:latest sanderegg-All-Series Shutdown Rejected 12 seconds ago "No such image: missingimage:l…"
dn45522r2ysx \_ aservice.4 missingimage:latest sanderegg-All-Series Shutdown Rejected 14 seconds ago "No such image: missingimage:l…"
Doing so effectively creates a service with 4 tasks; that is what we see as aservice.1, aservice.2, aservice.3 and aservice.4, respectively on the lines where there is no \_ sign.
The lines with the \_ sign are the history and are shutdown. Always. This is the same as what you see in Portainer, actually.
So now to come back to that test: we normally have 8 replicas of the sidecar, so there should be exactly 8 tasks, not more, not less. If there are more, this means the first ones failed to start, and this is something we would like to avoid since we want a controlled startup.
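A sketch of the stricter check this suggests, again with the docker Python SDK; the service name is hypothetical:

```python
import docker

client = docker.from_env()
service = client.services.get("sidecar")  # hypothetical service name

# declared replica count from the service spec
expected_replicas = service.attrs["Spec"]["Mode"]["Replicated"]["Replicas"]

# a controlled startup means no failed history entries at all:
# the task list holds exactly one entry per replica
assert len(service.tasks()) == expected_replicas
```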
assert slot_count == expected_replicas, (
    f"Expected to have {expected_replicas} slot(s), "
    f"instead {slot_count} slot(s) were found."
same here. Please check this article: it says a slot is "analogous" to a task, and it is nowhere visible when issuing docker service ps.
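For reference, the slot is exposed by the Docker API even though the CLI only embeds it in the task name (aservice.1, aservice.2, ...); a quick way to inspect it with the docker Python SDK, using a hypothetical service name:

```python
import docker

client = docker.from_env()
for task in client.services.get("aservice").tasks():
    # "Slot" comes straight from the API; `docker service ps`
    # never prints it as a separate column
    print(task["ID"], task.get("Slot"), task["Status"]["State"])
```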
]
return len(running_tasks) > 0


# check at least one service in per slot is in running mode else raise error
this comment is very confusing. Please check the terminology:
a service creates a bunch of tasks (that are analogous to slots), which in the end are containers, not the contrary.
return len(running_tasks) > 0


# check at least one service in per slot is in running mode else raise error
tasks_by_slot: Dict[int, Deque[Dict[str, Any]]] = {}
what does tasks_by_slot mean, since a slot is supposed to be the same as a task (see article above)?
What do these changes do?
Issue 1

When the `[sys] deploy simcore` job runs, it starts the CI and checks that all the docker services are started. Because docker-swarm has no way to tell if the application is ready, `wait-for` is used to ensure all application components are booted before starting other components. Most importantly, PostgreSQL is required to be running and all migrations must be applied.
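For illustration, a rough Python equivalent of what such a wait-for step does (the script actually added to the repo may differ; the URL and timeouts here are assumptions):

```python
import time
import urllib.request


def wait_for(url: str, timeout: float = 60.0, interval: float = 1.0) -> None:
    """Poll url until it responds; raise if the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            urllib.request.urlopen(url, timeout=5)
            return
        except OSError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"{url} still unreachable after {timeout}s")
            time.sleep(interval)


# block booting until the migration service answers
wait_for("http://migration:8000")
```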
Changes:
- the `migration` service will start an HTTP server once it's ready (sketched after this list)
- services wait for the `migration` service to be ready before continuing to boot
- the `wait-for` script was used and added to the repo; a test to ensure it's present was also added
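A minimal sketch of such a readiness endpoint, consistent with the `fastapi`/`uvicorn` install shown in the Dockerfile diff above (the route and response body are assumptions):

```python
from fastapi import FastAPI
import uvicorn

app = FastAPI()


@app.get("/")
def ready() -> dict:
    # this server is only started after all migrations are applied,
    # so any answer on port 8000 signals readiness to wait-for
    return {"status": "ready"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```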
Issue 2

The `test_core_service_running` test never took task slots into consideration. Each replica of a service is assigned a task slot, and multiple tasks may be present in the tasks list for each slot. Addressing this issue fixed the randomly failing CI 🎉
Trivia: this PR will help mainly unlucky developers; some of us could have lived without it. I personally think that restarting the CI 18 times is not ideal.
Related issue/s
How to test
Checklist