bug: stopped job's tasks in raw_exec driver get a signal to exit only after few minutes #2133
Comments
Just out of curiosity, can you post your wrapper script?
@pgporada Sorry for the delay; see the last lines for the workaround. I hope you'll find the script useful.

```bash
#!/bin/bash
# Handler for the signals Nomad sends to stop the task
my_exit()
{
    echo "killing $CID"
    docker stop --time=5 "$CID"  # try to stop the container gracefully
    docker rm -f "$CID"          # remove the stopped container
}
trap 'my_exit; exit' SIGHUP SIGTERM SIGINT

# For debugging
env
echo

# Build the docker run command
CMD="docker run -d --name ${NOMAD_TASK_NAME}-${NOMAD_ALLOC_ID}"
for a in "$@"; do
    CMD="$CMD $a"
done
echo "docker wrapper: the docker command that will run is: $CMD"
echo "from here on it is the container output:"
echo

# Actually run the command
CID=$($CMD)

# Print docker logs in the background
docker logs -f "$CID" &

# Loop so the script can respond to signals between sleeps
while :
do
    sleep 5
    # Monitor the container and exit if it is not running
    CSTATUS=$(docker inspect --format='{{.State.Status}}' "$CID")
    if [ -n "${CSTATUS}" ]; then
        if [ "${CSTATUS}" != "running" ] && [ "${CSTATUS}" != "paused" ]; then
            echo "Error - container is not in the desired state (status is ${CSTATUS}); exiting"
            my_exit; exit
        fi
    else
        echo "Error - container cannot be found; exiting task..."
        my_exit; exit
    fi

    # Workaround for the Nomad bug: exit if the job is no longer running
    if ! nomad status "${NOMAD_JOB_NAME}" > /dev/null 2>&1; then
        echo "going to exit since the job is not running any more"
        my_exit; exit
    fi
done
```
@OferE Hey, when you say 10 jobs, can you clarify? Do you mean 10 task groups spread across 20 machines? Are you letting them run to completion? Could you show any logs from a "missed signal"? Another test you can do to isolate between a Nomad issue and a setup issue is to remove starting a docker container and just run a sleep loop, and see if the signal is received by your script. If it is, Nomad is doing the right thing.
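The isolation test suggested above can be sketched as a bare trap-and-sleep script with no Docker involved; the file paths and messages below are illustrative. The second half simulates what Nomad does on job stop by sending the task a SIGTERM:

```shell
# Write a minimal signal-check task to a file (the path /tmp/sigtest.sh is illustrative).
cat > /tmp/sigtest.sh <<'EOF'
#!/bin/bash
# Exit as soon as a stop signal arrives; no Docker involved.
trap 'echo "signal received"; exit 0' HUP TERM INT
echo "waiting for a signal"
while :; do sleep 1; done
EOF
chmod +x /tmp/sigtest.sh

# Simulate what Nomad does on job stop: run the task, then send SIGTERM.
/tmp/sigtest.sh > /tmp/sigtest.out 2>&1 &
PID=$!
sleep 1
kill -TERM "$PID"
wait "$PID"
cat /tmp/sigtest.out
```

Running the same inner script as a raw_exec task and stopping the job would show whether Nomad delivers the signal promptly once Docker is out of the picture.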
10 jobs, each containing a few tasks.
Hi, I'll try to reproduce next week, as I won't be at work this week and I've already worked around this (see the code above).
@OferE Thank you!
Closed by #2177.
Thank you so much for solving this. Highly appreciated!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.5.1
Operating system and Environment details
Ubuntu 14.04.5 LTS
Issue
I'm using raw_exec, and I wrapped my executable with a script that traps signals.
This works nicely, except at scale.
When working with 20 machines and 10 jobs, not all tasks receive the stop signal fast enough.
The signal reaches all the tasks only after a few minutes, which is far too long.
Once the signal is delivered, the tasks end correctly.
Reproduction steps
I can't include my entire cluster config here, but it happens every time: a few containers keep running, and the logs prove they never received the signal.
Just scale a cluster to 20 machines and 10 jobs and you'll see it.
Workaround
I work around this bug by adding a check in my wrapper script for
nomad status ${NOMAD_JOB_NAME}
and killing the container if the job is no longer there.
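That check can be condensed into a small polling loop. In this sketch, `check_cmd` stands in for `nomad status "${NOMAD_JOB_NAME}"` (which exits non-zero once the job is gone) and `cleanup_cmd` stands in for the wrapper's cleanup handler; the function and demo names are illustrative, not from the original script:

```shell
# Sketch of the workaround: poll a status command until it fails, then clean up.
poll_until_job_gone() {
    check_cmd="$1"; cleanup_cmd="$2"; interval="${3:-5}"
    # `nomad status` exits non-zero once the job no longer exists.
    while $check_cmd > /dev/null 2>&1; do
        sleep "$interval"
    done
    echo "job is not running any more; cleaning up"
    $cleanup_cmd
}

# Illustration with a check that fails immediately, so cleanup fires at once:
poll_until_job_gone false "echo cleanup-ran" 1 > /tmp/poll_demo.out
cat /tmp/poll_demo.out
```

In the real wrapper, the loop would be called as `poll_until_job_gone 'nomad status mystery-job' my_exit 5` (job name illustrative), replacing the inline check shown in the script above.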
Edit
This is a serious bug. I discovered it because I ran Docker Swarm side by side with Nomad and saw the containers still running. I hope you will run this experiment and prioritize a fix for this critical bug.