Nomad jobs not being killed on ubuntu #4218
Comments
The nomad job described basically runs the following script:
Since Nomad doesn't yet support running a process and all of its subsequently started processes in one process group, I attempt to trap the kill signal and kill the child process that is spawned by this script. I think that is probably where the problem occurs. It seems that the process running the script is killed after the nomad job is stopped, but the process spawned by ${HERON_TOPOLOGY_DOWNLOAD_CMD} remains, and the nomad job remains stuck between those two states. After I manually kill the process spawned by ${HERON_TOPOLOGY_DOWNLOAD_CMD}, the status of the job finally went to "complete".
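Roughly, the pattern being described looks like the following sketch (assumed shape only, not the actual Heron script; only ${HERON_TOPOLOGY_DOWNLOAD_CMD} comes from the description above):

```bash
#!/bin/bash
# Sketch of the described trap-and-kill pattern, not the real run_heron_executor.sh.

# On SIGTERM/SIGINT (what Nomad sends on "nomad stop"), kill the child and exit.
trap 'kill "$child_pid" 2>/dev/null; exit' SIGTERM SIGINT

# Spawn the download step as a background child and remember its PID.
${HERON_TOPOLOGY_DOWNLOAD_CMD} &
child_pid=$!

# "wait" is interruptible by trapped signals, unlike a foreground command.
wait "$child_pid"
```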
I also tried running this simple script on Ubuntu:
This was to see whether the issue was with the trapping code. The above script seems to be working correctly: when I kill the process running test.sh, the process running "tail -f /dev/null" is killed as expected.
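A minimal test.sh along the lines described would be something like this sketch:

```bash
#!/bin/bash
# Minimal trap test: kill the background child when this script is signaled.
trap 'kill "$child_pid" 2>/dev/null; exit' SIGTERM SIGINT

tail -f /dev/null &
child_pid=$!

wait "$child_pid"
```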
Thanks for submitting this issue and detailed information; we'll take a further look into this.
@chelseakomlo thanks for looking into this. FYI I just tested with the latest 0.8.3 release. The results are the same.
@jerrypeng Can you try with this binary? It is a Linux amd64 build of master as of e108f73: nomad.zip. From that build I ran this job file:
When I do a nomad stop, both the sleep and tail processes get killed. Since Nomad 0.8.2, all the processes should be started in the same process group: #3572. You can see that they are all in the same process group here:
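Judging from the sleep and tail processes mentioned, the raw_exec task presumably ran a script along these lines (an assumption, not the actual job file):

```bash
#!/bin/bash
# Assumed test workload: one background child plus one foreground child.
# Under Nomad >= 0.8.2 both should stay in the task's process group,
# so "nomad stop" can signal them together.
sleep 1000 &
tail -f /dev/null
```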
@dadgar thank you! I will try it out.
@dadgar So I tried with the latest Nomad build you recommended:
I am still getting the problem I mentioned above. Using that version of Nomad, I submitted a Heron topology application and then killed the application:
The Nomad jobs that represent executors in Heron display as being killed. However, upon further inspection of one of the jobs:
As you can see, the "Desired" state is "stop" but the "Status" remains "running". The jobs are stuck in this weird state. Some more details:
So the process also doesn't get killed. The "force killing" mentioned in the "Description" never seems to happen. I see that all the processes are in the same process group as the nomad executor:
But the parent nomad executor never gets killed either. Any thoughts?
@dadgar so actually something weird is going on. The "python2.7 ./heron-core/bin/heron-executor" process is not a child of the nomad executor:
But how come in the example you ran:
the shell script is a child process of the nomad executor? Why is that the case for the example script but not for the heron-nomad.sh script, when they are pretty much identical?
Upon further investigation I noticed the following. When the nomad jobs of a Heron topology are running, the "/bin/sh run_heron_executor.sh" process is the parent of heron-executor and all subprocesses, which is correct and expected:
However, after I kill the topology and the corresponding nomad jobs get stopped:
The "/bin/sh run_heron_executor.sh" process disappears or gets killed, but the subprocesses don't. Very weird.
@jerrypeng I am having quite a hard time following the heron standalone guide. I am running this:
And I see the task is failing:
Job Inspect: https://gist.github.com/dadgar/2316b24f0c985c7b46fc510f02fdf01e Do you have a VM or simpler reproduction steps?
@jerrypeng If you repeat the steps from #4218 (comment), can you show what the process group is for each of those? I wonder if the python executor is changing its process group.
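For a single process, the group and session can be checked with something like the following, where <PID> is a placeholder:

```bash
# Print the process group and session IDs for one process of interest.
ps -o pid,ppid,pgid,sid,cmd -p <PID>
```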
@dadgar are you trying to submit to an existing Nomad cluster or did you also start the cluster via "heron-admin standalone cluster"? |
@jerrypeng So I ran the following. I followed the guide here exactly: https://apache.github.io/incubator-heron/docs/operators/deployment/schedulers/standalone/ The only difference is that I replaced heron-nomad with the 0.8.4-dev binary. I am also running ZooKeeper in the started Nomad cluster before starting that topology via:
Did you see that the ENV for the downloader has null in it? |
@dadgar ya I saw the null in the URL. That usually means that the hostname could not be determined. Are you running this on a VM via vagrant? I will try to reproduce. Ya, you are right about the heron-executor changing the process group:
I just realized that in the code of heron_executor.py there is code to create a new process group:
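The general mechanism in Python looks like this (a sketch of the technique, not the verbatim heron_executor.py code):

```python
import os

# Detach into a new process group; equivalent to os.setpgid(0, 0).
# After this, signals delivered to the parent's process group no longer
# reach this process or the children it spawns.
os.setpgrp()

# The same effect can be applied to spawned children, e.g.
# subprocess.Popen(cmd, preexec_fn=os.setpgrp)
```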
However, at the same time, though I am not a bash expert, shouldn't the trap clause in the script kill the heron-executor process when run_heron_executor.sh gets killed by Nomad as the job is stopped?
@dadgar so Kubernetes supports a prepare phase that executes some commands prior to the main service you want to invoke. Is there something similar in Nomad? Basically, a task that gets run before your "main" task. If there is, then all the downloading of jars and creation of directories that is done in run_heron_executor.sh can be moved to that phase, and the "main" task can directly invoke the heron-executor command, side-stepping this problem.
So I can also reproduce the problem with the following job spec:
and python file test.py:
Though I am not sure if you would consider this an application issue or an issue with Nomad. The test.py process and its child processes will never be killed by Nomad, and the nomad job will always be in a limbo state:
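A minimal reconstruction of the pattern being described (a process detaching into its own group and leaving a long-running child behind) might look like this; the details are assumptions:

```python
#!/usr/bin/env python
import os
import subprocess
import time

# Detach into a new process group, mirroring what heron_executor.py does.
# From here on, signals sent to the Nomad task's process group no longer
# reach this process or its children.
os.setpgrp()

# A long-running child that ends up leaked when the job is stopped.
subprocess.Popen(["tail", "-f", "/dev/null"])

# Keep the parent alive indefinitely.
while True:
    time.sleep(60)
```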
@jerrypeng Okay thanks for getting me the reproducers. So I have good and bad news. The good news is:
Bad news is:
So the end outcome is that in the most common scenario, where Nomad is being run as root, this is entirely solved. When not running as root and not on Linux, there actually isn't anything Nomad can do, because the task is essentially creating a daemon and not forwarding signals properly, so it gets leaked. I added a note about that in the docs. Essentially, if you are going to daemonize, you need to forward signals properly such that the grandchild process gets killed.
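As a rough illustration of that signal-forwarding advice (a sketch only, not an official Nomad example; my_daemonizing_command is a placeholder):

```bash
#!/bin/bash
# Wrapper that forwards termination signals to its child instead of
# letting the child (and its descendants) be orphaned.
my_daemonizing_command &
child_pid=$!

trap 'kill "$child_pid" 2>/dev/null; wait "$child_pid"; exit' SIGTERM SIGINT

wait "$child_pid"
```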
@dadgar thanks a ton for taking the time to investigate this!
@jerrypeng Of course! Give the RC a shot and let me know if it doesn't fix the issue when you run as root on Linux: https://releases.hashicorp.com/nomad/0.8.4-rc1/
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list. If filing a bug please include the following:
Nomad version
0.7.0 and 0.8.1
Operating system and Environment details
Ubuntu 16.04
Issue
I noticed an interesting problem with Nomad running on Ubuntu.
I have a nomad job running via the raw_exec driver. When I try to kill the job via "nomad stop", the job is not being killed but is stuck between the "stop" state and the "running" state.
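For context, the commands involved are roughly (job name is a placeholder):

```bash
nomad stop <job-name>     # ask Nomad to stop the job
nomad status <job-name>   # Desired shows "stop" while Status stays "running"
```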
Here is some info on the job:
As you can see, the "Desired" state for the job is "stop" but the actual state is "running". The job is stuck in this condition. I have waited a while, but the job never actually gets killed or transitions into the "stop" state.
The above output is for only one of the many jobs/containers that make up a Heron job or topology.
I only see this problem occurring on Ubuntu, not on CentOS. I have tested with both Nomad 0.7.0 and the latest 0.8.1; both releases seem to have this problem on Ubuntu.
This problem is impacting Heron Standalone cluster deployments (which use Nomad for scheduling) on Ubuntu. Users have said that they had to manually kill all the processes since Nomad failed to kill them.
Reproduction steps
Submitted a Heron topology to a Heron standalone cluster using Nomad. Steps described here:
https://apache.github.io/incubator-heron/docs/operators/deployment/schedulers/standalone/
Nomad Server logs (if appropriate)
nomad_server_log.txt
Nomad Client logs (if appropriate)
nomad_clien_log.txt
Job file (if appropriate)
jobfile.txt