
Nomad allocation getting killed and transferred to another node in the cluster #3289

Closed
smohankarthik opened this issue Sep 27, 2017 · 7 comments


@smohankarthik

Nomad version

Nomad v0.6.3

Operating system and Environment details

Ubuntu 16.04(Nomad Server)/Windows 2016 Server(Nomad Client Node)

Issue

When I deploy services to the Windows nodes (it's a 2-node client cluster), the services run fine at first, but after about 10 minutes the processes are killed and the allocations are moved to the other node in the cluster.
I cross-checked on the node having this problem: everything works when I drain the other node and deploy all services to a single node, but it fails when running a 2- or 3-node cluster.

The recent events for the allocation show that it is being killed:

$ nomad alloc-status 71e73261
ID                  = 71e73261
Eval ID             = 2d27e486
Name                = xxxxxxxx
Node ID             = c313f583
Job ID              = test
Job Version         = 35
Client Status       = complete
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created At          = 09/27/17 05:11:09 UTC
Deployment ID       = 12c0c923
Deployment Health   = healthy

Task "xxxxxxxxxxx" is "dead"
Task Resources
CPU      Memory  Disk     IOPS  Addresses
100 MHz  10 MiB  300 MiB  0     http: 10.0.0.162:28060
                                orleans_cluster: 10.0.0.162:24547
                                orleans_proxy: 10.0.0.162:28553

Task Events:
Started At     = 09/27/17 05:13:05 UTC
Finished At    = 09/27/17 05:23:09 UTC
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type                   Description
09/27/17 05:23:08 UTC  Killed                 Task successfully killed
09/27/17 05:23:06 UTC  Killing                Sent interrupt. Waiting 5s before force killing
09/27/17 05:13:05 UTC  Started                Task started by client
09/27/17 05:10:55 UTC  Downloading Artifacts  Client is downloading artifacts
09/27/17 05:10:55 UTC  Task Setup             Building Task Directory
09/27/17 05:10:55 UTC  Received               Task received by client


Nomad Server logs

I found that the Nomad heartbeat TTL expired. I'm not quite sure, but this is what I see in the Nomad server log:


2017/09/27 05:23:17.299258 [WARN] nomad.heartbeat: node 'c313f583-e0ea-9bfb-93f7-f4c0de2fddd8' TTL expired
2017/09/27 05:23:17.474754 [DEBUG] sched: <Eval 'c12ee9f5-dbb4-6938-8eec-2bd1ce17f622' JobID: 'test'>: Total changes: (place 4) (destructive 0) (inplace 0) (stop 4)
Created Deployment: "48bc3767-9a0b-2d6b-407e-f5fdcd1fe66e"
Desired Changes for "service1": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service2": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "service3": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service4": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service5": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "service6": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service7": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "service8": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
@dadgar
Contributor

dadgar commented Sep 27, 2017

@smohankarthik Can you try out the Nomad 0.7 beta on the client? We have improved the client heartbeat reliability. It would also be good to see the logs from the client around the time the TTL expired.
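
A minimal client agent config sketch for capturing such logs, assuming nothing beyond standard agent options (log_level is the relevant knob; DEBUG makes the heartbeat messages visible):

```hcl
# Client agent config sketch; values are illustrative assumptions, not recommendations.
log_level = "DEBUG"   # verbose agent logging, surfaces heartbeat/TTL messages

client {
  enabled = true
}
```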

@smohankarthik
Author

smohankarthik commented Sep 28, 2017

@dadgar I changed to the beta version and didn't come across the heartbeat issue again.

I have a couple of questions:

  1. Just curious, is there any way to increase the heartbeat TTL manually? I never faced this kind of heartbeat issue in the past; deployments were smooth most of the time.
  2. Where can I find the log files on Windows? My old allocation has already been removed, so I can't see its logs.

Nomad version

Nomad v0.7.0 Beta

After upgrading the client version, I am now facing another error: a Driver Failure caused by an RPC failure and a timeout while waiting for the executor plugin to start.

$ nomad status ed397a74
ID                  = ed397a74
Eval ID             = 09ab1f5a
Name                = xxxxxxx
Node ID             = c313f583
Job ID              = test
Job Version         = 44
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 09/28/17 04:37:48 UTC
Deployment ID       = 41f43c6f
Deployment Health   = unhealthy

Task "service1" is "dead"
Task Resources
CPU      Memory  Disk     IOPS  Addresses
100 MHz  10 MiB  300 MiB  0     http: 10.0.0.162:23367
                                orleans_cluster: 10.0.0.162:20963
                                orleans_proxy: 10.0.0.162:24310

Task Events:
Started At     = 09/28/17 04:53:23 UTC
Finished At    = 09/28/17 04:59:03 UTC
Total Restarts = 3
Last Restart   = 09/28/17 12:57:45 +0800

Recent Events:
Time                   Type            Description
09/28/17 04:59:03 UTC  Not Restarting  Error was unrecoverable
09/28/17 04:59:03 UTC  Driver Failure  failed to start task "service1" for alloc "ed397a74-d9b6-5fde-2f4f-1a363b193881": error creating rpc client for executor plugin: timeout while waiting for plugin to start
09/28/17 04:57:45 UTC  Restarting      Task restarting in 16.458998203s
09/28/17 04:57:45 UTC  Terminated      Exit Code: 3762504530
09/28/17 04:53:23 UTC  Started         Task started by client
09/28/17 04:52:33 UTC  Restarting      Task restarting in 15.246803717s
09/28/17 04:52:33 UTC  Terminated      Exit Code: 3762504530
09/28/17 04:48:09 UTC  Started         Task started by client
09/28/17 04:47:22 UTC  Restarting      Task restarting in 15.099004097s
09/28/17 04:47:22 UTC  Terminated      Exit Code: 3762504530

@schmichael
Member

  1. Just curious, is there any way to increase the heartbeat TTL manually?

There are a few heartbeat related settings on the server: https://www.nomadproject.io/docs/agent/configuration/server.html#heartbeat_grace

Increasing the grace setting is probably the most straightforward way to give clients more time to heartbeat.
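
For example (a minimal sketch with illustrative durations, not recommendations), the server stanza in the agent config might look like this:

```hcl
# Server agent config sketch; the durations here are illustrative assumptions.
server {
  enabled          = true
  bootstrap_expect = 1          # single-server example only

  # Extra time allowed after a missed heartbeat before a node is marked down.
  heartbeat_grace   = "30s"

  # Lower bound on the heartbeat TTL the server hands out to clients.
  min_heartbeat_ttl = "10s"
}
```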

  2. Where can I find the log files on Windows? My old allocation has already been removed, so I can't see its logs.

Is there any chance you can reproduce this behavior? The logs would be very useful in debugging (including the ${alloc_dir}/${alloc_id}/${task_name}/executor.out log file), and if the allocation has already been GC'd there's really nothing you can do to recover the logs.
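
As a pointer for the Windows question above: the allocation directory normally lives under the agent's data_dir, so with a client config along these lines (the path is an assumed example, not a default) the executor log for a task would typically sit at <data_dir>\alloc\<alloc-id>\<task-name>\executor.out:

```hcl
# Client agent config sketch; the Windows path is a hypothetical example.
data_dir = "C:\\nomad\\data"

client {
  enabled = true
}

# With this config, an allocation's executor log would usually be found at
# C:\nomad\data\alloc\<alloc-id>\<task-name>\executor.out
```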

@mbp

mbp commented Nov 24, 2017

@smohankarthik

Did you ever figure out a fix? I seem to be having a similar problem after upgrading to Nomad 0.7.0.

Time                   Type        Description
11/24/17 09:57:51 GMT  Killed      Task successfully killed
11/24/17 09:57:51 GMT  Restarting  Task restarting in 18.122063691s
11/24/17 09:57:51 GMT  Terminated  Exit Code: 0, Exit Message: "unexpected EOF"
11/24/17 09:49:08 GMT  Started     Task started by client
11/24/17 09:49:06 GMT  Task Setup  Building Task Directory
11/24/17 09:49:06 GMT  Received    Task received by client

Time                   Type            Description
11/24/17 09:58:12 GMT  Not Restarting  Error was unrecoverable
11/24/17 09:58:12 GMT  Driver Failure  failed to start task "xxx" for alloc "e3081fa3-c96d-4e5a-7f4d-3580389a3d7a": unable to dispense the executor plugin: EOF
11/24/17 09:57:57 GMT  Task Setup      Building Task Directory
11/24/17 09:57:57 GMT  Received        Task received by client

@smohankarthik
Author

Try cleaning everything up and retrying. It worked for me, so I hope it works for you as well; I'm not sure what the issue was.

@dadgar
Copy link
Contributor

dadgar commented Dec 6, 2017

I am going to close this since the original issue seems to be fixed and the other doesn't look related.

@dadgar dadgar closed this as completed Dec 6, 2017
@github-actions

github-actions bot commented Dec 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022