
Nomad allocation getting killed and transferred to another node in the cluster #3289

Closed
smohankarthik opened this issue Sep 27, 2017 · 7 comments


@smohankarthik

Nomad version

Nomad v0.6.3

Operating system and Environment details

Ubuntu 16.04(Nomad Server)/Windows 2016 Server(Nomad Client Node)

Issue

When I deploy services to the Windows nodes (it's a 2-node client cluster), the services run fine at first, but after about 10 minutes the processes are killed and the allocations are moved to the other node in the cluster.
I cross-checked on the node having this problem: everything works when I drain the other node and deploy all services to a single node, but it fails when running a 2- or 3-node cluster.

The recent events for the allocation show that it is being killed:

$ nomad alloc-status 71e73261
ID                  = 71e73261
Eval ID             = 2d27e486
Name                = xxxxxxxx
Node ID             = c313f583
Job ID              = test
Job Version         = 35
Client Status       = complete
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created At          = 09/27/17 05:11:09 UTC
Deployment ID       = 12c0c923
Deployment Health   = healthy

Task "xxxxxxxxxxx" is "dead"
Task Resources
CPU      Memory  Disk     IOPS  Addresses
100 MHz  10 MiB  300 MiB  0     http: 10.0.0.162:28060
                                orleans_cluster: 10.0.0.162:24547
                                orleans_proxy: 10.0.0.162:28553

Task Events:
Started At     = 09/27/17 05:13:05 UTC
Finished At    = 09/27/17 05:23:09 UTC
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type                   Description
09/27/17 05:23:08 UTC  Killed                 Task successfully killed
09/27/17 05:23:06 UTC  Killing                Sent interrupt. Waiting 5s before force killing
09/27/17 05:13:05 UTC  Started                Task started by client
09/27/17 05:10:55 UTC  Downloading Artifacts  Client is downloading artifacts
09/27/17 05:10:55 UTC  Task Setup             Building Task Directory
09/27/17 05:10:55 UTC  Received               Task received by client


Nomad Server logs

I found that the Nomad heartbeat TTL expired. I'm not quite sure, but this is what I see in the Nomad server log:


2017/09/27 05:23:17.299258 [WARN] nomad.heartbeat: node 'c313f583-e0ea-9bfb-93f7-f4c0de2fddd8' TTL expired
2017/09/27 05:23:17.474754 [DEBUG] sched: <Eval 'c12ee9f5-dbb4-6938-8eec-2bd1ce17f622' JobID: 'test'>: Total changes: (place 4) (destructive 0) (inplace 0) (stop 4)
Created Deployment: "48bc3767-9a0b-2d6b-407e-f5fdcd1fe66e"
Desired Changes for "service1": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service2": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "service3": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service4": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service5": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "service6": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
Desired Changes for "service7": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "service8": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
@dadgar
Contributor

dadgar commented Sep 27, 2017

@smohankarthik Can you try out the Nomad 0.7 beta on the client? We have improved the client heartbeat reliability. It would also be good to see the logs from the client around the time the TTL expired.
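
A minimal client agent config sketch for capturing such logs, assuming nothing beyond standard agent options (log_level is the relevant knob; DEBUG makes the heartbeat messages visible):

```hcl
# Client agent config sketch; values are illustrative assumptions, not recommendations.
log_level = "DEBUG"   # verbose agent logging, surfaces heartbeat/TTL messages

client {
  enabled = true
}
```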

@smohankarthik
Author

smohankarthik commented Sep 28, 2017

@dadgar I changed to the beta version and didn't come across the heartbeat issue again.

I have a couple of questions:

  1. Just curious, is there any way to increase the heartbeat TTL manually? I never faced this kind of heartbeat issue in the past; deployments were smooth most of the time.
  2. Where can I find the log files on Windows? My old allocation has already been removed, so I can't see its logs.

Nomad version

Nomad v0.7.0 Beta

After upgrading the client version, I am now facing another error: a Driver Failure caused by an RPC failure and a timeout while waiting for the executor plugin to start.

$ nomad status ed397a74
ID                  = ed397a74
Eval ID             = 09ab1f5a
Name                = xxxxxxx
Node ID             = c313f583
Job ID              = test
Job Version         = 44
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 09/28/17 04:37:48 UTC
Deployment ID       = 41f43c6f
Deployment Health   = unhealthy

Task "service1" is "dead"
Task Resources
CPU      Memory  Disk     IOPS  Addresses
100 MHz  10 MiB  300 MiB  0     http: 10.0.0.162:23367
                                orleans_cluster: 10.0.0.162:20963
                                orleans_proxy: 10.0.0.162:24310

Task Events:
Started At     = 09/28/17 04:53:23 UTC
Finished At    = 09/28/17 04:59:03 UTC
Total Restarts = 3
Last Restart   = 09/28/17 12:57:45 +0800

Recent Events:
Time                   Type            Description
09/28/17 04:59:03 UTC  Not Restarting  Error was unrecoverable
09/28/17 04:59:03 UTC  Driver Failure  failed to start task "service1" for alloc "ed397a74-d9b6-5fde-2f4f-1a363b193881": error creating rpc client for executor plugin: timeout while waiting for plugin to start
09/28/17 04:57:45 UTC  Restarting      Task restarting in 16.458998203s
09/28/17 04:57:45 UTC  Terminated      Exit Code: 3762504530
09/28/17 04:53:23 UTC  Started         Task started by client
09/28/17 04:52:33 UTC  Restarting      Task restarting in 15.246803717s
09/28/17 04:52:33 UTC  Terminated      Exit Code: 3762504530
09/28/17 04:48:09 UTC  Started         Task started by client
09/28/17 04:47:22 UTC  Restarting      Task restarting in 15.099004097s
09/28/17 04:47:22 UTC  Terminated      Exit Code: 3762504530

@schmichael
Member

  1. Just curious, is there any way to increase the heartbeat TTL manually?

There are a few heartbeat related settings on the server: https://www.nomadproject.io/docs/agent/configuration/server.html#heartbeat_grace

Increasing the grace setting is probably the most straightforward way to give clients more time to heartbeat.
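
For example (a minimal sketch with illustrative durations, not recommendations), the server stanza in the agent config might look like this:

```hcl
# Server agent config sketch; the durations here are illustrative assumptions.
server {
  enabled          = true
  bootstrap_expect = 1          # single-server example only

  # Extra time allowed after a missed heartbeat before a node is marked down.
  heartbeat_grace   = "30s"

  # Lower bound on the heartbeat TTL the server hands out to clients.
  min_heartbeat_ttl = "10s"
}
```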

  2. Where can I find the log files on Windows? My old allocation has already been removed, so I can't see its logs.

Is there any chance you can reproduce this behavior? The logs would be very useful in debugging (including the ${alloc_dir}/${alloc_id}/${task_name}/executor.out log file), and if the allocation has already been GC'd there's really nothing you can do to recover the logs.
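
As a pointer for the Windows question above: the allocation directory normally lives under the agent's data_dir, so with a client config along these lines (the path is an assumed example, not a default) the executor log for a task would typically sit at <data_dir>\alloc\<alloc-id>\<task-name>\executor.out:

```hcl
# Client agent config sketch; the Windows path is a hypothetical example.
data_dir = "C:\\nomad\\data"

client {
  enabled = true
}

# With this config, an allocation's executor log would usually be found at
# C:\nomad\data\alloc\<alloc-id>\<task-name>\executor.out
```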

@mbp

mbp commented Nov 24, 2017

@smohankarthik

Did you ever figure out a fix? I seem to be having a similar problem after upgrading to Nomad 0.7.0.

Time                   Type        Description
11/24/17 09:57:51 GMT  Killed      Task successfully killed
11/24/17 09:57:51 GMT  Restarting  Task restarting in 18.122063691s
11/24/17 09:57:51 GMT  Terminated  Exit Code: 0, Exit Message: "unexpected EOF"
11/24/17 09:49:08 GMT  Started     Task started by client
11/24/17 09:49:06 GMT  Task Setup  Building Task Directory
11/24/17 09:49:06 GMT  Received    Task received by client

Time                   Type            Description
11/24/17 09:58:12 GMT  Not Restarting  Error was unrecoverable
11/24/17 09:58:12 GMT  Driver Failure  failed to start task "xxx" for alloc "e3081fa3-c96d-4e5a-7f4d-3580389a3d7a": unable to dispense the executor plugin: EOF
11/24/17 09:57:57 GMT  Task Setup      Building Task Directory
11/24/17 09:57:57 GMT  Received        Task received by client

@smohankarthik
Author

Try cleaning everything up and retrying. It worked for me, so I hope it works for you as well; I'm not sure what the issue was.

@dadgar
Copy link
Contributor

dadgar commented Dec 6, 2017

I am going to close this since the original issue seems to be fixed and the other doesn't look related.

@dadgar dadgar closed this as completed Dec 6, 2017
@github-actions

github-actions bot commented Dec 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022