Nomad version
Nomad v0.6.0
Operating system and Environment details
Ubuntu 16.04 LTS (Xenial) running in AWS. Three masters in one AWS region, eight clients in multiple regions.
Issue
After some AWS network instability, we had a job (webhook_delivery_service) spawning dozens of deployments per second. We managed to stabilize the deployment storm with nomad stop webhook_delivery_service and a GC to clean out the deployments, but that left our cluster with dozens of running deployments for that service, thousands of failed deployments, and a job index in the 120k range.
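For context, the cleanup amounted to roughly the following; the GC was triggered through the HTTP API, so the exact invocation here is approximate:
# Stop the runaway job
nomad stop webhook_delivery_service
# Force a server-side garbage collection to sweep dead deployments/evals/allocs
# (approximate; assumes NOMAD_ADDR points at one of the servers)
curl -X PUT ${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/system/gc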
We restarted the job, but since then there has been a ton of instability with deployments.
I realize this is a ton of weirdness and my primary concern is getting the cluster back to a good state with respect to this job! Other jobs on the cluster are unaffected, and underneath all of this the affected service itself is operating successfully.
The current state is:
nomad status
Note that the latest deployment's status shows 38 healthy, even though only 3 placements are desired.
ID = webhook_delivery_service
Name = webhook_delivery_service
Submit Date = 10/03/17 14:36:21 UTC
Type = service
Priority = 50
Datacenters = us-west-2,us-west-1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
prod-wds 0 0 3 0 2 0
Latest Deployment
ID = 296fb9ef
Status = running
Description = Deployment is running
Deployed
Task Group Auto Revert Desired Placed Healthy Unhealthy
prod-wds true 3 3 38 0
Allocations
ID Node ID Task Group Version Desired Status Created At
b92727c3 ec2b7499 prod-wds 125090 run running 11/08/17 16:50:06 UTC
c626b1e6 ec2c56f5 prod-wds 125090 run running 10/24/17 19:25:44 UTC
d22ba86b 13c82b37 prod-wds 125090 run running 10/24/17 19:25:24 UTC
nomad deployment list
The failed deployments at the top occurred at about 1 per second before stabilizing on 296fb9ef.
nomad deployment list | grep webhook_delivery_service
296fb9ef webhook_delivery_service 125090 running Deployment is running
fb9c8abf webhook_delivery_service 125089 failed Failed due to unhealthy allocations - rolling back to job version 114687
d5c20b2d webhook_delivery_service 125088 failed Failed due to unhealthy allocations - rolling back to job version 114687
07e29549 webhook_delivery_service 125087 failed Failed due to unhealthy allocations - rolling back to job version 114687
04285cc7 webhook_delivery_service 125086 failed Failed due to unhealthy allocations - rolling back to job version 114687
be6a3312 webhook_delivery_service 125085 failed Failed due to unhealthy allocations - rolling back to job version 114687
bcc6239c webhook_delivery_service 125084 failed Failed due to unhealthy allocations - rolling back to job version 114687
5f4397ba webhook_delivery_service 125083 failed Failed due to unhealthy allocations - rolling back to job version 114687
0b09874f webhook_delivery_service 125082 failed Failed due to unhealthy allocations - rolling back to job version 114687
29b461de webhook_delivery_service 125081 failed Failed due to unhealthy allocations - rolling back to job version 114687
b09ae816 webhook_delivery_service 125080 failed Failed due to unhealthy allocations - rolling back to job version 114687
502b4340 webhook_delivery_service 125079 failed Failed due to unhealthy allocations - rolling back to job version 114687
78d3cde9 webhook_delivery_service 125078 failed Failed due to unhealthy allocations - rolling back to job version 114687
c71ef0a7 webhook_delivery_service 125077 failed Failed due to unhealthy allocations - rolling back to job version 114687
de2d5c48 webhook_delivery_service 125076 failed Failed due to unhealthy allocations - rolling back to job version 114687
c1c95705 webhook_delivery_service 125075 failed Failed due to unhealthy allocations - rolling back to job version 114687
e3bb182d webhook_delivery_service 125074 failed Failed due to unhealthy allocations - rolling back to job version 114687
42cfbccc webhook_delivery_service 125073 failed Failed due to unhealthy allocations - rolling back to job version 114687
a2287c11 webhook_delivery_service 125072 failed Failed due to unhealthy allocations - rolling back to job version 114687
446c5131 webhook_delivery_service 125071 failed Failed due to unhealthy allocations - rolling back to job version 114687
1daa5718 webhook_delivery_service 125070 failed Failed due to unhealthy allocations - rolling back to job version 114687
2b3511f5 webhook_delivery_service 125069 failed Failed due to unhealthy allocations - rolling back to job version 114687
2d85ef46 webhook_delivery_service 125068 failed Failed due to unhealthy allocations - rolling back to job version 114687
837aada6 webhook_delivery_service 125067 failed Failed due to unhealthy allocations - rolling back to job version 114687
5cb86d88 webhook_delivery_service 125066 failed Failed due to unhealthy allocations - rolling back to job version 114687
4ee6ee99 webhook_delivery_service 125065 failed Failed due to unhealthy allocations - rolling back to job version 114687
352e1ac9 webhook_delivery_service 125064 failed Failed due to unhealthy allocations - rolling back to job version 114687
ac97ed66 webhook_delivery_service 125063 failed Failed due to unhealthy allocations - rolling back to job version 114687
07b23a6c webhook_delivery_service 125062 failed Failed due to unhealthy allocations - rolling back to job version 114687
90e0c5e6 webhook_delivery_service 125061 failed Failed due to unhealthy allocations - rolling back to job version 114687
56d00a2f webhook_delivery_service 125060 failed Failed due to unhealthy allocations - rolling back to job version 114687
fc790675 webhook_delivery_service 125059 failed Failed due to unhealthy allocations - rolling back to job version 114687
4a1f6d71 webhook_delivery_service 125058 cancelled Cancelled due to newer version of job
602ba7e2 webhook_delivery_service 125057 failed Failed due to unhealthy allocations - rolling back to job version 114687
3e79d689 webhook_delivery_service 125056 failed Failed due to unhealthy allocations - rolling back to job version 114687
25be5bb5 webhook_delivery_service 125055 failed Failed due to unhealthy allocations - rolling back to job version 114687
e6212e10 webhook_delivery_service 125054 failed Failed due to unhealthy allocations - rolling back to job version 114687
e94bc9f5 webhook_delivery_service 125053 failed Failed due to unhealthy allocations - rolling back to job version 114687
f80adb5d webhook_delivery_service 125052 failed Failed due to unhealthy allocations - rolling back to job version 114687
0ba0070c webhook_delivery_service 125051 failed Failed due to unhealthy allocations - rolling back to job version 114687
12b70616 webhook_delivery_service 125050 failed Failed due to unhealthy allocations - rolling back to job version 114687
4cdd9f71 webhook_delivery_service 125049 failed Failed due to unhealthy allocations - rolling back to job version 114687
59892e63 webhook_delivery_service 125048 failed Failed due to unhealthy allocations - rolling back to job version 114687
e8b7489c webhook_delivery_service 125047 failed Failed due to unhealthy allocations - rolling back to job version 114687
4ac288d2 webhook_delivery_service 125046 failed Failed due to unhealthy allocations - rolling back to job version 114687
f9299f17 webhook_delivery_service 125045 failed Failed due to unhealthy allocations - rolling back to job version 114687
d13eedc0 webhook_delivery_service 125044 failed Failed due to unhealthy allocations - rolling back to job version 114687
7a88c4b8 webhook_delivery_service 125043 failed Failed due to unhealthy allocations - rolling back to job version 114687
d7febabe webhook_delivery_service 125042 failed Failed due to unhealthy allocations - rolling back to job version 114687
4187b44a webhook_delivery_service 125041 failed Failed due to unhealthy allocations - rolling back to job version 114687
d21854d6 webhook_delivery_service 125040 failed Failed due to unhealthy allocations - rolling back to job version 114687
ef3f29a2 webhook_delivery_service 125039 failed Failed due to unhealthy allocations - rolling back to job version 114687
a87e1825 webhook_delivery_service 125038 failed Failed due to unhealthy allocations - rolling back to job version 114687
d423a849 webhook_delivery_service 125037 failed Failed due to unhealthy allocations - rolling back to job version 114687
a0281eb2 webhook_delivery_service 125036 failed Failed due to unhealthy allocations - rolling back to job version 114687
442cdfdf webhook_delivery_service 125035 failed Failed due to unhealthy allocations - rolling back to job version 114687
33b4565d webhook_delivery_service 125034 failed Failed due to unhealthy allocations - rolling back to job version 114687
6283af46 webhook_delivery_service 125033 failed Failed due to unhealthy allocations - rolling back to job version 114687
c34e02ce webhook_delivery_service 125032 failed Failed due to unhealthy allocations - rolling back to job version 114687
a9a7a47c webhook_delivery_service 125031 failed Failed due to unhealthy allocations - rolling back to job version 114687
42ff1bb8 webhook_delivery_service 125030 failed Failed due to unhealthy allocations - rolling back to job version 114687
705cc75d webhook_delivery_service 125029 failed Failed due to unhealthy allocations - rolling back to job version 114687
1bffd7cb webhook_delivery_service 125028 failed Failed due to unhealthy allocations - rolling back to job version 114687
d535d6e2 webhook_delivery_service 125027 failed Failed due to unhealthy allocations - rolling back to job version 114687
c2bd54c6 webhook_delivery_service 125026 failed Failed due to unhealthy allocations - rolling back to job version 114687
515842d2 webhook_delivery_service 125025 failed Failed due to unhealthy allocations - rolling back to job version 114687
51dfa904 webhook_delivery_service 125024 failed Failed due to unhealthy allocations - rolling back to job version 114687
461a7532 webhook_delivery_service 125023 failed Failed due to unhealthy allocations - rolling back to job version 114687
837d6290 webhook_delivery_service 125022 failed Failed due to unhealthy allocations - rolling back to job version 114687
94ed4ca8 webhook_delivery_service 125021 failed Failed due to unhealthy allocations - rolling back to job version 114687
2eff2dc2 webhook_delivery_service 125020 failed Failed due to unhealthy allocations - rolling back to job version 114687
5d4e5c7c webhook_delivery_service 125019 failed Failed due to unhealthy allocations - rolling back to job version 114687
7b0ed293 webhook_delivery_service 125018 failed Failed due to unhealthy allocations - rolling back to job version 114687
f07f5467 webhook_delivery_service 125017 failed Failed due to unhealthy allocations - rolling back to job version 114687
ad56b28f webhook_delivery_service 125016 failed Failed due to unhealthy allocations - rolling back to job version 114687
103c3048 webhook_delivery_service 125015 failed Failed due to unhealthy allocations - rolling back to job version 114687
fe303806 webhook_delivery_service 125014 failed Deployment marked as failed - rolling back to job version 114687
6e90ad5d webhook_delivery_service 125013 failed Failed due to unhealthy allocations - rolling back to job version 114687
ce785673 webhook_delivery_service 125012 failed Failed due to unhealthy allocations - rolling back to job version 114687
16ee79e2 webhook_delivery_service 125011 failed Failed due to unhealthy allocations - rolling back to job version 114687
f36951e9 webhook_delivery_service 125010 failed Failed due to unhealthy allocations - rolling back to job version 114687
aedf7051 webhook_delivery_service 125009 failed Failed due to unhealthy allocations - rolling back to job version 114687
f260ed21 webhook_delivery_service 125008 failed Failed due to unhealthy allocations - rolling back to job version 114687
fbcf4540 webhook_delivery_service 125007 failed Failed due to unhealthy allocations - rolling back to job version 114687
5ae58503 webhook_delivery_service 125006 successful Deployment completed successfully
72c6b97c webhook_delivery_service 125005 successful Deployment completed successfully
a28b09e0 webhook_delivery_service 125004 successful Deployment completed successfully
9222f062 webhook_delivery_service 125003 successful Deployment completed successfully
0cb40809 webhook_delivery_service 125003 successful Deployment completed successfully
1f0bd3b5 webhook_delivery_service 125003 successful Deployment completed successfully
50b5fbb2 webhook_delivery_service 64164 failed Deployment marked as failed - rolling back to job version 114687
177da138 webhook_delivery_service 34500 failed Deployment marked as failed - rolling back to job version 114687
7cb6d780 webhook_delivery_service 24508 failed Deployment marked as failed - rolling back to job version 114687
78f56e02 webhook_delivery_service 17557 failed Deployment marked as failed - rolling back to job version 114687
d6dd5da9 webhook_delivery_service 15722 running Deployment is running
9d0d82ca webhook_delivery_service 15513 running Deployment is running
d4427943 webhook_delivery_service 14298 running Deployment is running
39cfc9c1 webhook_delivery_service 12313 running Deployment is running
a3d3fd50 webhook_delivery_service 9799 running Deployment is running
ab259d32 webhook_delivery_service 8649 running Deployment is running
9e6830c4 webhook_delivery_service 8429 running Deployment is running
fdef5a31 webhook_delivery_service 8393 running Deployment is running
6eb90166 webhook_delivery_service 8384 running Deployment is running
4a00520a webhook_delivery_service 7592 running Deployment is running
43a62560 webhook_delivery_service 4246 running Deployment is running
b83fb198 webhook_delivery_service 3930 running Deployment is running
e8a22003 webhook_delivery_service 3463 running Deployment is running
83da9116 webhook_delivery_service 3217 running Deployment is running
70a08a7a webhook_delivery_service 3141 running Deployment is running
06739c5f webhook_delivery_service 3018 running Deployment is running
cf7fcca1 webhook_delivery_service 2973 running Deployment is running
2f97fab1 webhook_delivery_service 2942 running Deployment is running
35502659 webhook_delivery_service 2907 running Deployment is running
842f385b webhook_delivery_service 2863 running Deployment is running
14131060 webhook_delivery_service 2792 running Deployment is running
5c3de641 webhook_delivery_service 1593 running Deployment is running
566e3ed4 webhook_delivery_service 697 running Deployment is running
7ed79925 webhook_delivery_service 626 running Deployment is running
c61834e4 webhook_delivery_service 155 running Deployment is running
22b577b6 webhook_delivery_service 151 running Deployment is running
a2e73d83 webhook_delivery_service 140 running Deployment is running
6c4170c3 webhook_delivery_service 85 running Deployment is running
e6e77e04 webhook_delivery_service 70 running Deployment is running
nomad deployment status
Here's the status of a failed deployment:
!!!! rlafferty@prod-clustermgr-master10:~ $ nomad deployment status fb9c8abf
ID = fb9c8abf
Job ID = webhook_delivery_service
Job Version = 125089
Status = failed
Description = Failed due to unhealthy allocations - rolling back to job version 114687
Deployed
Task Group Auto Revert Desired Placed Healthy Unhealthy
prod-wds true 3 3 2 1
Here's the most recent deployment, stuck in the running state:
!!!! rlafferty@prod-clustermgr-master10:~ $ nomad deployment status 296fb9ef
ID = 296fb9ef
Job ID = webhook_delivery_service
Job Version = 125090
Status = running
Description = Deployment is running
Deployed
Task Group Auto Revert Desired Placed Healthy Unhealthy
prod-wds true 3 3 38 0
And here's one of the stuck running deployments from much earlier:
!!!! rlafferty@prod-clustermgr-master10:~ $ nomad deployment status e6e77e04
ID = e6e77e04
Job ID = webhook_delivery_service
Job Version = 70
Status = running
Description = Deployment is running
Deployed
Task Group Auto Revert Desired Placed Healthy Unhealthy
prod-wds true 3 3 0 0
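For reference, my understanding is that individual stuck deployments can be failed by hand. I haven't run this, so the sketch below is only how I'd expect it to look (the awk filter over the list output is illustrative):
# Sketch only, not yet run: fail every deployment for this job that is still
# "running", except the current one (296fb9ef). Column 4 of the list output is the status.
nomad deployment list \
  | awk '$2 == "webhook_delivery_service" && $4 == "running" {print $1}' \
  | grep -v '^296fb9ef' \
  | while read id; do nomad deployment fail "$id"; done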
nomad alloc-status
ID = b92727c3
Eval ID = 9006baf2
Name = webhook_delivery_service.prod-wds[0]
Node ID = ec2b7499
Job ID = webhook_delivery_service
Job Version = 125090
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 11/08/17 16:50:06 UTC
Deployment ID = 296fb9ef
Deployment Health = healthy
Task "wds" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
30/250 MHz 95 MiB/512 MiB 300 MiB 0 http: 172.18.0.156:21818
Task Events:
Started At = 11/08/17 16:50:12 UTC
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
11/08/17 16:50:12 UTC Started Task started by client
11/08/17 16:50:07 UTC Driver Downloading image pagerduty-docker.jfrog.io/webhook_delivery_service:1f8c2c9613cdf162670b16c84bd6de12d85417fa
11/08/17 16:50:06 UTC Task Setup Building Task Directory
11/08/17 16:50:06 UTC Received Task received by client
!!!! rlafferty@prod-clustermgr-master10:~ $ nomad alloc-status c626b1e6
ID = c626b1e6
Eval ID = 9006baf2
Name = webhook_delivery_service.prod-wds[2]
Node ID = ec2c56f5
Job ID = webhook_delivery_service
Job Version = 125090
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 10/24/17 19:25:44 UTC
Deployment ID = 296fb9ef
Deployment Health = healthy
Task "wds" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
17/250 MHz 115 MiB/512 MiB 300 MiB 0 http: 172.18.0.52:28934
Task Events:
Started At = 10/24/17 19:25:55 UTC
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
10/24/17 19:25:55 UTC Started Task started by client
10/24/17 19:25:47 UTC Driver Downloading image pagerduty-docker.jfrog.io/webhook_delivery_service:1f8c2c9613cdf162670b16c84bd6de12d85417fa
10/24/17 19:25:44 UTC Task Setup Building Task Directory
10/24/17 19:25:44 UTC Received Task received by client
!!!! rlafferty@prod-clustermgr-master10:~ $ nomad alloc-status d22ba86b
ID = d22ba86b
Eval ID = 9006baf2
Name = webhook_delivery_service.prod-wds[1]
Node ID = 13c82b37
Job ID = webhook_delivery_service
Job Version = 125090
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 10/24/17 19:25:24 UTC
Deployment ID = 296fb9ef
Deployment Health = unset
Task "wds" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
21/250 MHz 90 MiB/512 MiB 300 MiB 0 http: 172.19.128.18:23160
Task Events:
Started At = 11/08/17 03:13:43 UTC
Finished At = N/A
Total Restarts = 1
Last Restart = 11/08/17 03:13:10 UTC
Recent Events:
Time Type Description
11/08/17 03:13:43 UTC Started Task started by client
11/08/17 03:13:10 UTC Restarting Task restarting in 32.712091334s
11/08/17 03:13:10 UTC Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
10/24/17 19:25:33 UTC Started Task started by client
10/24/17 19:25:27 UTC Driver Downloading image pagerduty-docker.jfrog.io/webhook_delivery_service:1f8c2c9613cdf162670b16c84bd6de12d85417fa
10/24/17 19:25:24 UTC Task Setup Building Task Directory
10/24/17 19:25:24 UTC Received Task received by client
Interesting log entries
There isn't a lot of logging that's obviously connected to these allocations/deployments, but one thing that did catch my eye is this, repeated from clients whose allocations' deployment health is unset:
Nov 08 16:44:08 prod-clustermgr-client34 nomad[2045]: 2017/11/08 16:44:08.654482 [WARN] client: failed to broadcast update to allocation "e5d4f32e-ce51-fbb7-bed3-d460770f01a2"
During the initial incident that led to this state, we saw tons of failed to broadcast update to allocation messages as well as lots of
[ERR] client: dropping update to alloc 'd4da04f2-4d8c-c4fa-22ff-8f737e52220d'
messages, all referring to allocations associated with this job.
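For what it's worth, the unset health is also visible on the raw allocation object. A minimal way to check it, assuming NOMAD_ADDR points at a server and jq is available (the full allocation ID is required):
# Inspect the allocation's DeploymentStatus via the HTTP API (sketch;
# substitute the full allocation ID for the short d22ba86b shown above)
curl -s ${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/allocation/<full-alloc-id> | jq '.DeploymentStatus'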
I have seen this as well. I can only suggest upgrading to 0.6.3, which has much more stable deployments; as far as I can tell this issue was resolved somewhere along the way.