Nomad deployment storm after network segmentation with residual effects #3525

Closed
wlonkly opened this issue Nov 8, 2017 · 3 comments

wlonkly commented Nov 8, 2017

Nomad version

Nomad v0.6.0

Operating system and Environment details

Ubuntu 16.04 LTS (Xenial) running in AWS. Three masters in one AWS region, eight clients in multiple regions.

Issue

After some AWS network instability, we had a job (webhook_delivery_service) spawning dozens of deployments per second. We managed to stabilize the deployment storm with nomad stop webhook_delivery_service and doing a GC to clean out the deployments, but that left our cluster in a state with dozens of running deployments for that service and thousands of failed deployments (and a job index in the 120k range).
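
For readers who want to script the stop-and-GC step described above, here is a minimal sketch in Python. It assumes the agent's HTTP API is reachable at the default local address and that `nomad stop` corresponds to the job deregistration endpoint (`DELETE /v1/job/<job_id>`) and the garbage collection to `PUT /v1/system/gc`. The helper names are hypothetical, and nothing is sent unless `send` is called explicitly.

```python
import urllib.request

# Assumption: a Nomad agent listening on the default HTTP address.
NOMAD_ADDR = "http://127.0.0.1:4646"


def build_recovery_requests(job_id, addr=NOMAD_ADDR):
    """Build (but do not send) the two requests: stop the job, then force a GC."""
    stop_job = urllib.request.Request(f"{addr}/v1/job/{job_id}", method="DELETE")
    force_gc = urllib.request.Request(f"{addr}/v1/system/gc", method="PUT")
    return [stop_job, force_gc]


def send(requests, timeout=10):
    """Send each request in order and print the HTTP status it returned."""
    for req in requests:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            print(req.get_method(), req.full_url, resp.status)


if __name__ == "__main__":
    # Dry run: only show what would be sent.
    for req in build_recovery_requests("webhook_delivery_service"):
        print(req.get_method(), req.full_url)
```

Building the requests separately from sending them makes it possible to review the exact calls before running them against a production cluster.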

We restarted the job, but since then there has been a ton of instability with deployments.

I realize this is a ton of weirdness and my primary concern is getting the cluster back to a good state with respect to this job! Other jobs on the cluster are unaffected, and underneath all of this the affected service itself is operating successfully.

The current state is:

nomad status

Note that the latest deployment reports 38 healthy allocations, even though only 3 are desired and placed.

ID            = webhook_delivery_service
Name          = webhook_delivery_service
Submit Date   = 10/03/17 14:36:21 UTC
Type          = service
Priority      = 50
Datacenters   = us-west-2,us-west-1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
prod-wds    0       0         3        0       2         0

Latest Deployment
ID          = 296fb9ef
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
prod-wds    true         3        3       38       0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
b92727c3  ec2b7499  prod-wds    125090   run      running  11/08/17 16:50:06 UTC
c626b1e6  ec2c56f5  prod-wds    125090   run      running  10/24/17 19:25:44 UTC
d22ba86b  13c82b37  prod-wds    125090   run      running  10/24/17 19:25:24 UTC

nomad deployment list

The failed deployments at the top occurred at about 1 per second before stabilizing on 296fb9ef.

nomad deployment list | grep webhook_delivery_service
296fb9ef  webhook_delivery_service  125090       running     Deployment is running
fb9c8abf  webhook_delivery_service  125089       failed      Failed due to unhealthy allocations - rolling back to job version 114687
d5c20b2d  webhook_delivery_service  125088       failed      Failed due to unhealthy allocations - rolling back to job version 114687
07e29549  webhook_delivery_service  125087       failed      Failed due to unhealthy allocations - rolling back to job version 114687
04285cc7  webhook_delivery_service  125086       failed      Failed due to unhealthy allocations - rolling back to job version 114687
be6a3312  webhook_delivery_service  125085       failed      Failed due to unhealthy allocations - rolling back to job version 114687
bcc6239c  webhook_delivery_service  125084       failed      Failed due to unhealthy allocations - rolling back to job version 114687
5f4397ba  webhook_delivery_service  125083       failed      Failed due to unhealthy allocations - rolling back to job version 114687
0b09874f  webhook_delivery_service  125082       failed      Failed due to unhealthy allocations - rolling back to job version 114687
29b461de  webhook_delivery_service  125081       failed      Failed due to unhealthy allocations - rolling back to job version 114687
b09ae816  webhook_delivery_service  125080       failed      Failed due to unhealthy allocations - rolling back to job version 114687
502b4340  webhook_delivery_service  125079       failed      Failed due to unhealthy allocations - rolling back to job version 114687
78d3cde9  webhook_delivery_service  125078       failed      Failed due to unhealthy allocations - rolling back to job version 114687
c71ef0a7  webhook_delivery_service  125077       failed      Failed due to unhealthy allocations - rolling back to job version 114687
de2d5c48  webhook_delivery_service  125076       failed      Failed due to unhealthy allocations - rolling back to job version 114687
c1c95705  webhook_delivery_service  125075       failed      Failed due to unhealthy allocations - rolling back to job version 114687
e3bb182d  webhook_delivery_service  125074       failed      Failed due to unhealthy allocations - rolling back to job version 114687
42cfbccc  webhook_delivery_service  125073       failed      Failed due to unhealthy allocations - rolling back to job version 114687
a2287c11  webhook_delivery_service  125072       failed      Failed due to unhealthy allocations - rolling back to job version 114687
446c5131  webhook_delivery_service  125071       failed      Failed due to unhealthy allocations - rolling back to job version 114687
1daa5718  webhook_delivery_service  125070       failed      Failed due to unhealthy allocations - rolling back to job version 114687
2b3511f5  webhook_delivery_service  125069       failed      Failed due to unhealthy allocations - rolling back to job version 114687
2d85ef46  webhook_delivery_service  125068       failed      Failed due to unhealthy allocations - rolling back to job version 114687
837aada6  webhook_delivery_service  125067       failed      Failed due to unhealthy allocations - rolling back to job version 114687
5cb86d88  webhook_delivery_service  125066       failed      Failed due to unhealthy allocations - rolling back to job version 114687
4ee6ee99  webhook_delivery_service  125065       failed      Failed due to unhealthy allocations - rolling back to job version 114687
352e1ac9  webhook_delivery_service  125064       failed      Failed due to unhealthy allocations - rolling back to job version 114687
ac97ed66  webhook_delivery_service  125063       failed      Failed due to unhealthy allocations - rolling back to job version 114687
07b23a6c  webhook_delivery_service  125062       failed      Failed due to unhealthy allocations - rolling back to job version 114687
90e0c5e6  webhook_delivery_service  125061       failed      Failed due to unhealthy allocations - rolling back to job version 114687
56d00a2f  webhook_delivery_service  125060       failed      Failed due to unhealthy allocations - rolling back to job version 114687
fc790675  webhook_delivery_service  125059       failed      Failed due to unhealthy allocations - rolling back to job version 114687
4a1f6d71  webhook_delivery_service  125058       cancelled   Cancelled due to newer version of job
602ba7e2  webhook_delivery_service  125057       failed      Failed due to unhealthy allocations - rolling back to job version 114687
3e79d689  webhook_delivery_service  125056       failed      Failed due to unhealthy allocations - rolling back to job version 114687
25be5bb5  webhook_delivery_service  125055       failed      Failed due to unhealthy allocations - rolling back to job version 114687
e6212e10  webhook_delivery_service  125054       failed      Failed due to unhealthy allocations - rolling back to job version 114687
e94bc9f5  webhook_delivery_service  125053       failed      Failed due to unhealthy allocations - rolling back to job version 114687
f80adb5d  webhook_delivery_service  125052       failed      Failed due to unhealthy allocations - rolling back to job version 114687
0ba0070c  webhook_delivery_service  125051       failed      Failed due to unhealthy allocations - rolling back to job version 114687
12b70616  webhook_delivery_service  125050       failed      Failed due to unhealthy allocations - rolling back to job version 114687
4cdd9f71  webhook_delivery_service  125049       failed      Failed due to unhealthy allocations - rolling back to job version 114687
59892e63  webhook_delivery_service  125048       failed      Failed due to unhealthy allocations - rolling back to job version 114687
e8b7489c  webhook_delivery_service  125047       failed      Failed due to unhealthy allocations - rolling back to job version 114687
4ac288d2  webhook_delivery_service  125046       failed      Failed due to unhealthy allocations - rolling back to job version 114687
f9299f17  webhook_delivery_service  125045       failed      Failed due to unhealthy allocations - rolling back to job version 114687
d13eedc0  webhook_delivery_service  125044       failed      Failed due to unhealthy allocations - rolling back to job version 114687
7a88c4b8  webhook_delivery_service  125043       failed      Failed due to unhealthy allocations - rolling back to job version 114687
d7febabe  webhook_delivery_service  125042       failed      Failed due to unhealthy allocations - rolling back to job version 114687
4187b44a  webhook_delivery_service  125041       failed      Failed due to unhealthy allocations - rolling back to job version 114687
d21854d6  webhook_delivery_service  125040       failed      Failed due to unhealthy allocations - rolling back to job version 114687
ef3f29a2  webhook_delivery_service  125039       failed      Failed due to unhealthy allocations - rolling back to job version 114687
a87e1825  webhook_delivery_service  125038       failed      Failed due to unhealthy allocations - rolling back to job version 114687
d423a849  webhook_delivery_service  125037       failed      Failed due to unhealthy allocations - rolling back to job version 114687
a0281eb2  webhook_delivery_service  125036       failed      Failed due to unhealthy allocations - rolling back to job version 114687
442cdfdf  webhook_delivery_service  125035       failed      Failed due to unhealthy allocations - rolling back to job version 114687
33b4565d  webhook_delivery_service  125034       failed      Failed due to unhealthy allocations - rolling back to job version 114687
6283af46  webhook_delivery_service  125033       failed      Failed due to unhealthy allocations - rolling back to job version 114687
c34e02ce  webhook_delivery_service  125032       failed      Failed due to unhealthy allocations - rolling back to job version 114687
a9a7a47c  webhook_delivery_service  125031       failed      Failed due to unhealthy allocations - rolling back to job version 114687
42ff1bb8  webhook_delivery_service  125030       failed      Failed due to unhealthy allocations - rolling back to job version 114687
705cc75d  webhook_delivery_service  125029       failed      Failed due to unhealthy allocations - rolling back to job version 114687
1bffd7cb  webhook_delivery_service  125028       failed      Failed due to unhealthy allocations - rolling back to job version 114687
d535d6e2  webhook_delivery_service  125027       failed      Failed due to unhealthy allocations - rolling back to job version 114687
c2bd54c6  webhook_delivery_service  125026       failed      Failed due to unhealthy allocations - rolling back to job version 114687
515842d2  webhook_delivery_service  125025       failed      Failed due to unhealthy allocations - rolling back to job version 114687
51dfa904  webhook_delivery_service  125024       failed      Failed due to unhealthy allocations - rolling back to job version 114687
461a7532  webhook_delivery_service  125023       failed      Failed due to unhealthy allocations - rolling back to job version 114687
837d6290  webhook_delivery_service  125022       failed      Failed due to unhealthy allocations - rolling back to job version 114687
94ed4ca8  webhook_delivery_service  125021       failed      Failed due to unhealthy allocations - rolling back to job version 114687
2eff2dc2  webhook_delivery_service  125020       failed      Failed due to unhealthy allocations - rolling back to job version 114687
5d4e5c7c  webhook_delivery_service  125019       failed      Failed due to unhealthy allocations - rolling back to job version 114687
7b0ed293  webhook_delivery_service  125018       failed      Failed due to unhealthy allocations - rolling back to job version 114687
f07f5467  webhook_delivery_service  125017       failed      Failed due to unhealthy allocations - rolling back to job version 114687
ad56b28f  webhook_delivery_service  125016       failed      Failed due to unhealthy allocations - rolling back to job version 114687
103c3048  webhook_delivery_service  125015       failed      Failed due to unhealthy allocations - rolling back to job version 114687
fe303806  webhook_delivery_service  125014       failed      Deployment marked as failed - rolling back to job version 114687
6e90ad5d  webhook_delivery_service  125013       failed      Failed due to unhealthy allocations - rolling back to job version 114687
ce785673  webhook_delivery_service  125012       failed      Failed due to unhealthy allocations - rolling back to job version 114687
16ee79e2  webhook_delivery_service  125011       failed      Failed due to unhealthy allocations - rolling back to job version 114687
f36951e9  webhook_delivery_service  125010       failed      Failed due to unhealthy allocations - rolling back to job version 114687
aedf7051  webhook_delivery_service  125009       failed      Failed due to unhealthy allocations - rolling back to job version 114687
f260ed21  webhook_delivery_service  125008       failed      Failed due to unhealthy allocations - rolling back to job version 114687
fbcf4540  webhook_delivery_service  125007       failed      Failed due to unhealthy allocations - rolling back to job version 114687
5ae58503  webhook_delivery_service  125006       successful  Deployment completed successfully
72c6b97c  webhook_delivery_service  125005       successful  Deployment completed successfully
a28b09e0  webhook_delivery_service  125004       successful  Deployment completed successfully
9222f062  webhook_delivery_service  125003       successful  Deployment completed successfully
0cb40809  webhook_delivery_service  125003       successful  Deployment completed successfully
1f0bd3b5  webhook_delivery_service  125003       successful  Deployment completed successfully
50b5fbb2  webhook_delivery_service  64164        failed      Deployment marked as failed - rolling back to job version 114687
177da138  webhook_delivery_service  34500        failed      Deployment marked as failed - rolling back to job version 114687
7cb6d780  webhook_delivery_service  24508        failed      Deployment marked as failed - rolling back to job version 114687
78f56e02  webhook_delivery_service  17557        failed      Deployment marked as failed - rolling back to job version 114687
d6dd5da9  webhook_delivery_service  15722        running     Deployment is running
9d0d82ca  webhook_delivery_service  15513        running     Deployment is running
d4427943  webhook_delivery_service  14298        running     Deployment is running
39cfc9c1  webhook_delivery_service  12313        running     Deployment is running
a3d3fd50  webhook_delivery_service  9799         running     Deployment is running
ab259d32  webhook_delivery_service  8649         running     Deployment is running
9e6830c4  webhook_delivery_service  8429         running     Deployment is running
fdef5a31  webhook_delivery_service  8393         running     Deployment is running
6eb90166  webhook_delivery_service  8384         running     Deployment is running
4a00520a  webhook_delivery_service  7592         running     Deployment is running
43a62560  webhook_delivery_service  4246         running     Deployment is running
b83fb198  webhook_delivery_service  3930         running     Deployment is running
e8a22003  webhook_delivery_service  3463         running     Deployment is running
83da9116  webhook_delivery_service  3217         running     Deployment is running
70a08a7a  webhook_delivery_service  3141         running     Deployment is running
06739c5f  webhook_delivery_service  3018         running     Deployment is running
cf7fcca1  webhook_delivery_service  2973         running     Deployment is running
2f97fab1  webhook_delivery_service  2942         running     Deployment is running
35502659  webhook_delivery_service  2907         running     Deployment is running
842f385b  webhook_delivery_service  2863         running     Deployment is running
14131060  webhook_delivery_service  2792         running     Deployment is running
5c3de641  webhook_delivery_service  1593         running     Deployment is running
566e3ed4  webhook_delivery_service  697          running     Deployment is running
7ed79925  webhook_delivery_service  626          running     Deployment is running
c61834e4  webhook_delivery_service  155          running     Deployment is running
22b577b6  webhook_delivery_service  151          running     Deployment is running
a2e73d83  webhook_delivery_service  140          running     Deployment is running
6c4170c3  webhook_delivery_service  85           running     Deployment is running
e6e77e04  webhook_delivery_service  70           running     Deployment is running
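
A listing this long is easier to reason about once it is tallied. The sketch below assumes the five-column layout shown above (ID, job ID, job version, status, description) and treats any deployment that is still running at a job version below the newest one as stale. The function names are hypothetical and the sample contains only a few rows copied from the listing.

```python
from collections import Counter


def parse_deployments(text):
    """Parse rows shaped like: ID  Job ID  Job Version  Status  Description."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.strip().split(None, 4)
        if len(parts) < 5 or not parts[2].isdigit():
            continue  # skip header or malformed lines
        dep_id, job_id, version, status, description = parts
        rows.append({
            "id": dep_id,
            "job": job_id,
            "version": int(version),
            "status": status,
            "description": description,
        })
    return rows


def stale_running(rows):
    """Return deployments still 'running' at a version older than the newest one."""
    latest = max(r["version"] for r in rows)
    return [r for r in rows if r["status"] == "running" and r["version"] < latest]


SAMPLE = """\
296fb9ef  webhook_delivery_service  125090       running     Deployment is running
fb9c8abf  webhook_delivery_service  125089       failed      Failed due to unhealthy allocations - rolling back to job version 114687
5ae58503  webhook_delivery_service  125006       successful  Deployment completed successfully
e6e77e04  webhook_delivery_service  70           running     Deployment is running
"""

rows = parse_deployments(SAMPLE)
status_counts = Counter(r["status"] for r in rows)
print(dict(status_counts))
print([r["id"] for r in stale_running(rows)])
```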

nomad deployment status

Here's the status of a failed deployment:

!!!! rlafferty@prod-clustermgr-master10:~ $ nomad deployment status fb9c8abf
ID          = fb9c8abf
Job ID      = webhook_delivery_service
Job Version = 125089
Status      = failed
Description = Failed due to unhealthy allocations - rolling back to job version 114687

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
prod-wds    true         3        3       2        1

Here's the status of the most recent deployment, which is stuck in the running state:

!!!! rlafferty@prod-clustermgr-master10:~ $ nomad deployment status 296fb9ef
ID          = 296fb9ef
Job ID      = webhook_delivery_service
Job Version = 125090
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
prod-wds    true         3        3       38       0

And here's the status of one of the stuck running deployments from much earlier:

!!!! rlafferty@prod-clustermgr-master10:~ $ nomad deployment status e6e77e04
ID          = e6e77e04
Job ID      = webhook_delivery_service
Job Version = 70
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy
prod-wds    true         3        3       0        0

nomad alloc-status

ID                  = b92727c3
Eval ID             = 9006baf2
Name                = webhook_delivery_service.prod-wds[0]
Node ID             = ec2b7499
Job ID              = webhook_delivery_service
Job Version         = 125090
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 11/08/17 16:50:06 UTC
Deployment ID       = 296fb9ef
Deployment Health   = healthy

Task "wds" is "running"
Task Resources
CPU         Memory          Disk     IOPS  Addresses
30/250 MHz  95 MiB/512 MiB  300 MiB  0     http: 172.18.0.156:21818

Task Events:
Started At     = 11/08/17 16:50:12 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type        Description
11/08/17 16:50:12 UTC  Started     Task started by client
11/08/17 16:50:07 UTC  Driver      Downloading image pagerduty-docker.jfrog.io/webhook_delivery_service:1f8c2c9613cdf162670b16c84bd6de12d85417fa
11/08/17 16:50:06 UTC  Task Setup  Building Task Directory
11/08/17 16:50:06 UTC  Received    Task received by client

!!!! rlafferty@prod-clustermgr-master10:~ $ nomad alloc-status c626b1e6
ID                  = c626b1e6
Eval ID             = 9006baf2
Name                = webhook_delivery_service.prod-wds[2]
Node ID             = ec2c56f5
Job ID              = webhook_delivery_service
Job Version         = 125090
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 10/24/17 19:25:44 UTC
Deployment ID       = 296fb9ef
Deployment Health   = healthy

Task "wds" is "running"
Task Resources
CPU         Memory           Disk     IOPS  Addresses
17/250 MHz  115 MiB/512 MiB  300 MiB  0     http: 172.18.0.52:28934

Task Events:
Started At     = 10/24/17 19:25:55 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type        Description
10/24/17 19:25:55 UTC  Started     Task started by client
10/24/17 19:25:47 UTC  Driver      Downloading image pagerduty-docker.jfrog.io/webhook_delivery_service:1f8c2c9613cdf162670b16c84bd6de12d85417fa
10/24/17 19:25:44 UTC  Task Setup  Building Task Directory
10/24/17 19:25:44 UTC  Received    Task received by client

!!!! rlafferty@prod-clustermgr-master10:~ $ nomad alloc-status d22ba86b
ID                  = d22ba86b
Eval ID             = 9006baf2
Name                = webhook_delivery_service.prod-wds[1]
Node ID             = 13c82b37
Job ID              = webhook_delivery_service
Job Version         = 125090
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 10/24/17 19:25:24 UTC
Deployment ID       = 296fb9ef
Deployment Health   = unset

Task "wds" is "running"
Task Resources
CPU         Memory          Disk     IOPS  Addresses
21/250 MHz  90 MiB/512 MiB  300 MiB  0     http: 172.19.128.18:23160

Task Events:
Started At     = 11/08/17 03:13:43 UTC
Finished At    = N/A
Total Restarts = 1
Last Restart   = 11/08/17 03:13:10 UTC

Recent Events:
Time                   Type        Description
11/08/17 03:13:43 UTC  Started     Task started by client
11/08/17 03:13:10 UTC  Restarting  Task restarting in 32.712091334s
11/08/17 03:13:10 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
10/24/17 19:25:33 UTC  Started     Task started by client
10/24/17 19:25:27 UTC  Driver      Downloading image pagerduty-docker.jfrog.io/webhook_delivery_service:1f8c2c9613cdf162670b16c84bd6de12d85417fa
10/24/17 19:25:24 UTC  Task Setup  Building Task Directory
10/24/17 19:25:24 UTC  Received    Task received by client

Interesting log entries

There isn't a lot of logging that's obviously connected to these allocations/deployments, but one thing that did catch my eye is this, repeated from clients whose allocations' health is in unset state:

Nov 08 16:44:08 prod-clustermgr-client34 nomad[2045]: 2017/11/08 16:44:08.654482 [WARN] client: failed to broadcast update to allocation "e5d4f32e-ce51-fbb7-bed3-d460770f01a2"

During the initial incident that led to this state, we saw tons of "failed to broadcast update to allocation" messages as well as lots of

[ERR] client: dropping update to alloc 'd4da04f2-4d8c-c4fa-22ff-8f737e52220d'

messages, all referring to allocations associated with this job.
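
To see which allocations these messages point at, the client logs can be grouped by allocation ID. This is a rough sketch that assumes the two message formats quoted above and full 36-character allocation IDs. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches both message variants quoted above and captures the allocation ID.
ALLOC_UPDATE_RE = re.compile(
    r"(failed to broadcast update to allocation|dropping update to alloc)"
    r"\s+[\"']([0-9a-f-]{36})[\"']"
)


def count_update_failures(log_lines):
    """Count failed or dropped allocation updates per allocation ID."""
    counts = Counter()
    for line in log_lines:
        match = ALLOC_UPDATE_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


SAMPLE_LOG = [
    'Nov 08 16:44:08 prod-clustermgr-client34 nomad[2045]: 2017/11/08 16:44:08.654482 '
    '[WARN] client: failed to broadcast update to allocation "e5d4f32e-ce51-fbb7-bed3-d460770f01a2"',
    "[ERR] client: dropping update to alloc 'd4da04f2-4d8c-c4fa-22ff-8f737e52220d'",
    "an unrelated log line",
]

for alloc_id, n in count_update_failures(SAMPLE_LOG).most_common():
    print(alloc_id, n)
```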

Job file

job "webhook_delivery_service" {
  type = "service"

  datacenters = ["us-west-2", "us-west-1"]

  update {
    max_parallel = 1
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    stagger = "30s"
    auto_revert = true
  }

  group "prod-wds" {
    count = 3

    task "wds" {
      driver = "docker"

      config {
        image = "pagerduty-docker.jfrog.io/webhook_delivery_service:${DEPLOY_SUB___VERSION}"

        port_map {
          http = 10006
        }

        logging {
          type = "journald"

          config {
            tag = "${NOMAD_META_SPLUNK_INDEX}:${NOMAD_ALLOC_NAME}.${NOMAD_TASK_NAME}.${NOMAD_ALLOC_ID}"
          }
        }
      }

      service {
        name = "webhook-delivery-service"
        tags = ["prod-webhook_delivery_service"]
        port = "http"

        check {
          type     = "http"
          port     = "http"
          protocol = "http"
          path     = "/health"
          interval = "5s"
          timeout  = "2s"
        }
      }

      vault {
        policies = ["webhook-delivery-service"]
      }

      resources {
        cpu    = 250
        memory = 512

        network {
          mbits = 100

          port "http" {}
        }
      }

      env {
        MIX_ENV        = "prod"
        STATSD_HOST    = "${attr.driver.docker.bridge_ip}"
        WDS_KAFKA_HOST = "prod-bitpipe.kafka.service.consul"
        WEB_DOMAIN     = "pagerduty"
      }

      meta {
        SPLUNK_INDEX = "wds"
      }
    }

    restart {
      interval = "5m"
      attempts = 3
      delay    = "30s"
    }
  }
}
@discobean

I have seen this as well. I can only suggest upgrading to 0.6.3, which has much more stable deployments; I noticed this issue was resolved somewhere along the way.

@dadgar
Contributor

dadgar commented Nov 9, 2017

Hey, this has been fixed by #3496 and will land in 0.7.1.

dadgar closed this as completed Nov 9, 2017