Nomad incorrectly marking unhealthy allocs as healthy during rolling upgrade #7320

Closed
dpn opened this issue Mar 11, 2020 · 10 comments · Fixed by #7383

Comments

@dpn

dpn commented Mar 11, 2020

Nomad version

Found on

$ nomad version
Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)

Also reproduces on these versions in our test clusters:

$ nomad version
Nomad v0.9.7 (0e0eb07c53f99f54bcdb2e69aa8a9690a0597e7a)
$ nomad version
Nomad v0.10.4 (f750636ca68e17dcd2445c1ab9c5a34f9ac69345)

Operating system and Environment details

Originally found in AWS:

  • 3x Nomad Servers (v0.9.6)
  • On the order of 100s of Nomad clients
  • Consul v1.6.3

Reproduced on colocated hardware:

  • 3x Nomad servers (v0.9.7 and v0.10.4)
  • On the order of 100s of Nomad clients
  • Consul v1.6.3

Issue

The issue was discovered when one of our engineers pushed out a deployment where the replacement allocs were failing their healthchecks due to improperly configured Security Groups in AWS, yet Nomad continued to replace the healthy allocs with unhealthy ones until the entire service was down.

From the repro steps, it appears that Nomad considers these replacement allocations healthy when they are not. This seems to be triggered when the replacement alloc is restarted by the service's CheckRestart stanza. Note also that this does not reproduce with a single-task job; multiple tasks are required for this behavior.

I'm not seeing any issues with the config that would lead to this behavior, but it's entirely possible I've overlooked something.

Reproduction steps

  1. Submit stable job

    $ curl -v -X PUT -H "X-Nomad-Token: $NOMAD_TOKEN" -d @test-nomad-rolling-upgrade-ok.json https://$HOSTNAME:4646/v1/job/test-nomad-rolling-deployments
    
  2. Wait for initial deployment to succeed:

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:07:36Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       0         0
    
    Latest Deployment
    ID          = b0343364
    Status      = successful
    Description = Deployment completed successfully
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        3       3        0          2020-03-11T11:24:02-06:00
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status   Created  Modified
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        run      running  38s ago  12s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        run      running  38s ago  17s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        run      running  38s ago  18s ago
    
  3. Modify the job file to tweak the CPU allocation (this forces a new deployment, simulating a Docker image version bump) and break the healthcheck on one of the tasks by changing its healthcheck path:

    $ diff test-nomad-rolling-upgrade-ok.json test-nomad-rolling-upgrade-not-ok.json
    18c18
    <                             "CPU": 48,
    ---
    >                             "CPU": 24,
    42c42
    <                                         "Path": "/healthcheck-ok",
    ---
    >                                         "Path": "/healthcheck-not-ok",
    
  4. Submit the updated job

    $ curl -v -X PUT -H "X-Nomad-Token: $NOMAD_TOKEN" -d @test-nomad-rolling-upgrade-not-ok.json  https://$HOSTNAME:4646/v1/job/test-nomad-rolling-deployments
    
  5. Deployment begins by creating a new alloc

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:17:33Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       1         0
    
    Latest Deployment
    ID          = d76e2a0e
    Status      = running
    Description = Deployment is running
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        1       0        0          2020-03-11T17:33:34Z
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status    Created     Modified
    cf26ce6c  c38e9054  test-nomad-rolling-deployments  1        run      running   24s ago     13s ago
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        stop     complete  10m22s ago  18s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        run      running   10m22s ago  10m1s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        run      running   10m22s ago  10m2s ago
    
  6. CheckRestart stanza takes effect, restarting the new alloc:

    ID                  = cf26ce6c
    Eval ID             = 187a10ba
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[0]
    Node ID             = c38e9054
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824638703088
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 1m49s ago
    Modified            = 23s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.26.15:23739
    
    Task Events:
    Started At     = 2020-03-11T17:18:59Z
    Finished At    = N/A
    Total Restarts = 1
    Last Restart   = 2020-03-11T11:18:41-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:18:59Z  Started           Task started by client
    2020-03-11T17:18:57Z  Driver            Downloading image
    2020-03-11T17:18:41Z  Restarting        Task restarting in 16.008336604s
    2020-03-11T17:18:41Z  Terminated        Exit Code: 0
    2020-03-11T17:18:35Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:17:44Z  Started           Task started by client
    2020-03-11T17:17:40Z  Driver            Downloading image
    2020-03-11T17:17:40Z  Task Setup        Building Task Directory
    2020-03-11T17:17:34Z  Received          Task received by client
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.26.15:29470
    
    Task Events:
    Started At     = 2020-03-11T17:17:44Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:17:44Z  Started     Task started by client
    2020-03-11T17:17:40Z  Driver      Downloading image
    2020-03-11T17:17:40Z  Task Setup  Building Task Directory
    2020-03-11T17:17:34Z  Received    Task received by client
    
  7. Nomad schedules a new allocation with the new job spec and tears down one of the old allocations, essentially continuing the deployment even though the healthchecks on the new allocs are still unhealthy. This is the behavior we're confused about:

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:17:33Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       2         0
    
    Latest Deployment
    ID          = d76e2a0e
    Status      = running
    Description = Deployment is running
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        2       1        0          2020-03-11T11:34:46-06:00
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status    Created     Modified
    116e086b  cced5c02  test-nomad-rolling-deployments  1        run      running   32s ago     19s ago
    cf26ce6c  c38e9054  test-nomad-rolling-deployments  1        run      running   1m46s ago   20s ago
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        stop     complete  11m44s ago  1m40s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        stop     complete  11m44s ago  26s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        run      running   11m44s ago  11m24s ago
    
  8. This continues until all healthy allocs are gone, replaced by unhealthy ones (although Nomad incorrectly thinks they're healthy):

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:17:33Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       3         0
    
    Latest Deployment
    ID          = d76e2a0e
    Status      = running
    Description = Deployment is running
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        3       2        0          2020-03-11T11:36:02-06:00
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status    Created     Modified
    1f4c25c1  f6f3bea0  test-nomad-rolling-deployments  1        run      running   52s ago     39s ago
    116e086b  cced5c02  test-nomad-rolling-deployments  1        run      running   2m8s ago    38s ago
    cf26ce6c  c38e9054  test-nomad-rolling-deployments  1        run      running   3m22s ago   40s ago
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        stop     complete  13m20s ago  3m16s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        stop     complete  13m20s ago  2m2s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        stop     complete  13m20s ago  46s ago
    

    State of new allocs after deployment completes:

    ID                  = 1f4c25c1
    Eval ID             = 9ac16764
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[2]
    Node ID             = f6f3bea0
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824637845280
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 1m56s ago
    Modified            = 28s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.22.187:21039
    
    Task Events:
    Started At     = 2020-03-11T17:21:31Z
    Finished At    = N/A
    Total Restarts = 1
    Last Restart   = 2020-03-11T11:21:12-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:21:31Z  Started           Task started by client
    2020-03-11T17:21:29Z  Driver            Downloading image
    2020-03-11T17:21:12Z  Restarting        Task restarting in 17.009193127s
    2020-03-11T17:21:12Z  Terminated        Exit Code: 0
    2020-03-11T17:21:07Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:20:16Z  Started           Task started by client
    2020-03-11T17:20:09Z  Driver            Downloading image
    2020-03-11T17:20:09Z  Task Setup        Building Task Directory
    2020-03-11T17:20:03Z  Received          Task received by client
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.22.187:27842
    
    Task Events:
    Started At     = 2020-03-11T17:20:16Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:20:16Z  Started     Task started by client
    2020-03-11T17:20:09Z  Driver      Downloading image
    2020-03-11T17:20:09Z  Task Setup  Building Task Directory
    2020-03-11T17:20:03Z  Received    Task received by client
    
    ID                  = 116e086b
    Eval ID             = 515b5394
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[1]
    Node ID             = cced5c02
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824635642096
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 3m18s ago
    Modified            = 31s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.26.52:27202
    
    Task Events:
    Started At     = 2020-03-11T17:21:34Z
    Finished At    = N/A
    Total Restarts = 2
    Last Restart   = 2020-03-11T11:21:15-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:21:34Z  Started           Task started by client
    2020-03-11T17:21:32Z  Driver            Downloading image
    2020-03-11T17:21:15Z  Restarting        Task restarting in 17.237065933s
    2020-03-11T17:21:15Z  Terminated        Exit Code: 0
    2020-03-11T17:21:09Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:20:17Z  Started           Task started by client
    2020-03-11T17:20:15Z  Driver            Downloading image
    2020-03-11T17:19:57Z  Restarting        Task restarting in 18.136251856s
    2020-03-11T17:19:57Z  Terminated        Exit Code: 0
    2020-03-11T17:19:51Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.26.52:31300
    
    Task Events:
    Started At     = 2020-03-11T17:19:00Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:19:00Z  Started     Task started by client
    2020-03-11T17:18:54Z  Driver      Downloading image
    2020-03-11T17:18:54Z  Task Setup  Building Task Directory
    2020-03-11T17:18:47Z  Received    Task received by client
    
    ID                  = cf26ce6c
    Eval ID             = 187a10ba
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[0]
    Node ID             = c38e9054
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824635662768
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 4m36s ago
    Modified            = 39s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.26.15:23739
    
    Task Events:
    Started At     = 2020-03-11T17:21:30Z
    Finished At    = N/A
    Total Restarts = 3
    Last Restart   = 2020-03-11T11:21:12-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:21:30Z  Started           Task started by client
    2020-03-11T17:21:28Z  Driver            Downloading image
    2020-03-11T17:21:12Z  Restarting        Task restarting in 16.057916863s
    2020-03-11T17:21:12Z  Terminated        Exit Code: 0
    2020-03-11T17:21:07Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:20:15Z  Started           Task started by client
    2020-03-11T17:20:13Z  Driver            Downloading image
    2020-03-11T17:19:56Z  Restarting        Task restarting in 17.64927063s
    2020-03-11T17:19:56Z  Terminated        Exit Code: 0
    2020-03-11T17:19:50Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.26.15:29470
    
    Task Events:
    Started At     = 2020-03-11T17:17:44Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:17:44Z  Started     Task started by client
    2020-03-11T17:17:40Z  Driver      Downloading image
    2020-03-11T17:17:40Z  Task Setup  Building Task Directory
    2020-03-11T17:17:34Z  Received    Task received by client
    

    Final state of deployment

    ID          = d76e2a0e
    Job ID      = test-nomad-rolling-deployments
    Job Version = 1
    Status      = successful
    Description = Deployment completed successfully
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        3       3        0          2020-03-11T11:37:17-06:00
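
The contradiction in the output above (an alloc reported as `Deployment Health = healthy` while its "main" task shows a `Restart Signaled` event for an unhealthy check) could be detected mechanically. The following is only a sketch: the field names `DeploymentStatus`, `Healthy`, `TaskStates`, `Events` and `Type` are my assumption of how an allocation looks as JSON from the HTTP API, and the sample dict is hand-written to mirror the CLI output for alloc cf26ce6c rather than captured from a real cluster.

```python
def healthy_despite_check_restart(alloc: dict) -> bool:
    """True if the alloc is marked healthy although a task was restarted by a failing check."""
    marked_healthy = (alloc.get("DeploymentStatus") or {}).get("Healthy") is True
    restarted_by_check = any(
        event.get("Type") == "Restart Signaled"
        for state in (alloc.get("TaskStates") or {}).values()
        for event in (state.get("Events") or [])
    )
    return marked_healthy and restarted_by_check


# Hand-written sample mirroring the CLI output above (assumed JSON shape).
sample_alloc = {
    "ID": "cf26ce6c",
    "DeploymentStatus": {"Healthy": True},
    "TaskStates": {
        "main": {
            "State": "running",
            "Events": [
                {"Type": "Started"},
                {"Type": "Restart Signaled"},
                {"Type": "Terminated"},
                {"Type": "Restarting"},
                {"Type": "Started"},
            ],
        },
        "secondary": {"State": "running", "Events": [{"Type": "Started"}]},
    },
}

if healthy_despite_check_restart(sample_alloc):
    print(f"alloc {sample_alloc['ID']}: marked healthy despite a check-triggered restart")
```

A restart earlier in an alloc's life is not wrong by itself, so a hit here only marks the alloc as worth a closer look during a deployment.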
    

Job file (if appropriate)

{
    "Job": {
        "ID": "test-nomad-rolling-deployments",
        "Name": "test-nomad-rolling-deployments",
        "Type": "service",
        "Priority": 50,
        "Region": "a-region",
        "DataCenters": [
            "a-dc"
        ],
        "TaskGroups": [
            {
                "Count": 3,
                "Tasks": [
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "main_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "main_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 30000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-main"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "main_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "main"
                    },
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "secondary_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "secondary_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 180000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-secondary"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "secondary_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "secondary"
                    }
                ],
                "RestartPolicy": {
                    "Attempts": 3,
                    "Delay": 15000000000,
                    "Interval": 180000000000,
                    "Mode": "fail"
                },
                "Update": {
                    "MaxParallel": 1,
                    "AutoRevert": true,
                    "HealthCheck": "checks",
                    "ProgressDeadline": 960000000000,
                    "HealthyDeadline": 900000000000,
                    "MinHealthyTime": 10000000000,
                    "Stagger": 10000000000
                },
                "Name": "test-nomad-rolling-deployments"
            }
        ]
    }
}

I've left off other logs as I think the repro steps are sufficient and this reproduces 100% of the time in our setup, but happy to gather some if necessary.
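
For anyone reading the job file: the durations in the JSON are integer nanoseconds, which makes the timings hard to compare at a glance. A small sketch that prints the relevant values in readable form; the helper name `ns_to_human` and the labels in `timings` are made up for illustration, while the numbers are copied from the job file.

```python
# Durations in Nomad's JSON job format are integer nanoseconds.
NS_PER_SECOND = 1_000_000_000


def ns_to_human(ns: int) -> str:
    """Render a nanosecond duration as minutes and seconds, e.g. '16m0s'."""
    minutes, seconds = divmod(ns / NS_PER_SECOND, 60)
    if minutes:
        return f"{int(minutes)}m{seconds:g}s"
    return f"{seconds:g}s"


# Values copied from the job file above; the labels are illustrative only.
timings = {
    "check Interval": 10_000_000_000,
    "check Timeout": 5_000_000_000,
    "main CheckRestart.Grace": 30_000_000_000,
    "secondary CheckRestart.Grace": 180_000_000_000,
    "RestartPolicy.Delay": 15_000_000_000,
    "Update.MinHealthyTime": 10_000_000_000,
    "Update.HealthyDeadline": 900_000_000_000,
    "Update.ProgressDeadline": 960_000_000_000,
}

for name, ns in timings.items():
    print(f"{name:30} {ns_to_human(ns)}")
```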

@djenriquez

djenriquez commented Mar 12, 2020

We have also seen this problem in our testing with v0.10.4. We don't fully understand the situation, but it's definitely a problem: currently, no deployment ever fails because of health checks.

We have a job whose task will never get healthy, yet for some reason, Nomad always passes the deployment. Consul properly reports the task as unhealthy.

Another big problem related to this is that check_restart seems to trigger a restart successfully only once. We've replicated this behavior a few times now: check_restart does its job by triggering the first restart, and then nothing. After an allocation has been restarted the first time, check_restart seems unable to trigger the restart policy again.

We can mitigate this by setting the restart policy attempts to 0 with mode: fail, which always forces a reschedule on the first failure. However, this points to another problematic scenario that we cannot confirm: if a healthy task suddenly goes unhealthy, does the check_restart policy take effect?

During testing, we set the task's check's check_restart to grace:0, limit:3, with a check interval of 10s. This successfully produces the first restart after 20s (the 1st check seems to happen immediately, followed by 2 more checks 10s apart).
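
The 20s figure follows from simple arithmetic, sketched below as a rough model. It assumes the check fails continuously and that the first counted failure lands as soon as the grace period ends; the function name is hypothetical and this is not how Nomad computes anything internally.

```python
def seconds_until_check_restart(grace_s: float, limit: int, interval_s: float) -> float:
    """Approximate time from task start until check_restart signals a restart.

    Rough model: the first failing check is counted right after the grace
    period, and each further failure arrives one check interval later.
    """
    return grace_s + (limit - 1) * interval_s


# grace:0, limit:3, interval 10s, as in the test described above
print(seconds_until_check_restart(0, 3, 10))
# "main" task in the original report: Grace 30s, Limit 3, Interval 10s
print(seconds_until_check_restart(30, 3, 10))
```

For the original report the model gives 50s, which is close to the roughly 51s between "Started" (17:17:44) and "Restart Signaled" (17:18:35) in the task events posted there.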

Solid report @dpn, this regression seems pretty dire.

Thank you @kainoaseto and @tydomitrovich for finding and helping to troubleshoot.

@djenriquez

djenriquez commented Mar 12, 2020

Hi guys, apologies if I'm inflating the priority for this issue, but it seems pretty serious that we cannot depend on health checks of allocations during deployments.

Could we get confirmation that this issue has been acknowledged and is being prioritized (hopefully on the higher side)?

@tgross @drewbailey @dadgar ?

@notnoop
Contributor

notnoop commented Mar 13, 2020

@dpn @djenriquez This seems very bad indeed. I'll be investigating this now and will post updates when I get an understanding of the underlying issue and if there are any mitigating factors. Thank you very much for the detailed and clear reproducibility steps.

@notnoop notnoop self-assigned this Mar 13, 2020
@notnoop
Contributor

notnoop commented Mar 16, 2020

Thanks again for the issue. It's indeed very serious: it affects virtually all deployments, in Nomad versions as old as 0.8.0 and, I believe, earlier ones too.

It affects deployments where min_healthy_time is less than the restart delay. While the task is being restarted, the Nomad client may consider it healthy!

One workaround is to increase min_healthy_time to be higher than possible restart delays.

I'm working on the fix and aim to have it ready later this week.
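
Until the fix lands, job specs can be screened for the condition described above. A minimal sketch, assuming job JSON shaped like the job file in the original report (`TaskGroups`, `Update.MinHealthyTime`, `RestartPolicy.Delay`, all in nanoseconds). The 25% padding is an assumption: the task events in this issue show restarts after 16 to 18s for a configured 15s delay, so the raw delay looks too optimistic. The function name is hypothetical, and the sketch ignores other sources of delay such as image pulls.

```python
RESTART_DELAY_PADDING = 0.25  # assumption: headroom for jitter on top of RestartPolicy.Delay


def groups_at_risk(job: dict) -> list:
    """Return names of task groups whose MinHealthyTime does not exceed the padded restart delay."""
    flagged = []
    for group in job.get("TaskGroups") or []:
        min_healthy_ns = (group.get("Update") or {}).get("MinHealthyTime", 0)
        delay_ns = (group.get("RestartPolicy") or {}).get("Delay", 0)
        if min_healthy_ns <= delay_ns * (1 + RESTART_DELAY_PADDING):
            flagged.append(group.get("Name"))
    return flagged


# Values from the job file in the original report: MinHealthyTime 10s, Delay 15s.
reported_job = {
    "TaskGroups": [
        {
            "Name": "test-nomad-rolling-deployments",
            "RestartPolicy": {"Delay": 15_000_000_000},
            "Update": {"MinHealthyTime": 10_000_000_000},
        }
    ]
}

# Same group with MinHealthyTime raised well above the restart delay.
adjusted_job = {
    "TaskGroups": [
        {
            "Name": "test-nomad-rolling-deployments",
            "RestartPolicy": {"Delay": 15_000_000_000},
            "Update": {"MinHealthyTime": 60_000_000_000},
        }
    ]
}

print(groups_at_risk(reported_job))
print(groups_at_risk(adjusted_job))
```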

@dpn
Author

dpn commented Mar 16, 2020

Thanks @notnoop, really appreciate you digging into this. Do you think this will be backported to the 0.9 and 0.10 series of releases? I know we're lagging behind by being on 0.9 but we'll be finishing up our 0.10 validation soon and plan to migrate over once that's complete.

@kainoaseto

Thank you @notnoop for looking into this and for the workaround in the meantime! I will look at implementing that workaround in our jobs for our 0.10 clusters to mitigate this bug, and will watch for the fix later this week.

@kainoaseto

Hi @notnoop and anyone else who runs into this before the fix is released in 0.11.0: I tested the mitigation by changing the Restart.Delay to be < min_healthy_time, as suggested, and was able to:

  • have allocations fail during deployments from health checks failing
  • have allocations fail and reschedule from health checks failing

Thanks for the workaround!

Below is some sample configuration in case anyone else runs into the same thing:

All at the taskgroup level:

      "ReschedulePolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "DelayFunction": "exponential",
        "Interval": 0,
        "MaxDelay": 60000000000,
        "Unlimited": true
      },
      "RestartPolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "Interval": 1800000000000,
        "Mode": "fail"
      },
      "Services": [
        {
          "AddressMode": "auto",
          "Checks": [
            {
              "AddressMode": "",
              "CheckRestart": {
                "Grace": 10000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "InitialStatus": "warning",
              "Interval": 10000000000,
              "Method": "GET",
              "Name": "healthy",
              "Path": "/healthcheck",
              "PortLabel": "my-service",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 5000000000,
              "Type": "http"
            }
          ],
    .
    .
    .
      "Update": {
        "AutoPromote": false,
        "AutoRevert": true,
        "Canary": 0,
        "HealthCheck": "checks",
        "HealthyDeadline": 300000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 200000000000,
        "ProgressDeadline": 600000000000,
        "Stagger": 30000000000
      },

@dpn
Author

dpn commented Mar 28, 2020

Thanks @notnoop for the quick fix!

@Laboltus

I'm experiencing the same behavior with 0.11.3. Nomad does not wait until the current allocs become healthy before restarting the next ones.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2022