Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad alloc stuck in pending state on a new job #9375

Closed
ghost opened this issue Nov 17, 2020 · 20 comments
Closed

Nomad alloc stuck in pending state on a new job #9375

ghost opened this issue Nov 17, 2020 · 20 comments

Comments

@ghost
Copy link

ghost commented Nov 17, 2020

Nomad alloc stuck in pending state

Nomad version

1.0.0-beta3

Operating system and Environment details

Amazon Linux 2
Linux ip-172-31-3-147.service.consul 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

when i deploy a new job the old alloc are not getting removed. i see the same issue on check_restart not restarting the jobs correctly.

ID                  = ae7381fd-7190-1b78-ef40-d9b1a0742849
Eval ID             = ca83735c
Name                = traefik.traefik[0]
Node ID             = 19d191da
Node Name           = ip-172-31-3-147.service.consul
Job ID              = traefik
Job Version         = 12
Client Status       = pending
Client Description  = No tasks have started
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 49m39s ago
Modified            = 4m41s ago

Task "filebeat-sidecar" (prestart sidecar) is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  

Task Events:
Started At     = N/A
Finished At    = 2020-11-17T12:05:58Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type      Description
2020-11-17T12:05:58Z  Received  Task received by client
2020-11-17T11:21:00Z  Received  Task received by client

Task "traefik" is "pending"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  128 MiB  300 MiB  http: 172.31.3.147:8888
                           api: 172.31.3.147:8081

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type      Description
2020-11-17T12:05:58Z  Received  Task received by client
2020-11-17T11:44:27Z  Killing   Sent interrupt. Waiting 5s before force killing
2020-11-17T11:21:00Z  Received  Task received by client

Reproduction steps

Job file (if appropriate)

job "traefik" {
  region      = "global"
  datacenters = ["development"]
  type        = "service"

  vault {
    policies = [
      "reader",
    ]

    env = "true"
  }

  group "traefik" {
    count = 1

    task "traefik" {
      leader = true
      driver = "docker"

      template {
        change_mode = "restart"

        data = <<EOH
{{with secret "secret/dev"}}{{.Data.data.ca}}{{end}}
EOH

        destination = "local/ca.pem"
      }

      template {
        change_mode = "restart"

        data = <<EOH
{{with secret "secret/dev"}}{{.Data.data.crt}}{{end}}
EOH

        destination = "local/crt.pem"
      }

      template {
        change_mode = "restart"

        data = <<EOH
{{with secret "secret/dev"}}{{.Data.data.key}}{{end}}
EOH

        destination = "local/key.pem"
      }

      template {
        data = <<EOF

logLevel = "INFO"

[accessLog]
format = "json"

[entryPoints]
  [entryPoints.https]
  address = ":8888"

  [entryPoints.traefik]
  address = ":8081"        

[api]
dashboard = true
insecure  = true

[ping]
entryPoint = "traefik"

# Enable Consul Catalog configuration backend.
[providers.consulCatalog]
prefix           = "traefik"
exposedByDefault = false

  [providers.consulCatalog.endpoint]
  address = "127.0.0.1:8501"
  scheme  = "https"
  datacenter = "airside-stage"
  token = "{{with secret "secret/token"}}{{.Data.data.key}}{{end}}"

    [providers.consulCatalog.endpoint.tls]
    ca   = "local/ca.pem"
    cert = "local/crt.pem"
    key  = "local/key.pem"
    insecureSkipVerify = true

[tls.certificates]
  certFile = "local/crt.pem"
  keyFile = "local/key.pem"
  [tls.options.default]
    minVersion = "VersionTLS12"
    maxVersion = "VersionTLS12"
    cipherSuites = [
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
    ]

EOF

        destination = "local/traefik.toml"
      }

      config {
        image        = "traefik:v2.3.2"
        network_mode = "host"


        volumes = [
          "local/traefik.toml:/etc/traefik/traefik.toml",
          "local/ca.pem:/opt/prometheus/ca.pem",
          "local/crt.pem:/opt/prometheus/crt.pem",
          "local/key.pem:/opt/prometheus/key.pem",
        ]

        dns_servers = [
          "${attr.unique.network.ip-address}",
        ]
      }

      resources {
        cpu    = 100
        memory = 128

        network {
          mbits = 10

          port "http" {
            static = 8888
          }

          port "api" {
            static = 8081
          }
        }
      }

      service {
        name = "traefik"

        check {
          name     = "alive"
          type     = "tcp"
          port     = "http"
          interval = "10s"
          timeout  = "2s"
        }
      }
    },
    task "filebeat-sidecar" {
      driver = "docker"
      
      template {
        change_mode = "restart"

        data = <<EOH
http.enabled: true
http.port: 5067
monitoring.enabled: false

filebeat.inputs:
- type: log
  paths: 
    - /alloc/logs/${NOMAD_GROUP_NAME}.stderr.[0-9]*
    - /alloc/logs/${NOMAD_GROUP_NAME}.stdout.[0-9]*

processors:
- decode_json_fields:
    fields: ["message"]
    target: "json"
    overwrite_keys: true

output.logstash:
  hosts: ["filebeat-log.service.consul:5044"]
  codec.json:
    pretty: true
    escape_html: false
EOH

        destination = "local/filebeat.yml"
      }

      config {
        image = "redacted/base-images/nomad-filebeat:7.9.3"

        dns_servers = [
          "${attr.unique.network.ip-address}",
        ]
        args    = ["-e"] 
        volumes = [
          "local/filebeat.yml:/usr/share/filebeat/filebeat.yml",
        ]
      }
    }
  }
}

Nomad Client logs (if appropriate)

Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.alloc_runner.task_runner: task run loop exiting: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.gc: marking allocation for GC: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.208Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.237Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.242Z [INFO]  client.gc: marking allocation for GC: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.266Z [INFO]  client.gc: garbage collecting allocation: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 reason="forced collection"
Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.gc: garbage collecting allocation: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 reason="forced collection"
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.351Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=traefik
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.355Z [DEBUG] client.gc: alloc garbage collected: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849
Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.alloc_runner.task_runner: task run loop exiting: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=traefik
Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.gc: alloc garbage collected: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.419Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.422Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:05:58 ip-172-31-3-147 nomad[8238]: client.alloc_runner.task_runner: task run loop exiting: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 task=filebeat-sidecar
Nov 17 12:10:39 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:10:39.765Z [DEBUG] http: request complete: method=GET path=/v1/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849?namespace=default duration=5.977726ms
Nov 17 12:10:39 ip-172-31-3-147 nomad[8238]: http: request complete: method=GET path=/v1/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849?namespace=default duration=5.977726ms
Nov 17 12:10:39 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:10:39.774Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849/stats duration=1.497686ms
Nov 17 12:10:39 ip-172-31-3-147 nomad[8238]: http: request complete: method=GET path=/v1/client/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849/stats duration=1.497686ms
Nov 17 12:13:04 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:13:04.968Z [DEBUG] http: request complete: method=GET path=/v1/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849?index=658955 duration=5m8.723821704s
Nov 17 12:13:04 ip-172-31-3-147 nomad[8238]: http: request complete: method=GET path=/v1/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849?index=658955 duration=5m8.723821704s
Nov 17 12:16:19 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:16:19.472Z [DEBUG] http: request complete: method=GET path=/v1/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849?namespace=default duration=1.491316ms
Nov 17 12:16:19 ip-172-31-3-147 nomad[8238]: http: request complete: method=GET path=/v1/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849?namespace=default duration=1.491316ms
Nov 17 12:16:19 ip-172-31-3-147 nomad[8238]: http: request complete: method=GET path=/v1/client/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849/stats duration=1.460432ms
Nov 17 12:16:19 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:16:19.480Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/ae7381fd-7190-1b78-ef40-d9b1a0742849/stats duration=1.460432ms
@ghost
Copy link
Author

ghost commented Nov 19, 2020

here is another job that nomad seams to not be able to restart or kill if the task gets a "Restart Signaled | Template with change_mode restart re-rendered"

Screen Shot 2020-11-19 at 9 14 56 AM

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "share",
  "ParentID": "",
  "Name": "share",
  "Type": "service",
  "Priority": 100,
  "AllAtOnce": false,
  "Datacenters": [
    "airside-stage"
  ],
  "Constraints": [
    {
      "LTarget": "${meta.environment}",
      "RTarget": "nomad",
      "Operand": "="
    }
  ],
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "share",
      "Count": 2,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 30000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": true,
        "AutoPromote": true,
        "Canary": 1
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": [
        {
          "LTarget": "${attr.vault.version}",
          "RTarget": ">= 0.6.1",
          "Operand": "semver"
        }
      ],
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 2,
        "Interval": 1800000000000,
        "Delay": 60000000000,
        "Mode": "fail"
      },
      "Tasks": [
        {
          "Name": "share",
          "Driver": "docker",
          "User": "",
          "Config": {
            "force_pull": true,
            "image": "redacted",
            "port_map": [
              {
                "http": 8080
              }
            ],
            "dns_servers": [
              "${attr.unique.network.ip-address}"
            ]
          },
          "Env": null,
          "Services": [
            {
              "Name": "share",
              "TaskName": "",
              "PortLabel": "http",
              "AddressMode": "auto",
              "EnableTagOverride": false,
              "Tags": [
                "share",
                "urlprefix-/redacted"
              ],
              "CanaryTags": null,
              "Checks": [
                {
                  "Name": "service: \"share\" check",
                  "Type": "http",
                  "Command": "",
                  "Args": null,
                  "Path": "redacted",
                  "Protocol": "",
                  "PortLabel": "",
                  "Expose": false,
                  "AddressMode": "",
                  "Interval": 30000000000,
                  "Timeout": 5000000000,
                  "InitialStatus": "",
                  "TLSSkipVerify": false,
                  "Method": "",
                  "Header": null,
                  "CheckRestart": {
                    "Limit": 3,
                    "Grace": 90000000000,
                    "IgnoreWarnings": false
                  },
                  "GRPCService": "",
                  "GRPCUseTLS": false,
                  "TaskName": "",
                  "SuccessBeforePassing": 0,
                  "FailuresBeforeCritical": 0
                }
              ],
              "Connect": null,
              "Meta": null,
              "CanaryMeta": null
            }
          ],
          "Vault": {
            "Policies": [
              "share-service"
            ],
            "Namespace": "",
            "Env": true,
            "ChangeMode": "restart",
            "ChangeSignal": "SIGHUP"
          },
          "Templates": [
            {
              "SourcePath": "",
              "DestPath": "secrets/config.env",
              "EmbeddedTmpl": "# test 3\nCONSUL_HTTP_ADDR=consul.service.consul:8500\nCONSUL_HTTP_TOKEN=\"{{with secret \"consul/creds/airside-app\"}}{{.Data.token}}{{end}}\"\nVAULT_SKIP_VERIFY=false\nVAULT_CACERT=/ca.pem\nVAULT_ADDR=https://vault.service.consul:8200\nBUILDENV_ENV_TEMPLATE_FILE=env/stage.env.template\n\n",
              "ChangeMode": "restart",
              "ChangeSignal": "",
              "Splay": 180000000000,
              "Perms": "0644",
              "LeftDelim": "{{",
              "RightDelim": "}}",
              "Envvars": true,
              "VaultGrace": 0
            }
          ],
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "MemoryMB": 768,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": [
              {
                "Mode": "",
                "Device": "",
                "CIDR": "",
                "IP": "",
                "MBits": 1,
                "DNS": null,
                "ReservedPorts": null,
                "DynamicPorts": [
                  {
                    "Label": "http",
                    "Value": 0,
                    "To": 0,
                    "HostNetwork": "default"
                  }
                ]
              }
            ],
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 2,
            "Interval": 1800000000000,
            "Delay": 60000000000,
            "Mode": "fail"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 120000000000,
        "Unlimited": true
      },
      "Affinities": [
        {
          "LTarget": "${platform.aws.instance-type}",
          "RTarget": "t3.xlarge",
          "Operand": "=",
          "Weight": 100
        }
      ],
      "Spreads": null,
      "Networks": null,
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "Meta": null,
  "ConsulToken": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "aa4f468a-379c-ec32-0c4b-3b32216ad675",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": false,
  "Version": 4,
  "SubmitTime": 1605793813839163400,
  "CreateIndex": 758455,
  "ModifyIndex": 767580,
  "JobModifyIndex": 767580
}
$ nomad job status share
ID            = share
Name          = share
Submit Date   = 2020-11-19T08:50:13-05:00
Type          = service
Priority      = 100
Datacenters   = airside-stage
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
share       2       1         4        0       1         0

Latest Deployment
ID          = 2345560c
Status      = running
Description = Deployment is running

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
share       true         2        0       0        0          N/A

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
6c654faf  12994d28  share       2        stop     running  23h55m ago  14m35s ago
b71ca426  97e475a7  share       2        stop     running  23h56m ago  14m35s ago
657a34e4  97e475a7  share       0        stop     running  2d1m ago    23h56m ago
830ef2b0  12994d28  share       0        stop     running  2d2m ago    1d6m ago

here is a alloc that got a template restart but did not actually restart it.

nomad alloc status 6c654faf
ID                  = 6c654faf-0d55-9a77-62a7-fc12861397da
Eval ID             = 19766efd
Name                = share.share[0]
Node ID             = 12994d28
Node Name           = ip-172-31-1-124.service.consul
Job ID              = share
Job Version         = 2
Client Status       = running
Client Description  = Tasks are running
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 23h45m ago
Modified            = 4m46s ago
Deployment ID       = 50ef76c2
Deployment Health   = healthy

Task "share" is "running"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/768 MiB  300 MiB  http: 172.31.1.124:23358

Task Events:
Started At     = 2020-11-18T14:09:10Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type              Description
2020-11-19T08:50:10-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2020-11-19T08:33:10-05:00  Restart Signaled  Template with change_mode restart re-rendered
2020-11-18T09:09:10-05:00  Started           Task started by client
2020-11-18T09:09:09-05:00  Driver            Downloading image
2020-11-18T09:09:09-05:00  Task Setup        Building Task Directory
2020-11-18T09:09:08-05:00  Received          Task received by client

the job spec only wants to have a count of 2 running but you can see on this host i have 3 docker jobs that are running but each one of the these are dead.

[root@ip-172-31-x-xx ~]# docker ps | grep share
d0f8cc77608e        redacted/share-service:1.13.8              "/bin/sh -c 'env -S …"   24 hours ago        Up 24 hours         172.31.x.xx:23358->8080/tcp, 172.31.1.124:23358->8080/udp                                  share-6c654faf-0d55-9a77-62a7-fc12861397da
84393f5019e0        redacted/share-service:1.13.8              "/bin/sh -c 'env -S …"   2 days ago          Up 2 days           172.31.x.xx:25468->8080/tcp, 172.31.1.124:25468->8080/udp                                  share-830ef2b0-b721-94ce-eccb-4a71758d6433
f497b585644a        redacted/share-service:1.13.8              "/bin/sh -c 'env -S …"   3 days ago          Up 3 days           172.31.x.xx:30397->8080/tcp, 172.31.1.124:30397->8080/udp  

@marcotuna
Copy link

I am also using 1.0.0 beta3 and many of my jobs are getting pending, some of them are stuck in "Pending" during two days and from night to day they go to "Running"

@notnoop
Copy link
Contributor

notnoop commented Nov 30, 2020

@marcchua Thanks for reporting the issue. This ticket is my top priority for the week. Having more data will be extremely helpful. In particular, I'd appreciate capturing the following when an alloc is marked in a pending state:

  • The client logs (helpful in DEBUG mode)
  • a tarball of the data-dir of one of the servers as well as the client
  • the output of nomad alloc status of pending status
  • the stack trace of the client (obtained by kill -SIGABRT <<nomad pid>>)

You can email them to [email protected] if they contain some sensitive info that you prefer not to post here.

The data will help us pinpoint the problem quicker. Thank you so much!

@ghost
Copy link
Author

ghost commented Nov 30, 2020

from what i have identified is once i correct the group network depreciation configuration. documented here https://www.nomadproject.io/docs/upgrade/upgrade-specific#mbits-and-task-network-resource-deprecation and add the kill_signal to the docker task things seam to be working allot better.

     shutdown_delay = "30s"
     kill_signal = "SIGTERM"

without these changes i have had up to 4 versions of a job running and unable to cleanup the jobs without having to purge jobs or restart servers.

@notnoop notnoop added this to the 1.0 milestone Nov 30, 2020
@notnoop
Copy link
Contributor

notnoop commented Nov 30, 2020

Thanks @oopstrynow for the info. If you are able, can you reproduce the issue on your end and send us the info I requested in my last comment.

Also, did you also run into this issue with 1.0.0-beta1 (or an earlier version)?

@tgross
Copy link
Member

tgross commented Dec 1, 2020

@oopstrynow I'm seeing this line in the client logs:

Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.266Z [INFO] client.gc: garbage collecting allocation: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 reason="forced collection"

I did a little tracing through the code and that "reason" message looks like we should only see it if there was a request to the HTTP API (ref). While I definitely can't begin to suggest that's a root cause, is there any chance you have automation that's cleaning up allocations?

@tgross
Copy link
Member

tgross commented Dec 1, 2020

Quick status update on where we are with this investigation. I've been trying to reproduce locally based on the information we have so far. I haven't been able to do so yet, but I do at least have part of a proposed mechanism at this point:

  • An allocation is being marked as eligible for GC by the client
  • The client GC HTTP API call is fired
  • The GC removes the alloc runner from the list of eligible alloc runners for GC
  • The GC calls the alloc runner's Destroy method (There's a race condition here between the marking ineligible and calling Destroy, but I don't think that's involved.)
  • The client gets a concurrent (non-destructive?) update for this allocation from the server
  • The alloc runner's destroyImpl closes the task's kill context before the hooks run
  • The task hangs during killing b/c of faulty signal handling (pretty common), which makes the logs confusing but also means we can return from Destroy without a completed destroyImpl because we run that in a goroutine, and that means the GC can unblock.
  • We end up multiple "received task" events landing.

What I don’t know yet:

  • What is marking the alloc as eligible for GC in the first place
  • Why this would result in an orphaned task running.
jobspec for testing
job "example" {
  datacenters = ["dc1"]

  group "cache" {

    network {
      port "db" {
        to = 6379
      }
    }

    task "sidecar" {
      driver = "docker"

      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      config {
        image   = "docker.elastic.co/beats/filebeat:7.10.0"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
        #args  = ["-e"]
        volumes = [
          "local/filebeat.yml:/usr/share/filebeat/filebeat.yml",
        ]
      }

      template {
        change_mode = "restart"
        data        = <<EOH
#!/bin/bash
trap 'echo "received SIGINT"' INT

filebeat -e
EOH

        destination = "local/script.sh"
      }

      template {
        change_mode = "restart"

        data = <<EOH
http.enabled: false
monitoring.enabled: false

filebeat.inputs:
- type: log
  paths:
    - /alloc/logs/redis.stderr.[0-9]*
    - /alloc/logs/redis.stdout.[0-9]*

processors:
- decode_json_fields:
    fields: ["message"]
    target: "json"
    overwrite_keys: true

output.console:
  pretty: true
EOH

        destination = "local/filebeat.yml"
      }

      env {
        KEY = "X"
      }

      resources {
        cpu    = 256
        memory = 128
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        ports = ["db"]
      }

      env {
        KEY = "X"
      }

      resources {
        cpu    = 256
        memory = 128
      }

      template {
        change_mode = "restart"
        data        = <<EOF
{{ range service "nomad-client" }}
server {{ .Name }} {{ .Address }}:{{ .Port }}{{ end }}
EOF
        destination = "local/template1"
      }

      template {
        change_mode = "restart"
        data        = <<EOF
{{ range service "nomad" }}
server {{ .Name }} {{ .Address }}:{{ .Port }}{{ end }}
EOF
        destination = "local/template2"
      }
    }

  }
}

@ghost
Copy link
Author

ghost commented Dec 1, 2020

@oopstrynow I'm seeing this line in the client logs:

Nov 17 12:05:58 ip-172-31-3-147 nomad-vault.sh: 2020-11-17T12:05:58.266Z [INFO] client.gc: garbage collecting allocation: alloc_id=ae7381fd-7190-1b78-ef40-d9b1a0742849 reason="forced collection"

I did a little tracing through the code and that "reason" message looks like we should only see it if there was a request to the HTTP API (ref). While I definitely can't begin to suggest that's a root cause, is there any chance you have automation that's cleaning up allocations?

i do not have anything cleaning up old allocations. maybe a new task is attempting to remove the older version of the task?

@ghost
Copy link
Author

ghost commented Dec 1, 2020

Sorry i did not get back to you sooner. my goal is to work on reproducing this with a 0.12.8 server/client with a job file that does not include the 0.12.8 network changes that move the network to the group. then update the server to 1.0.00-beta3 and i suspect this will reproduce the alloc problems i found. below is what my fabio config looks like minus some vault parts i removed.

job "fabio" {
  datacenters = ["dc1"]
  type        = "system"

  update {
    max_parallel     = 1
    min_healthy_time = "2m"
    healthy_deadline = "3m"
    auto_revert      = true
  }

  constraint {
    attribute = "${meta.environment}"
    operator  = "="
    value     = "nomad"
  }

  priority = 100

  group "fabio" {
    task "fabio" {

      template {
        data = <<EOH
log.access.target=stdout
EOH

        destination = "local/fabio.properties"
      }

      driver = "docker"

      config {
        force_pull = true
        image      = "fabiolb/fabio:1.5.13-go1.13.4"

        volumes = [
          "local/fabio.properties:/etc/fabio/fabio.properties"
        ]

        network_mode = "host"
      }

      resources {
        cpu    = 100
        memory = 500

        network {
          mbits = 100

          port "lb" {
            static = 9999
          }

          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}

@tgross
Copy link
Member

tgross commented Dec 2, 2020

i do not have anything cleaning up old allocations. maybe a new task is attempting to remove the older version of the task?

Yeah, with some fresh eyes I'm seeing that we can hit that if the server gives the client instructions to remove it in (*Client).runAllocs.

@tgross
Copy link
Member

tgross commented Dec 2, 2020

@oopstrynow I'm digging into the interaction between Nomad and Docker and its signal handling a bit and I want to mirror the OS/dockerd as closely as I can. Are you using the Docker that ships with the Amazon Linux AMI you're using? Something like:

[ec2-user@ip-172-31-38-177 ~]$ uname -a
Linux ip-172-31-38-177.ec2.internal 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[ec2-user@ip-172-31-38-177 ~]$ docker --version
Docker version 19.03.13-ce, build 4484c46

@ghost
Copy link
Author

ghost commented Dec 2, 2020

here is the kernel and docker version we have.

[root@ip-172-31-2-207 ~]# uname -a
Linux ip-172-31-2-207.service.consul 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@ip-172-31-2-207 ~]# docker --version
Docker version 19.03.13-ce, build 4484c46

@tgross
Copy link
Member

tgross commented Dec 2, 2020

Ok, so at this point I have a partial reproduction and one newly found Nomad bug to fix.

Our Docker driver has a method (*taskHandle) Kill which takes the kill timeout. This function calls (*taskHandle) Signal and then waits for the kill timeout duration. However, that Signal method does not directly trigger an asynchronous Unix signal; it sends a Docker API request to dockerd. The go-dockerclient library's KillContainer method that sends the API request accept a context for the timeout, but we never configure that timeout. If the Docker API were to hang, we would also hang before we ever got to the kill timeout.

As it turns out, there are recently-opened bugs for this version of Docker that involve just that!

With that in hand, I deployed a build of Nomad with the go-dockerclient call modified so that it could hang for up to 30s (long past our kill timeout). After deploying a job that traps signals updates were slow, but combining that with a client restart I ended up with a Nomad client that lost the plot.

The allocations that are restored after the client restart all have multiple Task Received events just as we see above. And I end up with multiple allocations in a running state!

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
b99e1af4  a27dcd9d  cache       13       run      pending   6m10s ago   6m10s ago
5708a572  a27dcd9d  cache       12       stop     running   6m18s ago   6m7s ago
6a3a5ba1  a27dcd9d  cache       12       run      running   13m11s ago  6m7s ago
54cb0433  a27dcd9d  cache       11       stop     complete  13m45s ago  6m18s ago

Including their Docker containers:

$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                          NAMES
79630582c10d        redis:3.2           "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        172.31.38.177:30864->6379/tcp, 172.31.38.177:30864->6379/udp   redis-5708a572-497e-b835-90a9-a99e6323780f
e69a9b10ae23        redis:3.2           "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        172.31.38.177:25187->6379/tcp, 172.31.38.177:25187->6379/udp   redis-6a3a5ba1-cd8e-f787-8b7a-40ab0810a4ed

This state persists for as long as the deployment progress deadline (10min), at which point the Nomad servers intervene and force everything to get cleaned up.

I took a goroutine dump of Nomad in this state and I find that I have two task runners running, one of them stuck on killTask:

goroutine 623 [sleep]:
time.Sleep(0x1bf08eb000)
	runtime/time.go:188 +0xbf
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).killTask(0xc00090b600, 0xc000a6cc00, 0x2, 0x0)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:937 +0x91
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).handleKill(0xc00090b600, 0xc000bf1c28)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:889 +0xbf
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc00090b600)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:560 +0x1113
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:358 +0xa5
goroutine 602 [select, 4 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc00094adc0)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:556 +0xf52
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:358 +0xa5

It's not a 100% reproduction, because Nomad does eventually seem to get its act together in this repro. So either the job I'm using doesn't quite replicate the circumstances or maybe we need to block forever at the Docker API to replicate it.

I'll work up a PR to add the timeout we need to the Docker driver... this is a longstanding bug but it may be that @oopstrynow is hitting it now because of an unlucky combination of the Docker bug and our changing the default signal handling behavior which impacted his application.

I'm not convinced we've got the entire story yet but we're getting closer.

@ghost
Copy link
Author

ghost commented Dec 2, 2020

@tgross wow this is awesome. thank you for digging into this issue

@tgross
Copy link
Member

tgross commented Dec 2, 2020

I've just merged #9502, which will ship in the release candidate we're building later today for likely release tomorrow.

@tgross tgross removed this from the 1.0 milestone Dec 8, 2020
@tgross
Copy link
Member

tgross commented Dec 14, 2020

@notnoop and I had a discussion with @oopstrynow and I think we're in a good spot with this and #9376 now:

I'm going to close this issue and #9376. Thanks for your help @oopstrynow and please let us know if you see anything like this again!

@tgross tgross closed this as completed Dec 14, 2020
@jorgemarey
Copy link
Contributor

Hi @tgross this just happened to us. An allocation kept on pending state on a new deployment due to another one not stopping.

Nomad version 1.2.6

Here are the allocation events:

Aug 25, '22 07:56:41 +0200	Killed	Task successfully killed
Aug 25, '22 07:56:41 +0200	Terminated	Exit Code: 0
Aug 25, '22 07:45:19 +0200	Killing	Sent interrupt. Waiting 5s before force killing
Jul 15, '22 00:06:03 +0200	Started	Task started by client

Relevant logs:

Log Date: 25/08/2022 07:45:24:734
Message: Container 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb failed to exit within 5 seconds of signal 15 - using the force

Log Date: 25/08/2022 07:45:34:767
Message: Container 10a7fec1ed3b failed to exit within 10 seconds of kill - trying direct SIGKILL

Log Date: 25/08/2022 07:45:34:767
Message: stopped container
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

At this point after a few minutes I run:
$ docker rm -f <container-id>

Log Date: 25/08/2022 07:56:30:841
Message: Container 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb failed to exit within 3 seconds of signal 15 - using the force

Log Date: 25/08/2022 07:56:41:122
Message: Container 10a7fec1ed3b failed to exit within 10 seconds of kill - trying direct SIGKILL

Log Date: 25/08/2022 07:56:41:618
Message: failed to inspect container
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb
error: No such container: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

Log Date: 25/08/2022 07:56:41:621
Message: container was removed out of band, will proceed with DestroyTask
container_id: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb
error: No such container: 10a7fec1ed3bb1c737a5abc4ff863acd3a881e3ccdb31b13fa84c17c82d3f5cb

I don't know if this is a docker problem, but it was killed correctly with the docker rm -f but nomad wasn't able to kill it.

I got a dump of the goroutines before the docker rm and noticed a couple that I think are relevant.

goroutine 146856643 [chan receive, 30035 minutes]:
github.com/fsouza/go-dockerclient.(*Client).hijack.func3()
	github.com/fsouza/[email protected]/client.go:862 +0x28
github.com/fsouza/go-dockerclient.waiterFunc.Wait(0x0)
	github.com/fsouza/[email protected]/client.go:725 +0x1a
github.com/fsouza/go-dockerclient.(*Client).StartExec(0x0, {0xc007535480, 0x0}, {{0x2cfa940, 0xc0006250a0}, {0x2cfa960, 0xc0006250b8}, {0x2cfa960, 0xc0006250c8}, 0x0, ...})
	github.com/fsouza/[email protected]/exec.go:105 +0x53
github.com/hashicorp/nomad/drivers/docker.(*Driver).ExecTaskStreaming(0xc000458280, {0x2d3aef8, 0xc00462ee00}, {0xc0069b7500, 0x3e}, 0xc000ae85a0)
	github.com/hashicorp/nomad/drivers/docker/driver.go:1624 +0x55b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).ExecStreaming(0xc004321b40, {0x2d3aef8, 0xc00462ee00}, {0xc0013cbae0, 0x1, 0x1}, 0x2, {0x2d0b5b8, 0xc0096e3060})
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:88 +0x15a
github.com/hashicorp/nomad/client.(*Allocations).execImpl(0xc0000100a8, 0xc0009d1bc0, 0xc0083ea140, {0xc0071a3c80, 0x24})
	github.com/hashicorp/nomad/client/alloc_endpoint.go:265 +0xa23
github.com/hashicorp/nomad/client.(*Allocations).exec(0xc0000100a8, {0x7fa6cc14e350, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/alloc_endpoint.go:149 +0x23e
github.com/hashicorp/nomad/client.(*Client).handleStreamingConn(0xc0009b82c0, {0x2d67ac0, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/rpc.go:407 +0x5b9
github.com/hashicorp/nomad/client.(*Client).handleConn(0xc0009b82c0, {0x2d67ac0, 0xc0013cb9d0})
	github.com/hashicorp/nomad/client/rpc.go:339 +0x16f
created by github.com/hashicorp/nomad/client.(*Client).listenConn
	github.com/hashicorp/nomad/client/rpc.go:315 +0xa8


goroutine 115349747 [select, 8 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).handleKill(0xc0067d6580, 0xc004294360)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:963 +0x51b
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc0067d6580)
	github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:585 +0xf9f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
	github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:375 +0x71

Could a exec session be left there and that is blocking the container from stopping?

@tgross
Copy link
Member

tgross commented Aug 25, 2022

@jorgemarey can you open a new issue for this?

@jorgemarey
Copy link
Contributor

Yes, of course!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants