Nomad has running deployments for non existant jobs #4520

Closed
wiedenmeier opened this issue Jul 18, 2018 · 4 comments

@wiedenmeier

Nomad has running deployments listed for jobs which were stopped several months ago and have long since been garbage collected. It should not be possible to have running deployments for non-existent jobs; they should have been canceled when the job was stopped and garbage collected.

Nomad Version: Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)

This causes error messages in the nomad server logs such as:

Jul 17 22:13:15 ip-____________.us-west-2.compute.internal nomad[17598]: 2018/07/17 22:13:15.576347 [ERR] nomad.deployments_watcher: failed to track deployment "80ea9bf5-5eb6-9555-e94a-af6920c2cfbc": deployment "80ea9bf5-5eb6-9555-e94a-af6920c2cfbc" references unknown job "zoolander-qa-_________________-job"

Jul 17 22:13:15 ip-____________.us-west-2.compute.internal nomad[17598]: 2018/07/17 22:13:15.576433 [ERR] nomad.deployments_watcher: failed to track deployment "952f1c85-fe80-8c73-9022-094d5143912e": deployment "952f1c85-fe80-8c73-9022-094d5143912e" references unknown job "zoolander-qa-_________________-job"

Jul 17 22:13:15 ip-____________.us-west-2.compute.internal nomad[17598]: 2018/07/17 22:13:15.576485 [ERR] nomad.deployments_watcher: failed to track deployment "a80eed84-068e-7b41-f858-911f80130fa9": deployment "a80eed84-068e-7b41-f858-911f80130fa9" references unknown job "zoolander-qa-_________________-job"
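
To collect the affected deployment IDs from log lines like these, a small parser can help. The following is a minimal sketch, assuming the message format shown in the sample lines above; the function name `parse_untracked` and the regular expression are illustrative and not part of Nomad.

```python
import re

# Matches the watcher error shown above. The pattern is an assumption based
# on these sample lines, not on a documented Nomad log format.
UNTRACKED = re.compile(
    r'failed to track deployment "(?P<deployment>[0-9a-f-]{36})": '
    r'deployment "[0-9a-f-]{36}" references unknown job "(?P<job>[^"]+)"'
)


def parse_untracked(lines):
    """Return a sorted list of unique (deployment_id, job_id) pairs."""
    found = set()
    for line in lines:
        match = UNTRACKED.search(line)
        if match:
            found.add((match.group("deployment"), match.group("job")))
    return sorted(found)
```

Since the watcher logs the same deployments repeatedly, the pairs are deduplicated before being returned.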
-[w-vm]- -[~]-
-[10:39 AM]-nomad deployment status 952f1c85-fe80-8c73-9022-094d5143912e
ID          = 952f1c85
Job ID      = zoolander-qa-_________________-job
Job Version = 1
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
service     1        0       0        0
-[w-vm]- -[~]-
-[10:41 AM]-nomad status zoolander-qa-_________________-job
Unable to resolve ID: "zoolander-qa-_________________-job"
-[w-vm]- -[~]-
-[10:41 AM]-curl ${NOMAD_ADDR}/v1/deployment/952f1c85-fe80-8c73-9022-094d5143912e | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   460  100   460    0     0   4509      0 --:--:-- --:--:-- --:--:--  4554
{
  "CreateIndex": 1031168,
  "ID": "952f1c85-fe80-8c73-9022-094d5143912e",
  "JobCreateIndex": 699544,
  "JobID": "zoolander-qa-_________________-job",
  "JobModifyIndex": 1029796,
  "JobVersion": 1,
  "ModifyIndex": 1031168,
  "Namespace": "default",
  "Status": "running",
  "StatusDescription": "Deployment is running",
  "TaskGroups": {
    "service": {
      "AutoRevert": false,
      "DesiredCanaries": 0,
      "DesiredTotal": 1,
      "HealthyAllocs": 0,
      "PlacedAllocs": 0,
      "PlacedCanaries": null,
      "Promoted": false,
      "UnhealthyAllocs": 0
    }
  }
}
-[w-vm]- -[~]-
-[10:42 AM]-curl ${NOMAD_ADDR}/v1/job/zoolander-qa-_________________-job
job not found
-[w-vm]- -[~]-
-[10:42 AM]-curl ${NOMAD_ADDR}/v1/job/zoolander-qa-_________________-job/deployments | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6455    0  6455    0     0  50826      0 --:--:-- --:--:-- --:--:-- 51230
[
  {
    "CreateIndex": 1030352,
    "ID": "0ec29870-d92d-2c81-e02a-ab398edc353c",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030352,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030298,
    "ID": "18e52c00-9609-2cf7-5375-19b5d06bfd18",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030298,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030321,
    "ID": "4cb195c7-8130-e5ef-cbb7-fe51d7b95a62",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030321,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030335,
    "ID": "4ec10131-5ba3-1ee7-81c3-38a684366dfa",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030335,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030348,
    "ID": "513f5a36-a590-30a8-453c-90f291f3027a",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030348,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030311,
    "ID": "5914c2db-7358-2c8a-334a-ea2114e62943",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030311,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030273,
    "ID": "5f53097b-7f48-02d4-197a-263aff49a2ff",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 1,
    "ModifyIndex": 1030273,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030051,
    "ID": "7bb4e814-67cd-acc3-6601-441890a3a1d8",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030051,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030383,
    "ID": "80ea9bf5-5eb6-9555-e94a-af6920c2cfbc",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030383,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1031168,
    "ID": "952f1c85-fe80-8c73-9022-094d5143912e",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 1,
    "ModifyIndex": 1031168,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030675,
    "ID": "a80eed84-068e-7b41-f858-911f80130fa9",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 1,
    "ModifyIndex": 1030675,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1031087,
    "ID": "af867b89-1faf-9588-420e-684576ed0ebf",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 1,
    "ModifyIndex": 1031087,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030308,
    "ID": "d4b0783e-84fe-0671-49fe-19b386323e25",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030308,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  },
  {
    "CreateIndex": 1030317,
    "ID": "f862ad54-e422-8e66-7af7-7e6aa8982843",
    "JobCreateIndex": 699544,
    "JobID": "zoolander-qa-_________________-job",
    "JobModifyIndex": 1029796,
    "JobVersion": 2,
    "ModifyIndex": 1030317,
    "Namespace": "default",
    "Status": "running",
    "StatusDescription": "Deployment is running",
    "TaskGroups": {
      "service": {
        "AutoRevert": false,
        "DesiredCanaries": 0,
        "DesiredTotal": 1,
        "HealthyAllocs": 0,
        "PlacedAllocs": 0,
        "PlacedCanaries": null,
        "Promoted": false,
        "UnhealthyAllocs": 0
      }
    }
  }
]
@chelseakomlo
Contributor

chelseakomlo commented Jul 18, 2018

Thanks for reporting this. Can you please include further steps on how to reproduce this issue? For example, a sample job config, how the job was stopped/garbage collected, and the amount of time before you started to see this issue occur would all be helpful.

@wiedenmeier
Author

Thanks for taking a look at this so quickly. Unfortunately, I do not have reliable steps to reproduce this issue, as it was only recently found in our infrastructure. In the next week or so I will hopefully have time to investigate this as well as #4299, which I suspect might be related, as both result in components of a job persisting after the job has ended and been garbage collected. In both cases there are 'orphaned' components of a job present that seem to reference the job they belong to, but might not be referenced by the job that owns them. It should be easy enough to automatically detect when this has happened; I'll get a script posted here to do that when I get to it.
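
One way such a detection script might look is sketched below. This is a hedged illustration and not the script the author refers to. It assumes the deployment list comes from `GET /v1/deployments` and the job list from `GET /v1/jobs`, that no ACL token is required, and that entries without a `Namespace` field belong to `default`. The helper names `fetch_json` and `find_orphaned_deployments` are hypothetical.

```python
import json
import urllib.request


def fetch_json(addr, path):
    """GET a Nomad HTTP API path and decode the JSON body (no ACL token sent)."""
    with urllib.request.urlopen(addr.rstrip("/") + path) as resp:
        return json.load(resp)


def find_orphaned_deployments(deployments, jobs):
    """Return deployments whose (Namespace, JobID) matches no registered job.

    Assumption: a missing "Namespace" key means the "default" namespace.
    """
    known = {(j.get("Namespace", "default"), j["ID"]) for j in jobs}
    return [
        d for d in deployments
        if (d.get("Namespace", "default"), d["JobID"]) not in known
    ]
```

With `NOMAD_ADDR` set, the two lists could be obtained with `fetch_json(addr, "/v1/deployments")` and `fetch_json(addr, "/v1/jobs")` and passed to `find_orphaned_deployments`. Keeping the comparison separate from the HTTP calls makes it possible to check the logic against saved API responses.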

The job was most likely stopped via the nomad CLI, but I can't verify that 100% because it happened quite a while ago. The job was stopped on March 7th; however, the oldest error like this I can find in my logs is from yesterday at 16:47 UTC. I don't believe that this is a log retention issue, so I will try to determine what could have triggered these errors to start occurring. They have been frequent since they started. This is also happening on several other jobs, all with almost identical configs except for the customer/region.

Here is the final version of that job file before it was stopped:

job "zoolander-qa-_________________-job" {
  datacenters = ["pdx2"]
  type = "service"

  group "service" {
    ephemeral_disk = {
      size = 110
    }
    task "zoolander-qa-_________________" {
      driver = "docker"
      config = {
        image = "172.31.26.8:5100/zoolander:qa"
        dns_servers = ["172.17.42.1"]
        dns_search_domains = ["service.consul"]
        force_pull = true
        port_map {
          http = 5000
        }
      }
      service {
        tags = [
          "versionable",
          "ENVIRONMENT:qa",
          "CUSTOMER:________",
          "REGION:________",
          "TAG:qa",
          "CODE:python"
        ]
        port = "http"
        name = "${TASK}"
        check {
          interval = "30s"
          timeout = "30s"
          type = "http"
          path = "/private/ruok"
        }
      }
      resources {
        memory = 4500
        cpu = 100
        network {
          mbits = 1
          port "http" {}
        }
      }
      meta {
        ENVIRONMENT = "qa"
        CUSTOMER = "________"
        REGION = "________"
        TAG = "qa"
      }
      env {
        CONSUL_HOST_ADDR = "172.17.42.1:8500"
        CONFIG_TEMPLATE_FILE = "default-prod.yaml.ctmpl"
        NOVI_APP_NAME = "${NOMAD_TASK_NAME}"
        NOVI_APP_ENVIRONMENT = "${NOMAD_META_ENVIRONMENT}"
        NOVI_APP_CUSTOMER = "${NOMAD_META_CUSTOMER}"
        NOVI_APP_REGION = "${NOMAD_META_REGION}"
      }
    }

    count = 1
    restart {
      attempts = 3
      interval = "30s"
      delay = "2s"
      mode = "delay"
    }
    constraint {
      distinct_hosts = true
    }
  }

  update {
    max_parallel = 1
    stagger = "2m"
  }
}

@dadgar
Contributor

dadgar commented Jul 18, 2018

Hey, in some earlier releases of Nomad we had a data race where multiple deployments could be created. This has since been remedied, and 0.8.4 includes a fix to remove the leaked deployments: #4329

So please upgrade to 0.8.4. If the issue is still there, let us know and we can re-open.
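
To check whether a given server already runs a release with that fix, the reported version string can be compared against 0.8.4. The following is a rough sketch, assuming the version appears in the form `Nomad v0.8.3 (...)` as in the original report; the function names are hypothetical, and pre-release suffixes such as `-rc1` are ignored.

```python
import re


def parse_nomad_version(text):
    """Extract (major, minor, patch) from a string such as 'Nomad v0.8.3 (...)'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_leaked_deployment_fix(text, fixed_in=(0, 8, 4)):
    """True if the reported version is at or after the release named above."""
    return parse_nomad_version(text) >= fixed_in
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating 0.10.0 as older than 0.8.4.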

@dadgar dadgar closed this as completed Jul 18, 2018
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022