System job keeps running after I try to remove it from a DC #11373

Closed
mikehardenize opened this issue Oct 22, 2021 · 4 comments · Fixed by #11391

@mikehardenize

Nomad version

Nomad v1.1.5 (117a23d)

Operating system and Environment details

CentOS 7

Issue

I have two Nomad agents in different DCs: one in us-east4-a and another in us-east4-b.
I created a system job, but it only had datacenters = ["us-east4-a"] so it only ran on one of the agents.
I then updated the job to contain datacenters = ["us-east4-a", "us-east4-b"] and re-ran it. It then started running on both agents (as expected).
However, I then switched it back to datacenters = ["us-east4-a"] and re-ran the job, and it unexpectedly continued running on the us-east4-b agent.
When I run "nomad status jobname", the output shows "Datacenters = us-east4-a", but it still lists an allocation for each agent:

# nomad status traefik
ID            = traefik
Name          = traefik
Submit Date   = 2021-10-22T09:26:15Z
Type          = system
Priority      = 50
Datacenters   = us-east4-a
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
traefik     0       0         2        7       25        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
aa4cea90  004fd7be  traefik     42       run      running  5m52s ago   5m48s ago
629a220d  51bd0d56  traefik     43       run      running  14d18h ago  3m17s ago
@DerekStrickland
Contributor

Hi @mikehardenize,

Thanks for using Nomad! Would you mind posting your full job file (without any secrets) for me to take a look at?

@mikehardenize
Author

job "traefik" {

    type = "system"

    datacenters = ["us-east4-a"]
    
    constraint {
        attribute = "${node.class}"
        value     = "job"
    }

    group "traefik" {
        
        network {
            port "http" {
                static = 80
            }
            port "https" {
                static = 443
            }
        }

        volume "traefik" {
            type      = "host"
            read_only = false
            source    = "traefik"
        }

        task "traefik" {
            driver = "docker"

            service {
                name = "traefik-http"
                port = "http"
                check {
                    type     = "http"
                    path     = "/ping"
                    interval = "5s"
                    timeout  = "2s"
                }
            }

            volume_mount {
                volume      = "traefik"
                destination = "/host"
                read_only   = false
            }

            config {
                image = "traefik:2.5"
                cap_add = ["net_raw"]
                ports        = ["http", "https"]
                network_mode = "host"
                dns_servers  = ["127.0.0.1"]
                auth_soft_fail = true
            }

        }
    }
}

@notnoop
Contributor

notnoop commented Oct 25, 2021

Thanks @mikehardenize for reporting the bug. I was able to reproduce it and identify the causes. We'll have a fix PR soon.

notnoop pushed a commit that referenced this issue Oct 27, 2021
The system scheduler should leave allocs on draining nodes as-is, but
stop allocs on nodes that are no longer part of the job
datacenters.

Previously, the scheduler did not make the distinction and left system
job allocs intact if they were already running.

I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 .

Fixes #11373
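
For readers following along, the distinction described in the commit message can be sketched roughly as below. This is only an illustration and not the code from the fix PR: the `Node` and `Decision` types, the `decide` and `inDatacenters` functions, and the node IDs are all hypothetical stand-ins, not Nomad's real structs or scheduler API. It assumes the only question is whether the node's datacenter is still listed in the job.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a client node.
// It is NOT Nomad's actual node struct.
type Node struct {
	ID         string
	Datacenter string
	Draining   bool
}

// Decision is a hypothetical label for what happens to an existing
// system job allocation on a node.
type Decision string

const (
	Keep Decision = "keep"
	Stop Decision = "stop"
)

// inDatacenters reports whether dc appears in the job's datacenter list.
func inDatacenters(jobDCs []string, dc string) bool {
	for _, d := range jobDCs {
		if d == dc {
			return true
		}
	}
	return false
}

// decide stops an allocation when its node's datacenter is no longer part
// of the job's datacenters. Otherwise the allocation is kept, including on
// draining nodes, which this sketch leaves as-is (assumption: draining is
// handled elsewhere and does not by itself stop the alloc here).
func decide(jobDCs []string, n Node) Decision {
	if !inDatacenters(jobDCs, n.Datacenter) {
		return Stop
	}
	return Keep
}

func main() {
	// Job narrowed back to a single datacenter, as in the report.
	jobDCs := []string{"us-east4-a"}

	// Hypothetical nodes; the IDs are made up for illustration.
	nodes := []Node{
		{ID: "node-a", Datacenter: "us-east4-a"},
		{ID: "node-a-draining", Datacenter: "us-east4-a", Draining: true},
		{ID: "node-b", Datacenter: "us-east4-b"},
	}

	for _, n := range nodes {
		fmt.Printf("%s (%s): %s\n", n.ID, n.Datacenter, decide(jobDCs, n))
	}
}
```

Under these assumptions, the allocation on the us-east4-b node is the one that gets stopped once the job lists only us-east4-a, which matches the behavior the reporter expected.
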
@tgross tgross added this to the 1.2.0 milestone Nov 8, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022