Preempted dispatch alloc is not replaced after resources become available #9890

Closed
tgross opened this issue Jan 26, 2021 · 3 comments · Fixed by #13205
Labels: hcc/cst, stage/accepted, theme/batch, theme/preemption, type/bug

tgross commented Jan 26, 2021

When a dispatch job allocation is displaced by preemption, it is never replaced, even after resources become available.

Users have reported that they expect the dispatch job to be displaced only temporarily. The evicted allocation should be replaced, with the replacement held in a blocked state until resources become available, just as happens with placement failures.

This is borderline bug/enhancement because the behavior is not well-defined in the documentation, but it's certainly surprising to users.

To reproduce, run Nomad and enable batch preemption:

curl -XPUT -d '{"PreemptionConfig": {"BatchSchedulerEnabled": true }}' \
    "localhost:4646/v1/operator/scheduler/configuration"

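To double-check that the setting took effect, the same endpoint can be read back with a GET. The helper below is only a sketch: `batch_preemption_enabled` is a name made up for this example, and the text match assumes the response carries a `BatchSchedulerEnabled` key like the one in the PUT payload above.

```shell
# Hypothetical helper: reads scheduler-configuration JSON on stdin and
# succeeds only if it contains "BatchSchedulerEnabled": true.
# This is a plain text match, not a JSON parser.
batch_preemption_enabled() {
  tr -d ' \t\n' | grep -q '"BatchSchedulerEnabled":true'
}

# Example against the local agent used above (needs a running Nomad):
#   curl -s "localhost:4646/v1/operator/scheduler/configuration" \
#     | batch_preemption_enabled && echo "batch preemption is enabled"
```
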
Verify the resources available:

$ nomad node status -self
...
Allocated Resources
CPU          Memory       Disk
0/18424 MHz  0 B/1.9 GiB  0 B/44 GiB

Low-priority job:

job "low" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {
    meta_optional = ["test"]
  }

  group "group" {

    task "task" {
      driver = "exec"
      config {
        command = "bash"
        args    = ["-c", "sleep 120"]
      }
      resources {
        memory = 1000
      }
    }
  }
}

High-priority job, with memory requirements that force pre-emption:

job "high" {
  datacenters = ["dc1"]
  type        = "batch"

  priority = 80

  parameterized {
    meta_optional = ["test"]
  }

  group "group" {

    task "task" {
      driver = "exec"
      config {
        command = "bash"
        args    = ["-c", "sleep 120"]
      }
      resources {
        memory = 1500
      }
    }
  }
}
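
For the arithmetic behind the forced eviction: the node above reports 1.9 GiB of memory, and the two jobs ask for 1000 and 1500. A rough check, assuming the `memory` values can be treated as MiB and ignoring any memory the client reserves for itself:

```shell
# 1.9 GiB expressed in MiB using integer math (19 * 1024 / 10 = 1945).
node_mib=$(( 19 * 1024 / 10 ))
low_mib=1000    # memory requested by the "low" job
high_mib=1500   # memory requested by the "high" job
combined_mib=$(( low_mib + high_mib ))

if [ "$combined_mib" -gt "$node_mib" ]; then
  both_fit=no   # "high" can only be placed by evicting "low"
else
  both_fit=yes
fi
echo "node=${node_mib}MiB combined=${combined_mib}MiB both_fit=${both_fit}"
# prints: node=1945MiB combined=2500MiB both_fit=no
```

Each job fits on the node by itself, but the two cannot run at the same time.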

Register both jobs.

$ nomad job run ./low.nomad
Job registration successful
$ nomad job run ./high.nomad
Job registration successful

Dispatch the low-priority job and note that it's running:

$ nomad job dispatch low
Dispatched Job ID = low/dispatch-1611671592-fa7829a8
Evaluation ID     = b6423703

==> Monitoring evaluation "b6423703"
    Evaluation triggered by job "low/dispatch-1611671592-fa7829a8"
    Allocation "2ce23ac8" created: node "5ba86c7d", group "group"
==> Monitoring evaluation "b6423703"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b6423703" finished with status "complete"

$ nomad job status
ID                                Type                 Priority  Status   Submit Date
high                              batch/parameterized  80        running  2021-01-26T09:33:04-05:00
low                               batch/parameterized  50        running  2021-01-26T09:33:01-05:00
low/dispatch-1611671592-fa7829a8  batch                50        running  2021-01-26T09:33:12-05:00

While that job is still running, dispatch the high-priority job and note that the low-priority dispatched job is now dead because it has been evicted:

$ nomad job dispatch high
Dispatched Job ID = high/dispatch-1611671603-e1b8559d
Evaluation ID     = 1b8c7b4f

==> Monitoring evaluation "1b8c7b4f"
    Evaluation triggered by job "high/dispatch-1611671603-e1b8559d"
    Allocation "e0b4c1b2" created: node "5ba86c7d", group "group"
==> Monitoring evaluation "1b8c7b4f"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1b8c7b4f" finished with status "complete"

$ nomad job status
ID                                 Type                 Priority  Status   Submit Date
high                               batch/parameterized  80        running  2021-01-26T09:33:04-05:00
high/dispatch-1611671603-e1b8559d  batch                80        running  2021-01-26T09:33:23-05:00
low                                batch/parameterized  50        running  2021-01-26T09:33:01-05:00
low/dispatch-1611671592-fa7829a8   batch                50        dead     2021-01-26T09:33:12-05:00

$ nomad job status low/dispatch-1611671592-fa7829a8
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
2ce23ac8  5ba86c7d  group       0        evict    complete  1m42s ago  1m31s ago
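
The evicted allocation can be picked out of that table mechanically. A small sketch, assuming the column layout shown above (Desired is the fifth whitespace-separated field of each allocation row); `evicted_allocs` is a name made up for this example:

```shell
# Reads `nomad job status <job>` output on stdin and prints the ID of every
# allocation row whose Desired column is "evict".
evicted_allocs() {
  awk '$5 == "evict" { print $1 }'
}

# Example (needs a running Nomad):
#   nomad job status low/dispatch-1611671592-fa7829a8 | evicted_allocs
```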

Wait for the high-priority job to complete and note that the low-priority job is not replaced:

$ nomad job status
ID                                 Type                 Priority  Status   Submit Date
high                               batch/parameterized  80        running  2021-01-26T09:33:04-05:00
high/dispatch-1611671603-e1b8559d  batch                80        dead     2021-01-26T09:33:23-05:00
low                                batch/parameterized  50        running  2021-01-26T09:33:01-05:00
low/dispatch-1611671592-fa7829a8   batch                50        dead     2021-01-26T09:33:12-05:00
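
To watch for the missing replacement without re-running the status command by hand, something like the following can poll for it. This is a sketch only: `has_live_alloc` and `wait_for_replacement` are hypothetical helper names, and the parsing assumes the allocation table layout shown earlier in this report.

```shell
# Succeeds if stdin (output of `nomad job status <job>`) has an allocation row
# with Desired == "run" whose Status is "pending" or "running".
has_live_alloc() {
  awk '$5 == "run" && ($6 == "pending" || $6 == "running") { found = 1 }
       END { exit (found ? 0 : 1) }'
}

# wait_for_replacement JOB_ID [TRIES] [DELAY_SECONDS]
# Polls until a live allocation shows up; returns 1 if none appears in time.
wait_for_replacement() {
  job=$1
  tries=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if nomad job status "$job" | has_live_alloc; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Example (needs a running Nomad):
#   wait_for_replacement low/dispatch-1611671592-fa7829a8 || echo "no replacement"
```

With the behavior described in this issue, the loop runs out of tries and returns 1.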

Fuco1 commented Apr 29, 2022

What I observed is that it doesn't even need to be a dispatched job; any batch job will do.

I just started two batch jobs with count = 1400, one with a priority 10 lower than the other. The allocations from the low-priority job were quickly preempted, but they never returned after the high-priority job finished (and about 300 allocations went straight into a failed state and never recovered). In the end Nomad reported 1000 queued tasks, but nothing was happening.

Nomad version is 1.2.6.


shoenig commented May 31, 2022

Indeed, I'm able to reproduce this with just a normal batch job. It seems that when the alloc is evicted, Nomad doesn't queue up a replacement.

nomad.hcl:

client {
  enabled = true
}

server {
  enabled = true
  default_scheduler_config {
    preemption_config {
      service_scheduler_enabled = true
      batch_scheduler_enabled = true
    }
  }
}

low.nomad:

job "low" {
  datacenters = ["dc1"]
  priority = 50
  type = "batch"

  group "group" {
    count = 3
    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args = ["10000"]
      }

      resources {
        cpu    = 500
        memory = 10000
      }
    }
  }
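}

high.nomad:

job "high" {
  datacenters = ["dc1"]
  # Must be at least 10 above the low job's priority (50) for preemption;
  # 80 matches the first reproduction above.
  priority = 80
  type = "batch"

  group "group" {
    count = 3
    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args = ["30"]
      }

      resources {
        cpu    = 500
        memory = 10000
      }
    }
  }
}

Summary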
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
group       0       0         2        0       1         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
01fd9a9e  5fedcd77  group       0        evict    complete  6m48s ago  6m24s ago
7125b2c9  5fedcd77  group       0        run      running   6m48s ago  6m44s ago
7cc0bebb  5fedcd77  group       0        run      running   6m48s ago  6m44s ago
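
Reading that output: the group has count = 3, but only two allocations still have a desired status of run and Queued is 0, so one allocation is unaccounted for. A sketch of that bookkeeping, assuming the table layout above; `missing_allocs` is a made-up helper name:

```shell
# missing_allocs COUNT
# Reads `nomad job status <job>` output on stdin and prints
# COUNT - (rows with Desired == "run") - (Queued from the Summary table).
missing_allocs() {
  awk -v want="$1" '
    /^Task Group +Queued/ { getline; queued = $2; next }  # Summary data row
    $5 == "run" { live++ }                                 # allocation rows
    END { print want - live - queued }
  '
}

# Example (needs a running Nomad):
#   nomad job status low | missing_allocs 3
```

For the output above this prints 1. A non-zero result with nothing queued means no replacement is waiting for the evicted allocation.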

shoenig added a commit that referenced this issue Jun 2, 2022
This PR fixes a bug where an evicted batch job would not be rescheduled
once resources become available.

Closes #9890