
Reschedule attempts limit is not respected when there are placement failures #12147

Closed
avandierast opened this issue Feb 28, 2022 · 16 comments · Fixed by #12319

Labels: hcc/cst (Admin - internal), stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/restart/reschedule, type/bug

@avandierast

Nomad version

╰─ nomad version
Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)

Operating system and Environment details

Fedora 35 with nomad agent --dev
Also reproduced on Ubuntu 20.04 with a Nomad server and client on two separate VMs.

Issue

When there is a placement failure for an allocation after a reschedule, the task group's reschedule attempts limit is not respected at all.
We have this problem with parameterized batch jobs; we don't know whether other job types are affected.

Reproduction

  • Run agent:
sudo nomad agent -dev -bind 0.0.0.0
  • Run a parameterized batch job with restart attempts = 0 and reschedule attempts > 0 that exhausts your resources and always fails.
    For example, use the following job definition (on a computer with less than 20 GB of RAM, so the node is exhausted by the 10 GB memory reservation):
job "example" {

  parameterized {
  }

  datacenters = ["dc1"]
  type = "batch"

  group "example" {

    reschedule {
      attempts       = 1
      interval       = "24h"
      unlimited      = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }

    task "example" {
      driver = "docker"

      config {
        image = "alpine"
        entrypoint = ["/bin/sleep", "10", "&&", "false"]
      }

      resources {
        memory = 10000
      }

    }
  }
}
nomad job run example.nomad
  • Dispatch that job several times to cause placement failures on reschedule.
nomad job dispatch example & nomad job dispatch example & nomad job dispatch example & nomad job dispatch example

Expected Result

The tasks are rescheduled no more than the configured number of attempts and then die, regardless of any placement failures between the attempts.
For our example, we expect the container to run only twice per dispatch.

Actual Result

The tasks are rescheduled many times before dying, and sometimes they never die.

We can see the problem easily in the Nomad UI, in the job's evaluations tab.
The job has a reschedule limit of 1 attempt per 24h, but the dispatched job has 4 failures (alloc-failure evaluations with Placement Failures set to false):

(Screenshot: Nomad UI evaluations list, 2022-02-24 11-25-41)

@Amier3
Contributor

Amier3 commented Mar 2, 2022

Hey @avandierast

Thanks for bringing this to our attention! We'll look into this and let you know what we find 👍

@tgross
Member

tgross commented Mar 7, 2022

Hi @avandierast! I just tested this myself and wasn't able to reproduce any unexpected behavior. There are two things that might be at play here:

  • The first is the easy one: the failed dispatched jobs don't get GC'd until they reach the job_gc_threshold. See Stopped periodic batch job clutters job list #11256. So when you're looking at the eval list in the UI, make sure those "extra" evals you're seeing are actually for the correct dispatched job. If you use nomad eval list it should make that fairly clear.

  • The second is a little more complicated. There's a subtle difference between rescheduling and re-evaluating that isn't particularly clear in the docs. So I just want to make sure I understand which of the two you're seeing. Supposing I have clients with only 1000MB of memory available, if I try to dispatch the following job:

jobspec
job "example" {

  parameterized {
  }

  datacenters = ["dc1"]
  type        = "batch"

  group "example" {

    reschedule {
      attempts  = 1
      interval  = "24h"
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }

    task "example" {
      driver = "docker"

      config {
        image      = "alpine"
        entrypoint = ["/bin/sleep", "10", "&&", "false"]
      }

      resources {
        memory = 10000
      }

    }
  }
}

I'll end up with the following results:

$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1646687115-e773ac03
Evaluation ID     = e09ee549

==> 2022-03-07T16:05:16-05:00: Monitoring evaluation "e09ee549"
    2022-03-07T16:05:16-05:00: Evaluation triggered by job "example/dispatch-1646687115-e773ac03"
==> 2022-03-07T16:05:17-05:00: Monitoring evaluation "e09ee549"
    2022-03-07T16:05:17-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-03-07T16:05:17-05:00: Evaluation "e09ee549" finished with status "complete" but failed to place all allocations:
    2022-03-07T16:05:17-05:00: Task Group "example" (failed to place 1 allocation):
      * Resources exhausted on 3 nodes
      * Class "vagrant" exhausted on 3 nodes
      * Dimension "memory" exhausted on 3 nodes
    2022-03-07T16:05:17-05:00: Evaluation "2e310f97" waiting for additional capacity to place remainder

This evaluation created a "blocked eval" (the eval 2e310f97) and the scheduler workers will re-evaluate it whenever resources have changed on the cluster. This blocked eval is not subject to the reschedule limit because the allocation has never been placed!

The reschedule block only impacts allocations that have actually been placed on a host and then subsequently failed. So if we change the resources block to the following:

resources {
  memory = 100
}

Then I get the following dispatched job:

$ nomad job status example/dispatch-1646687560-7c3ee652
ID            = example/dispatch-1646687560-7c3ee652
Name          = example/dispatch-1646687560-7c3ee652
Submit Date   = 2022-03-07T16:12:40-05:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
example     0       0         0        2       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
1617cbdc  ce37014d  example     0        run      failed  38s ago  37s ago
35b7559e  a1ee6964  example     0        stop     failed  43s ago  38s ago

So there's exactly one rescheduled allocation, and the allocation status for that final alloc looks like the following. Note the "Reschedule Attempts" line.

$ nomad alloc status 1617
ID                  = 1617cbdc-5cd9-9b55-c0c1-3ce3f48781ee
Eval ID             = 54f5f9d3
Name                = example/dispatch-1646687560-7c3ee652.example[0]
Node ID             = ce37014d
Node Name           = nomad-client1
Job ID              = example/dispatch-1646687560-7c3ee652
Job Version         = 0
Client Status       = failed
Client Description  = Failed tasks
Desired Status      = run
Desired Description = <none>
Created             = 1m10s ago
Modified            = 1m9s ago
Reschedule Attempts = 1/1

Task "example" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/100 MiB  300 MiB

Task Events:
Started At     = 2022-03-07T21:12:46Z
Finished At    = 2022-03-07T21:12:46Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2022-03-07T16:12:46-05:00  Not Restarting  Policy allows no restarts
2022-03-07T16:12:46-05:00  Terminated      Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-03-07T16:12:46-05:00  Started         Task started by client
2022-03-07T16:12:46-05:00  Driver          Downloading image
2022-03-07T16:12:46-05:00  Task Setup      Building Task Directory
2022-03-07T16:12:46-05:00  Received        Task received by client

My evaluations list looks like the following:

$ nomad eval list
ID        Priority  Triggered By    Job ID                                Status    Placement Failures
54f5f9d3  50        alloc-failure   example/dispatch-1646687560-7c3ee652  complete  false
929282a3  50        alloc-failure   example/dispatch-1646687560-7c3ee652  complete  false
4992c684  50        job-register    example/dispatch-1646687560-7c3ee652  complete  false

@avandierast
Author

avandierast commented Mar 9, 2022

Hello @tgross, thanks for your help analysing this problem :)

The screenshot I pasted at the end of my first post is the equivalent of nomad eval list for one of the dispatched jobs.
Here is how I interpret the eval list; tell me if I'm wrong:

  • when triggered-by == alloc-failure and placement-failures == true, the alloc is queued and there is a re-evaluation.
  • when triggered-by == alloc-failure and placement-failures == false, the job has failed and will therefore be rescheduled.

If this is correct, my screenshot shows 3 reschedules even though my job is configured to allow only one, and thus the task's underlying Docker container runs 3 times.
The first time we encountered this problem, on a more complex infrastructure, the Docker container was re-run indefinitely.

We see this behaviour when resources are scarce and the evaluation after a reschedule fails. If the placements succeed twice in a row, the job stops correctly after two failures.

If my example with 4 simultaneous dispatched jobs is not enough to produce a failed evaluation after a reschedule, I think running more than 4 (maybe 10 or even more) should trigger the problem.

@tgross
Member

tgross commented Mar 14, 2022

Here is how I interpret the eval list; tell me if I'm wrong:

  • when triggered-by == alloc-failure and placement-failures == true, the alloc is queued and there is a re-evaluation.
  • when triggered-by == alloc-failure and placement-failures == false, the job has failed and will therefore be rescheduled.

Pretty close! Any time an allocation fails and it's eligible for a reschedule (ref node_endpoint.go#L1150 and structs.go#L9580-L9600), an evaluation is queued. "Eligible for reschedule" is doing a bit of work in that sentence, but it includes a job that hasn't been rescheduled more than its reschedule.attempts and isn't a system job.

So in both those cases there's been an allocation that failed and an evaluation created, which goes into the scheduler. The placement-failures value is probably an unfortunate name, because we often use "placement failure" to mean the placement on the client. But what that value actually reflects is the scheduler placement failure. As in, the scheduler tried to find a place for the allocation and could not. So it creates a new "blocked" eval, and those are the evaluations that are shown in that UI view as queued-allocs.

That means that a single allocation reschedule attempt could have 3 (or more!) evaluations and they can transition between different states. The timeline might look like:

  • The allocation fails.
  • The client sends a (*Node) UpdateAlloc RPC to the server.
  • The server checks the reschedule policy and if there are attempts remaining, it'll enqueue an eval with triggered-by alloc-failure
    • This will start with placement-failures false, because it hasn't been processed yet!
  • Then a scheduler worker will process it. Let's suppose we can't place all the allocations, which means its placement-failures gets set to true.
  • The scheduler creates a new blocked eval (with triggered-by queued-allocs).
    • That eval might succeed, so its placement-failures will be false.
    • Or it might fail to place and have placement-failures true, at which point another blocked evaluation will get created.

That's 3 evaluations for a single reschedule check. It isn't until the new allocation is created and placed that we treat it as one "attempt" for purposes of reschedule.attempts.
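
To make the "eligible for reschedule" check above concrete, here is a rough Go sketch of the kind of bookkeeping involved. This is a simplified model with invented helper and field names, not Nomad's actual scheduler code:

package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the reschedule policy and tracker; the field
// names here are illustrative only.
type ReschedulePolicy struct {
	Attempts  int
	Interval  time.Duration
	Unlimited bool
}

type RescheduleEvent struct {
	RescheduleTime time.Time
}

type RescheduleTracker struct {
	Events []RescheduleEvent
}

// eligibleForReschedule models the server-side check: a failed alloc is only
// rescheduled again if the number of reschedule events recorded inside the
// policy interval is still below the attempts limit.
func eligibleForReschedule(p ReschedulePolicy, t *RescheduleTracker, now time.Time) bool {
	if p.Unlimited {
		return true
	}
	if t == nil {
		// No tracker means no recorded attempts.
		return p.Attempts > 0
	}
	attemptsInWindow := 0
	for _, ev := range t.Events {
		if now.Sub(ev.RescheduleTime) <= p.Interval {
			attemptsInWindow++
		}
	}
	return attemptsInWindow < p.Attempts
}

func main() {
	policy := ReschedulePolicy{Attempts: 1, Interval: 24 * time.Hour}
	now := time.Now()

	// First failure: no events recorded yet, so a reschedule is allowed.
	fmt.Println(eligibleForReschedule(policy, &RescheduleTracker{}, now)) // true

	// Second failure: one event inside the interval, so the limit is reached.
	used := &RescheduleTracker{Events: []RescheduleEvent{{RescheduleTime: now.Add(-time.Minute)}}}
	fmt.Println(eligibleForReschedule(policy, used, now)) // false

	// If the replacement alloc never received the tracker, the check passes
	// again, which matches the looping behaviour reported in this issue.
	fmt.Println(eligibleForReschedule(policy, nil, now)) // true
}

The last case is the important one: if the reschedule history is not carried over to the replacement allocation, the attempts limit can never be reached.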

@avandierast
Author

Thanks for the detailed explanation :)

Yes, I see there are 3 evaluations in the list for one real attempt. I didn't mention the queued-allocs evals since, for my problem, they don't add much information after a placement-failures=true eval. If you look at my eval list in the first post, there are 10 evaluations: 3×3 + 1, i.e. 3 evaluations per attempt where placement failed, plus the last failing attempt.
That is the eval list for a single dispatched job, and I really need there to be only 2 attempts, not 4.

If this eval list isn't enough to show that there are more than 2 attempts, I can make another example where the container writes to a file so we can see that there are too many attempts.

The nomad configuration file for the agent:

data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"
bind_addr = "0.0.0.0"

datacenter = "dc1"

plugin "docker" {
  config {
    volumes {
        enabled = true
    }
  }
}

server {
  enabled = true
  bootstrap_expect = 1
}

client {
  enabled = true
  host_volume "host_files" {
    path      = "/opt/nomad/host_files"
    read_only = false
  }
}

The modified job file to have a container that writes to a file on the host:

job "example" {

  parameterized {
  }

  datacenters = ["dc1"]
  type = "batch"

  group "example" {

    reschedule {
      attempts       = 1
      interval       = "24h"
      unlimited      = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }

    volume "host_files" {
      type      = "host"
      read_only = false
      source    = "host_files"
    }

    task "example" {
      driver = "docker"

      volume_mount {
        volume      = "host_files"
        destination = "/host_files"
        read_only   = false
      }

      config {
        image = "alpine"
        entrypoint = ["/bin/sh", "-c", "echo ${NOMAD_JOB_ID} >> /host_files/data && sleep 10 && false"]
      }

      resources {
        memory = 10000
      }

    }
  }
}

My list of jobs:

╰─ nomad status
ID                                    Type                 Priority  Status   Submit Date
example                               batch/parameterized  50        running  2022-03-16T11:19:42+01:00
example/dispatch-1647425989-04c89364  batch                50        pending  2022-03-16T11:19:49+01:00
example/dispatch-1647425989-4e26f991  batch                50        pending  2022-03-16T11:19:49+01:00
example/dispatch-1647425989-5cbba383  batch                50        running  2022-03-16T11:19:49+01:00
example/dispatch-1647425989-bd4607e2  batch                50        pending  2022-03-16T11:19:49+01:00

And after some time with the 4 dispatched jobs, the file on the host contains:

╰─ cat /opt/nomad/host_files/data
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364

And I can let it run for a long time; it will continue to make attempts and write job IDs to the file...

@tgross
Member

tgross commented Mar 16, 2022

If this eval list isn't enough to show that there are more than 2 attempts, I can make another example where the container writes to a file so we can see that there are too many attempts.

It might be, but it's hard to map the screenshot to jobs without also having the list of allocations for the job. So the example you've given is very helpful.

And after some time with the 4 dispatched jobs, the file on the host contains:

Ok, that's very interesting; I see you've got 7 allocation runs per job in that output. To try to replicate, I took your jobspec and I modified it in three small ways:

  • I had to change the volume.host_files.source for my cluster.
  • I changed the resources.memory to fit on my client (otherwise dispatch always fails).
  • I changed the shell script to write both the job ID and the alloc ID so we can verify these are reschedules and not simply restarts (I see you have restart.attempts = 0, but hey as long as we suspect there's buggy behavior anyways we should make sure it's not there).
jobspec
job "example" {

  parameterized {
  }

  datacenters = ["dc1"]
  type        = "batch"

  group "example" {

    reschedule {
      attempts  = 1
      interval  = "24h"
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }

    volume "host_files" {
      type      = "host"
      read_only = false
      source    = "shared_data"
    }

    task "example" {
      driver = "docker"

      volume_mount {
        volume      = "host_files"
        destination = "/host_files"
        read_only   = false
      }

      config {
        image      = "alpine"
        entrypoint = ["/bin/sh", "-c", "echo \"job ${NOMAD_JOB_ID} - alloc ${NOMAD_ALLOC_ID}\" >> /host_files/data && sleep 10 && false"]
      }

      resources {
        memory = 128
      }

    }
  }
}

I register that job and dispatch it 4 times. I end up with the expected list of job status and evaluations:

$ nomad job status
ID                                    Type                 Priority  Status   Submit Date
example                               batch/parameterized  50        running  2022-03-16T09:17:58-04:00
example/dispatch-1647436692-731197a9  batch                50        dead     2022-03-16T09:18:12-04:00
example/dispatch-1647436697-5400bcb9  batch                50        dead     2022-03-16T09:18:17-04:00
example/dispatch-1647436699-06971e3d  batch                50        dead     2022-03-16T09:18:19-04:00
example/dispatch-1647436701-3cd48692  batch                50        dead     2022-03-16T09:18:21-04:00

$ nomad eval list
ID        Priority  Triggered By   Job ID                                Status    Placement Failures
a2d9100a  50        alloc-failure  example/dispatch-1647436701-3cd48692  complete  false
50fd4401  50        alloc-failure  example/dispatch-1647436701-3cd48692  complete  false
5abb936f  50        alloc-failure  example/dispatch-1647436699-06971e3d  complete  false
099099c8  50        alloc-failure  example/dispatch-1647436699-06971e3d  complete  false
8ead57e4  50        alloc-failure  example/dispatch-1647436697-5400bcb9  complete  false
d2d2febc  50        alloc-failure  example/dispatch-1647436697-5400bcb9  complete  false
23c826b6  50        alloc-failure  example/dispatch-1647436692-731197a9  complete  false
369f22c9  50        alloc-failure  example/dispatch-1647436692-731197a9  complete  false
cd9872da  50        job-register   example/dispatch-1647436701-3cd48692  complete  false
3179f8f8  50        job-register   example/dispatch-1647436699-06971e3d  complete  false
d8a5d1a2  50        job-register   example/dispatch-1647436697-5400bcb9  complete  false
6720a21d  50        job-register   example/dispatch-1647436692-731197a9  complete  false

And the contents of the data file are exactly what I'd expect. 2 allocations run for each job:

job example/dispatch-1647436692-731197a9 - alloc 3cdff6bf-181b-b914-558b-d6083d34c78d
job example/dispatch-1647436697-5400bcb9 - alloc 44e49152-57cf-75bd-ad0d-0855e560432c
job example/dispatch-1647436699-06971e3d - alloc d92df661-489f-2000-84d4-3280bf9d9a76
job example/dispatch-1647436701-3cd48692 - alloc 860a6808-10ed-04f5-9758-aba6f200b4e6
job example/dispatch-1647436692-731197a9 - alloc 09fbb62e-b2a1-5e22-eb9e-22174e9fc2d9
job example/dispatch-1647436697-5400bcb9 - alloc c04bdf71-d196-ce15-6f63-3f37c3e18390
job example/dispatch-1647436699-06971e3d - alloc 4dd0ac98-cb55-9cda-2570-41614bedbf71
job example/dispatch-1647436701-3cd48692 - alloc 65857c11-ebd6-7aa4-1969-f39ad86a0890

I've double checked this with both Nomad 1.2.6 and the current main and get the same result reliably. I'm assuming your Nomad binary is the one packaged by HashiCorp, and not something from your distros, right? Although I have no reason to believe it makes a difference, I've also tried with a standalone single server, in -dev mode, and with a full 3 server cluster + client. Works every time. 😕

But it sounds like you're seeing this problem consistently at your end too, which suggests it's not some weird concurrency problem.

The job never ends up getting marked as dead? Can you run with debug logs and then run a single dispatch, and watch the logs until it's done at least one attempt it shouldn't have? Maybe there's something in the logs that'll give us a clue here.

@avandierast
Author

avandierast commented Mar 16, 2022

I have also tried today on a MacBook Pro with the official Nomad binary, and it looped as expected 😅

What strikes me about your nomad eval list output is that you never have any placement failures, and the bug only appears when there is a placement failure after a failed attempt.
I set the memory limit to 10 GB in my job because it is more than half of what my computer has, so memory is exhausted after the first dispatched job is placed and the evaluations for the three other dispatched jobs fail.

I used your job entrypoint. I dispatched one job alone and it worked well. Then I dispatched 4 at the same time and got back the infinite loop of attempts: they never die.

nomad status
╰─ nomad status
ID                                    Type                 Priority  Status   Submit Date
example                               batch/parameterized  50        running  2022-03-16T16:52:06+01:00
example/dispatch-1647445933-dccf0f7f  batch                50        dead     2022-03-16T16:52:13+01:00
example/dispatch-1647445986-4b40a141  batch                50        pending  2022-03-16T16:53:06+01:00
example/dispatch-1647445986-51ccfdf4  batch                50        pending  2022-03-16T16:53:06+01:00
example/dispatch-1647445986-604d27ad  batch                50        running  2022-03-16T16:53:06+01:00
example/dispatch-1647445986-76c80d9d  batch                50        pending  2022-03-16T16:53:06+01:00

We can see that it worked well when the job was dispatched once, but not when there were 4:

result:
╰─ cat /opt/nomad/host_files/data
job example/dispatch-1647445933-dccf0f7f - alloc 62f2427f-2aec-a604-2e73-00e6c65d545c
job example/dispatch-1647445933-dccf0f7f - alloc e3cfda26-231b-4077-2216-c8cc26e8254d
job example/dispatch-1647445986-4b40a141 - alloc 5f51afc4-9f92-4696-a8d9-bb0a9387538b
job example/dispatch-1647445986-604d27ad - alloc 4a5b6487-60d1-7532-1d8e-d081b2d0eee0
job example/dispatch-1647445986-76c80d9d - alloc e8fe54a2-90bf-c7c0-cce6-d77a7346d5d3
job example/dispatch-1647445986-51ccfdf4 - alloc 69d7152c-dbb7-d999-7fe4-98cd5c9b5782
job example/dispatch-1647445986-4b40a141 - alloc 783ee4fa-4cfe-7d44-a464-51aad001e02e
job example/dispatch-1647445986-604d27ad - alloc e650e876-318c-9f46-2c4b-e8534802acd7
job example/dispatch-1647445986-76c80d9d - alloc 548ab502-4ca9-0210-c5d6-25d5d77ba0e6
job example/dispatch-1647445986-51ccfdf4 - alloc c883608d-b64c-a518-7ab8-f899e1c97d48
job example/dispatch-1647445986-4b40a141 - alloc ef930cb2-0cfd-d8e7-5a7b-5183be396438
job example/dispatch-1647445986-604d27ad - alloc f4e34f11-2a87-aa9c-15b3-c037eb607c1d
job example/dispatch-1647445986-76c80d9d - alloc 1b10f8fe-112a-fbc1-04c2-7692dcd946e3
job example/dispatch-1647445986-51ccfdf4 - alloc 9a1cd351-79b7-5966-999a-e391982e547e
job example/dispatch-1647445986-4b40a141 - alloc 436306c8-9d31-3385-354a-9a0f43eb4047
job example/dispatch-1647445986-604d27ad - alloc 7f4a4a1f-46aa-91c4-dc92-6cd014c1159a
job example/dispatch-1647445986-76c80d9d - alloc fa1beb45-c43d-1081-fb88-0c6fbb0dcc41
job example/dispatch-1647445986-51ccfdf4 - alloc 728b23e2-1208-bb83-5476-f454fd9f5e49
job example/dispatch-1647445986-604d27ad - alloc 8ea8ede7-3ec5-8cd4-99b2-cc0da5d5cab4
job example/dispatch-1647445986-4b40a141 - alloc 92b97b6a-cfae-79d3-fc79-9491516a0e43
job example/dispatch-1647445986-76c80d9d - alloc f0d110b6-0984-15f1-e4fa-35edfeebed9e
job example/dispatch-1647445986-51ccfdf4 - alloc 2a5725ee-7c22-4c66-25f8-c4121da3fd41
job example/dispatch-1647445986-604d27ad - alloc c56336ab-4043-3eb6-5b1f-f1bfce020605
job example/dispatch-1647445986-4b40a141 - alloc f6c81180-4603-a54e-6021-a5bf2b257e91
job example/dispatch-1647445986-76c80d9d - alloc a564f0a6-2337-70ef-0dc4-431caa4f877e
job example/dispatch-1647445986-51ccfdf4 - alloc c0282383-39c3-9721-9ecb-5806a1377800
job example/dispatch-1647445986-604d27ad - alloc 384564b4-ebb7-775f-bc7a-8636a4cd6a19

I've attached the Nomad logs in case you can find anything useful for understanding what is happening:
nomad.log

I'm sorry, I closed the Nomad agent and forgot to run nomad eval list :(

@tgross
Member

tgross commented Mar 17, 2022

What strikes me about your nomad eval list output is that you never have any placement failures, and the bug only appears when there is a placement failure after a failed attempt.
I set the memory limit to 10 GB in my job because it is more than half of what my computer has, so memory is exhausted after the first dispatched job is placed and the evaluations for the three other dispatched jobs fail.

Ah, yes, that's what I've been missing! I was finally able to reproduce by setting memory = 1200 (my node has 1983MB available) and then dispatching a handful in a row without waiting. I end up with lots and lots of evals, and jobs that get rescheduled multiple times:

$ nomad job status example/dispatch-1647539230-3e910a86
ID            = example/dispatch-1647539230-3e910a86
Name          = example/dispatch-1647539230-3e910a86
Submit Date   = 2022-03-17T13:47:10-04:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
example     1       0         0        4       0         0

Placement Failure
Task Group "example":
  * Resources exhausted on 1 nodes
  * Dimension "memory" exhausted on 1 nodes

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created    Modified
05cdd5cd  8b4cb8ac  example     0        stop     failed  23s ago    7s ago
1f785e6a  8b4cb8ac  example     0        stop     failed  1m17s ago  1m1s ago
fc650e09  8b4cb8ac  example     0        stop     failed  2m10s ago  1m55s ago
1b4b7597  8b4cb8ac  example     0        stop     failed  3m4s ago   2m49s ago

I went through the logs you provided with a fine-toothed comb and there's enough interleaving to make it hard to see where the issue is. Fortunately, I was able to reproduce with just two dispatches. I've broken that down by hand into a detailed walk through the life of the evaluations (which I've left at the bottom of the page in a <details> block because there's quite a bit of it).

The tl;dr of all that is that when we process an eval for a rescheduleable alloc, we create a "follow-up" eval. If that eval ends up getting blocked, it looks like we're losing track of reschedules at that point. When that happens and we get the (*Node) UpdateAlloc for the failed allocation, the check for rescheduling eligibility (ref node_endpoint.go#L1150 and structs.go#L9580-L9600) passes when it should not. This suggests to me that the RestartTracker that we attach to the allocation is getting wiped somehow. I haven't figured that out yet and I'm getting towards the end of my day, so I'll have to pick up the investigation from there tomorrow.

In any case, I suspect that the reason this is getting into a loop is because the two (or more) dispatched jobs just end up interleaving evaluations and so they keep knocking each other into the bad state.

(new job dispatch sent)

Evals in heap:

  • queued (Eval 80c3e7d2 / Dispatch01 / job-register)

T+11.643Z scheduler dequeues 80c3e7d2
T+11.647Z scheduler processes 80c3e7d2 results=(place 1)
T+11.650Z scheduler submits plan for 80c3e7d2
T+11.651Z scheduler sets status for 80c3e7d2 to complete
T+11.653Z scheduler acks 80c3e7d2
T+11.657Z client starts task (alloc 271a5bc7 Dispatch01 attempt 0)

(new job dispatch sent)

Evals in heap:

  • queued (Eval ba4f9726 / Dispatch02 / job-register)

T+12.185Z scheduler dequeues ba4f9726
T+12.185Z scheduler processes ba4f9726 results=(place 1)
T+12.190Z scheduler fails to place all allocs for ba4f9726 and creates blocked eval 14b2581c

Evals in heap:

  • blocked (Eval 14b2581c / Dispatch02 / queued-allocs)
  • processing (Eval ba4f9726 / Dispatch02 / job-register)

T+12.190Z scheduler sets status for ba4f9726 to complete
T+12.191Z scheduler acks ba4f9726

Evals in heap:

  • blocked (Eval 14b2581c / Dispatch02 / queued-allocs)

T+22.335Z task fails, client does not restart (alloc 271a5bc7 Dispatch01 attempt 0)
T+22.548Z server gets update and creates new eval b31c782b for reschedule

Evals in heap:

  • queued (Eval b31c782b / Dispatch01 / alloc-failure)
  • blocked (Eval 14b2581c / Dispatch02 / queued-allocs)

T+22.550Z scheduler dequeues b31c782b
T+22.551Z scheduler processes b31c782b results=(ignore 1)
T+22.553Z scheduler found rescheduleable allocs for b31c782b creates followup eval b2d66c9e

Evals in heap:

  • follow-up (Eval b2d66c9e / Dispatch01 / alloc-failure)
  • processing (Eval b31c782b / Dispatch01 / alloc-failure)
  • blocked (Eval 14b2581c / Dispatch02 / queued-allocs)

T+22.556Z scheduler submits plan for b31c782b
T+22.557Z scheduler sets status for b31c782b to complete
T+22.558Z scheduler acks b31c782b

Evals in heap:

  • follow-up (Eval b2d66c9e / Dispatch01 / alloc-failure)
  • blocked (Eval 14b2581c / Dispatch02 / queued-allocs)

T+22.559Z scheduler dequeues 14b2581c (the previously blocked Dispatch02 / queued-allocs)
T+22.559Z scheduler processes 14b2581c results=(place 1)
T+22.564Z scheduler submits plan for 14b2581c
T+22.564Z scheduler sets status for 14b2581c to complete
T+22.566Z scheduler acks 14b2581c

Evals in heap:

  • follow-up (Eval b2d66c9e / Dispatch01 / alloc-failure)

T+22.569Z client starts task (alloc 160fe342 Dispatch02 attempt 0)
T+27.335Z scheduler dequeues b2d66c9e (follow-up for Dispatch01 / alloc-failure)
T+27.336Z scheduler processes b2d66c9e results=(place 1) (stop 1)
T+27.340Z scheduler fails to place all allocs for b2d66c9e and creates blocked eval 48378aed

Evals in heap:

  • processing (Eval b2d66c9e / Dispatch01 / alloc-failure)
  • blocked (Eval 48378aed / Dispatch01 / queued-allocs)

T+27.342Z scheduler submits plan for b2d66c9e
T+27.342Z scheduler sets status for b2d66c9e to complete
T+27.343Z scheduler acks b2d66c9e

Evals in heap:

  • blocked (Eval 48378aed / Dispatch01 / queued-allocs)

T+33.246Z task fails, client does not restart (alloc 160fe342 Dispatch02 attempt 0)
T+33.308Z server gets update and creates new eval c4236b42 for reschedule

Evals in heap:

  • queued (Eval c4236b42 / Dispatch02 / alloc-failure)
  • blocked (Eval 48378aed / Dispatch01 / queued-allocs)

T+33.309Z scheduler dequeues c4236b42
T+33.309Z scheduler processes c4236b42 results=(ignore 1)
T+33.309Z scheduler found rescheduleable allocs for c4236b42 creates followup eval 54c731a8

Evals in heap:

  • follow-up (Eval 54c731a8 / Dispatch02 / alloc-failure)
  • processing (Eval c4236b42 / Dispatch02 / alloc-failure)
  • blocked (Eval 48378aed / Dispatch01 / queued-allocs)

T+33.310Z scheduler submits plan for c4236b42
T+33.310Z scheduler sets status for c4236b42 to complete
T+33.310Z scheduler acks c4236b42

Evals in heap:

  • follow-up (Eval 54c731a8 / Dispatch02 / alloc-failure)
  • blocked (Eval 48378aed / Dispatch01 / queued-allocs)

T+33.310Z scheduler dequeues 48378aed
T+33.310Z scheduler processes 48378aed results=(place 1)
T+33.310Z scheduler submits plan for 48378aed
T+33.310Z scheduler sets status for 48378aed to complete
T+33.311Z scheduler acks 48378aed

Evals in heap:

  • follow-up (Eval 54c731a8 / Dispatch02 / alloc-failure)

T+33.317Z client starts task (alloc 584d2282 Dispatch01 attempt 1)
T+38.246Z scheduler dequeues 54c731a8
T+38.247Z scheduler processes 54c731a8 results=(place 1) (stop 1)
T+38.250Z scheduler fails to place all allocs for 54c731a8 and creates blocked eval 9ee89c4e

Evals in heap:

  • processing (Eval 54c731a8 / Dispatch02 / alloc-failure)
  • blocked (Eval 9ee89c4e / Dispatch02 / queued-allocs)

T+38.251Z scheduler submits plan for 54c731a8
T+38.251Z scheduler sets status for 54c731a8 to complete
T+38.252Z scheduler acks 54c731a8

Evals in heap:

  • blocked (Eval 9ee89c4e / Dispatch02 / queued-allocs)

T+43.921Z task fails, client does not restart (alloc 584d2282 Dispatch01 attempt 1)
T+44.067Z server gets update and creates new eval f23ebc2a for reschedule <--- this is wrong!!!

Evals in heap:

  • queued (Eval f23ebc2a / Dispatch01 / alloc-failure)
  • blocked (Eval 9ee89c4e / Dispatch02 / queued-allocs)

T+44.069Z scheduler dequeues f23ebc2a
T+44.069Z scheduler processes f23ebc2a results=(ignore 1)
T+44.070Z scheduler found rescheduleable allocs for f23ebc2a creates followup eval 0a43a152

Evals in heap:

  • processing (Eval f23ebc2a / Dispatch01 / alloc-failure)
  • follow-up (Eval 0a43a152 / Dispatch01 / alloc-failure)
  • blocked (Eval 9ee89c4e / Dispatch02 / queued-allocs)

T+44.071Z scheduler submits plan for f23ebc2a
T+44.072Z scheduler sets status for f23ebc2a to complete
T+44.073Z scheduler acks f23ebc2a

Evals in heap:

  • follow-up (Eval 0a43a152 / Dispatch01 / alloc-failure)
  • blocked (Eval 9ee89c4e / Dispatch02 / queued-allocs)

T+44.073Z scheduler dequeues 9ee89c4e
T+44.074Z scheduler processes 9ee89c4e results=(place 1)
T+44.077Z scheduler submits plan for 9ee89c4e
T+44.077Z scheduler sets status for 9ee89c4e to complete
T+44.078Z scheduler acks evaluation triggered_by=queued-allocs

Evals in heap:

  • follow-up (Eval 0a43a152 / Dispatch01 / alloc-failure)
  • blocked (Eval 9ee89c4e / Dispatch02 / queued-allocs)

T+44.083Z client starts task (alloc e3ded3d6 Dispatch02 attempt 1)

(and so on)

@tgross
Member

tgross commented Mar 17, 2022

@avandierast I think I've got enough here to try to write a unit test in the scheduler package that'll reproduce this. I'll make an attempt at that tomorrow (assuming no other interruptions, of course 😁 )

@tgross
Member

tgross commented Mar 18, 2022

Ok, I've got a failing unit test in #12319. While working on that I realized this bug is likely to impact a feature that my colleague @DerekStrickland is currently working on, so I'm going to pair up with him to work on a solution starting early next week.

@avandierast
Author

Hello @tgross. Happy to hear that you were able to reproduce :)

@next-jesusmanuelnavarro

I think I may have found a similar situation that leads to the opposite result. In my case it is also a batch job (meant to run daily).

The job "hangs" at the templating step (in my case, because it can't find a vault-based secret which is not there) and never retries.

I mean:

  1. On the first run, it tries to render the template with a Vault secret (some/path/secret) and stays in the templating status, never failing.
  2. If I then create the 'some/path/secret' secret in Vault, the batch job doesn't launch 24 hours later because of the first "blocked" run.

Can this behaviour be another aspect of this issue?

The restart stanza from my job follows (while testing this I set the job to run every 5 minutes). The current restart mode is "fail", but I also tested with mode = "delay" with the same results:

periodic {
	cron             = "*/5 * * * *"
	prohibit_overlap = "true"
}
(...)
restart {
	interval = "5m"
	attempts = 3
	delay    = "30s"
	mode     = "fail"
}

@tgross
Member

tgross commented Apr 6, 2022

Hi @next-jesusmanuelnavarro! The restart block is all about restarting on the same client without going back to the scheduler. This issue is about the reschedule block, which is what happens after we give up on restart and have to go back to the scheduler. So that's unrelated and I've split it out to #12478

@avandierast the bug didn't turn out to be related to Derek's work, which just got merged. But we had an internal discussion about the bug and I think there's a couple approaches we can try, so I'll try to circle back to this issue shortly.

@tgross tgross removed their assignment Jun 6, 2022
@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline commitment though. label Jun 6, 2022
@louievandyke louievandyke added the hcc/cst Admin - internal label Dec 1, 2022
@tgross tgross self-assigned this May 28, 2024
@tgross
Member

tgross commented May 31, 2024

Quick update on this. I've got a working patch at #12319, but there's a lingering issue described in #20462 (comment) which may be interrelated and I want to make sure I've done my due diligence here before trying to close this out.

tgross added a commit that referenced this issue Jun 4, 2024
When an allocation fails it triggers an evaluation. The evaluation is processed
and the scheduler sees it needs to reschedule, which triggers a follow-up
eval. The follow-up eval creates a plan to `(stop 1) (place 1)`. The replacement
alloc has a `RescheduleTracker` (or gets its `RescheduleTracker` updated).

But in the case where the follow-up eval can't place all allocs (there aren't
enough resources), it can create a partial plan to `(stop 1) (place 0)`. It then
creates a blocked eval. The plan applier stops the failed alloc. Then when the
blocked eval is processed, the job is missing an allocation, so the scheduler
creates a new allocation. This allocation is _not_ a replacement from the
perspective of the scheduler, so it's not handed off a `RescheduleTracker`.

This changeset fixes this by annotating the reschedule tracker whenever the
scheduler can't place a replacement allocation. We check this annotation for
allocations that have the `stop` desired status when filtering out allocations
to pass to the reschedule tracker. I've also included tests that cover this case
and expand coverage of the relevant area of the code.

Fixes: #12147
Fixes: #17072
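
As a rough illustration of the approach described in the commit message above, here is a small Go sketch. The type, field, and function names (FollowupPlacementFailed, markUnplacedReplacement, trackerSources) are hypothetical and simplified; this is not the actual Nomad scheduler code:

package main

import "fmt"

// Simplified stand-in for a scheduler-side allocation record.
type Allocation struct {
	ID            string
	DesiredStatus string // "run" or "stop"
	ClientStatus  string // e.g. "failed"
	// Hypothetical annotation: set when this alloc was stopped by a partial
	// plan while its replacement could not be placed.
	FollowupPlacementFailed bool
}

// markUnplacedReplacement models the fix: when a plan stops a failed alloc
// but places nothing, annotate the stopped alloc instead of forgetting it.
func markUnplacedReplacement(stopped *Allocation) {
	stopped.FollowupPlacementFailed = true
}

// trackerSources models the filtering step: which previous allocations should
// contribute reschedule history to a newly placed allocation?
func trackerSources(allocs []*Allocation) []*Allocation {
	var out []*Allocation
	for _, a := range allocs {
		if a.ClientStatus != "failed" {
			continue
		}
		// Before the fix, allocs already marked "stop" were filtered out here,
		// so the placement made by the later blocked eval started with an
		// empty reschedule tracker.
		if a.DesiredStatus == "stop" && !a.FollowupPlacementFailed {
			continue
		}
		out = append(out, a)
	}
	return out
}

func main() {
	failed := &Allocation{ID: "584d2282", DesiredStatus: "stop", ClientStatus: "failed"}

	// Without the annotation, the failed alloc's reschedule history is lost.
	fmt.Println(len(trackerSources([]*Allocation{failed}))) // 0

	// With the annotation, the blocked eval's placement still sees it and can
	// carry the reschedule tracker forward, so the attempts limit is enforced.
	markUnplacedReplacement(failed)
	fmt.Println(len(trackerSources([]*Allocation{failed}))) // 1
}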
@tgross
Member

tgross commented Jun 4, 2024

This issue should be resolved in #12319

@tgross tgross added this to the 1.8.1 milestone Jun 4, 2024
tgross added a commit that referenced this issue Jun 10, 2024

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 27, 2024