System job with constrains fails to plan #12748

Open
chilloutman opened this issue Apr 22, 2022 · 17 comments
Labels
hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/scheduling theme/system-scheduler type/bug

Comments

@chilloutman

Nomad version

v1.2.6

(Nomad v1.2.6 has the problem described below, while Nomad v1.1.5 works as expected.)

Operating system and Environment details

Nomad nodes are running Ubuntu. Docker driver is used for all tasks.

A set of nodes has node.class set to worker, and there are a few other nodes in the cluster.

Issue

System job with constraints fails to plan.

Reproduction steps

A job with type = "system" is used to schedule tasks on the worker nodes. So the following constraint is added to the worker group:

constraint {
  attribute = "${node.class}"
  operator  = "="
  value     = "worker"
}

Expected Result

All the worker nodes should run the worker task, all other nodes should not.

Actual Result

This works sometimes, in particular when there are no allocations on the cluster. But running nomad job plan after allocations are running displays the following warning:

Scheduler dry-run:
- WARNING: Failed to place allocations on all nodes.
  Task Group "worker" (failed to place 1 allocation):
    * Class "entry": 1 nodes excluded by filter
    * Constraint "${node.class} = worker": 1 nodes excluded by filter

This should not be a warning, as the allocations match the job definition, considering the constraints.
nomad job run produces the desired state and the job state is displayed as “not scheduled” on all non-worker nodes.

Removing the constraints shows no warning, but obviously schedules the worker task on non-worker nodes, which is unwanted.

The only workarounds seem to be ignoring the warnings, which defeats the purpose of nomad job plan, or creating an entirely separate cluster for the workers.
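If the warnings have to be tolerated for now, one option is to tell apart the warnings that list only filter exclusions from those that list anything else. The following is a minimal Python sketch, assuming the dry-run text keeps the shape quoted in this thread; that shape is not a documented interface, and the function name `only_filter_exclusions` is hypothetical. It only classifies text and does not check whether allocations were actually placed.

```python
import re

# Reason lines as they appear in the output quoted in this thread, e.g.
#   * Class "entry": 1 nodes excluded by filter
#   * Constraint "${node.class} = worker": 1 nodes excluded by filter
# Assumption: this text format is stable enough to match on.
FILTER_REASON = re.compile(
    r'^\s*[*•]\s+(?:Class|Constraint) ".*": \d+ nodes excluded by filter\s*$'
)
# Header that opens a list of placement-failure reasons for one task group.
FAILED_GROUP = re.compile(r'Task Group ".*" \(failed to place \d+ allocations?\):\s*$')


def only_filter_exclusions(output: str) -> bool:
    """Return True if every listed placement-failure reason is a filter exclusion.

    Returns False when the output has no failure section, or when any
    reason is something other than a Class/Constraint filter line.
    """
    in_failure = False
    reasons = []
    for line in output.splitlines():
        if FAILED_GROUP.search(line):
            in_failure = True
            continue
        if in_failure:
            stripped = line.strip()
            if stripped.startswith(("*", "•")):
                reasons.append(line)
            elif stripped:
                in_failure = False  # a non-bullet line ends the reason list
    return bool(reasons) and all(FILTER_REASON.match(r) for r in reasons)
```

A wrapper script could run this over captured nomad job plan output and report a True result as informational rather than as a failure, while still surfacing every other kind of warning.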

Possibly related:

@cr0c0dylus

cr0c0dylus commented Apr 25, 2022

I'm facing the same problem (1.2.6):

Job: "stage-cron"
Task Group: "cron" (1 ignore)
Task: "cron"

Scheduler dry-run:

  • WARNING: Failed to place allocations on all nodes.
    Task Group "cron" (failed to place 1 allocation):
    • Constraint "${meta.env} = stage": 5 nodes excluded by filter

But if I stop the job before submitting a new one, it works as expected:

$ nomad job stop stage-cron
==> 2022-04-25T18:45:07+03:00: Monitoring evaluation "86e8c675"
2022-04-25T18:45:07+03:00: Evaluation triggered by job "stage-cron"
==> 2022-04-25T18:45:08+03:00: Monitoring evaluation "86e8c675"
2022-04-25T18:45:08+03:00: Evaluation status changed: "pending" -> "complete"
==> 2022-04-25T18:45:08+03:00: Evaluation "86e8c675" finished with status "complete"

$ nomad job plan ...

+/- Job: "stage-cron"
+/- Stop: "true" => "false"
Task Group: "cron" (1 create)
Task: "cron"

Scheduler dry-run:

  • All tasks successfully allocated.

@cr0c0dylus

I have found a temporary workaround. You need to add a 1.1.x server to the cluster and stop-start the 1.2.6 leaders until the 1.1.x server becomes the leader.

@tgross
Member

tgross commented May 2, 2022

Hi @chilloutman! This definitely seems like it could be related to #12016. I'm not going to mark it as a duplicate just in case it's not, but I'll cross-reference here so that whoever tackles that issue will see this as well. I don't have a good workaround for you other than to ignore warnings (they're warnings and not errors), but I realize that isn't ideal.

Just FYI @cr0c0dylus:

I have found a temporary workaround. You need to add 1.1.x server to the cluster and stop-start 1.2.6 leaders until 1.1.x becomes a leader.

This is effectively downgrading Nomad into a mixed-version cluster, which is not supported and highly likely to result in state store corruption. Doing so in order to suppress something that's only a warning is not advised.

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label May 2, 2022
@cr0c0dylus

Doing so in order to suppress something that's only a warning is not advised.

Unfortunately, it is not only a warning. It cannot allocate a job at all. Another trick is to change one of the limits in the resources stanza, for example by adding 1 to the CPU limit. But that doesn't work with some of my jobs.

@ygersie
Contributor

ygersie commented May 31, 2022

I wonder if this is related to #11778 (comment). It really looks like a bug in the scheduler that incorrectly fails placement during the node feasibility check. It is almost as if it's not iterating through all nodes and instead returns a placement failure before it has exhausted the full list.

@lssilva

lssilva commented Jun 7, 2022

I am also facing this issue and I had to downgrade Nomad.

@chilloutman
Author

I'm wondering if this could be the cause: https://github.com/hashicorp/nomad/pull/11111/files#diff-c4e3135b7aa83ba07d59d003a8ab006915207425b8728c4cf070eee20ab9157a

"// track node filtering, to only report an error if all nodes have been filtered" might not be working as intended. Or maybe instead of only warnings #11111 ended up causing errors.
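To make the intent of that quoted comment concrete, here is a toy Python model of the behavior it describes: a constrained system job reports a placement failure only when the constraint has excluded every node. This is a sketch under stated assumptions, not Nomad's Go implementation, and it does not reproduce the bug. `Node`, `plan_system_job`, and the node IDs are hypothetical; the "create" and "ignore" labels borrow the wording of the plan output quoted earlier in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    node_class: str


def plan_system_job(nodes, wanted_class, running_on=frozenset()):
    """Toy dry-run for a system job constrained to one node class.

    Returns (create, ignore, filtered, failed):
      create   - eligible nodes that need a new allocation
      ignore   - eligible nodes that already run the job
      filtered - nodes excluded by the constraint
      failed   - True only when the constraint excluded every node
    """
    create, ignore, filtered = [], [], []
    for node in nodes:
        if node.node_class != wanted_class:
            filtered.append(node.id)
        elif node.id in running_on:
            ignore.append(node.id)
        else:
            create.append(node.id)
    # Filtering some nodes is the expected outcome of a constraint;
    # only filtering all of them counts as a failure in this model.
    failed = bool(nodes) and len(filtered) == len(nodes)
    return create, ignore, filtered, failed


nodes = [Node("n1", "worker"), Node("n2", "worker"), Node("n3", "entry")]

# Job already running on both workers: nothing to create, nothing failed.
print(plan_system_job(nodes, "worker", running_on={"n1", "n2"}))
# prints ([], ['n1', 'n2'], ['n3'], False)
```

Under this model, the scenario in the original report (allocations already running on every worker node, one "entry" node filtered) would produce no warning at all, which matches the expected result stated there.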

@jmwilkinson
Contributor

Verified we hit this with constraints on 1.2.6 as well.

Mitigation was reverting to 1.1.5.

I do not know how bugs are prioritized but this should probably be pretty high.

@cr0c0dylus

BTW, it would be great if those warnings could be completely disabled in the config. If I have 50 nodes in the cluster and constrain a job to 3 of them, what is the point of seeing "47 Not Scheduled"? System jobs are very useful for scaling in an HA configuration: I don't need to modify the job stanza, just add or remove nodes with a special meta variable.

@dext0r

dext0r commented Jun 30, 2022

I'm wondering if this could be the cause: https://github.com/hashicorp/nomad/pull/11111/files#diff-c4e3135b7aa83ba07d59d003a8ab006915207425b8728c4cf070eee20ab9157a

"// track node filtering, to only report an error if all nodes have been filtered" might not be working as intended. Or maybe instead of only warnings #11111 ended up causing errors.

It's the cause indeed. Reverting this pull request fixed the issue for me on 1.3.1.

@cr0c0dylus

Nomad v1.2.9 (86192e4)

The problem persists. I still need to stop the 1.2.9 masters in sequence until 1.0.18 becomes the leader and allows deployment.

@jmwilkinson
Contributor

There may be a fix in 1.3.2, at least it looks that way: https://github.com/hashicorp/nomad/blob/v1.3.2/scheduler/scheduler_system.go#L298

@seanamos

The issue still exists in v1.5.3; it comes up frequently when upgrading system jobs.

While the nomad CLI reports this error, the rollout will still actually happen in Nomad.

@nCrazed

nCrazed commented Dec 20, 2023

I am seeing the same behavior as @seanamos in v1.6.3

@cr0c0dylus

The problem continues to occur in v1.7.3

@elgatopanzon

Can confirm still present in Nomad v1.7.7.

@josegonzalez
Contributor

josegonzalez commented Sep 16, 2024

I'm seeing this with Nomad 1.8.3, and additionally it's failing not only with plan but also with run.

I have 12 nodes running:

$ nomad node status -quiet
b15f6629-da08-0f17-8058-0a3032a769e1
31090485-dbe1-4b72-00bb-0e1282d82210
dc5177d0-7e07-28c2-8ddc-584be7c66c75
22a71c04-f531-7680-0019-b0e51bf83ba1
be37c669-d199-a716-2866-e4642aec3665
dcf03550-47ef-fe32-cbe1-67b711744608
d0e7cbfc-b934-ba23-54ff-cf38531c355a
3a7e8085-44bc-a150-6a1d-0040353a8528
2e01087a-f3a6-f86d-fd36-a18d86b92da2
8d8ba3e3-c180-9f1e-b2a2-08d42aad4e4d
ffbbb77f-744a-26f2-6a21-bf1d19316865
2f0b6be8-6564-6bbe-d85c-2457e532243f

I have a single system job with the following constraint (and no others):

  constraint {
    attribute = "${node.class}"
    value     = "private-t38"
  }

Which matches node ffbbb77f:

$ nomad node status ffbbb77f
ID              = ffbbb77f-744a-26f2-6a21-bf1d19316865
Name            = prod-ap-northeast-1-private-t38-i-SOME_INSTANCE_ID
Node Pool       = default
Class           = private-t38
...

When I try to place the job, it fails:

$ nomad job run job.nomad
==> 2024-09-16T19:04:26-04:00: Monitoring evaluation "6c4f7100"
    2024-09-16T19:04:26-04:00: Evaluation triggered by job "metadataproxy"
    2024-09-16T19:04:28-04:00: Evaluation status changed: "pending" -> "complete"
==> 2024-09-16T19:04:28-04:00: Evaluation "6c4f7100" finished with status "complete" but failed to place all allocations:
    2024-09-16T19:04:28-04:00: Task Group "app" (failed to place 1 allocation):
      * Class "private-nomad": 3 nodes excluded by filter
      * Class "private-common": 1 nodes excluded by filter
      * Class "public-dt316": 6 nodes excluded by filter
      * Constraint "${node.class} = private-t38": 11 nodes excluded by filter

The job itself is already running on the 1 node that matches that constraint. Once I stop the job, it can get placed.

One potentially interesting detail is that the job itself hasn't changed between invocations. If I change the job contents somehow, it'll place properly. This only happens when attempting to re-apply a job as it already exists. My guess is that it detects that the job hasn't changed, and therefore marks that node as a conflict for some reason as opposed to "already placed" (dunno if there is a word for that).

Not sure if this is exactly the above issue, but happy to dive in further if folks think it's related :)

@davemay99 davemay99 added the hcc/cst Admin - internal label Oct 14, 2024
Projects
Status: Needs Roadmapping
Development

No branches or pull requests