New Ranker that would facilitate "fairly" spreading job groups to eligible nodes #3690
Comments
Hi, maybe you could use distinct_property or one of the other constraints to ensure a proper spread of the job; that's how I've done it historically. Example with a max of 2 of your job type per instance:
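Something along these lines should do it (the attribute choice is up to you; distinct_property with a value caps how many allocs may share that property's value):

```hcl
constraint {
  operator  = "distinct_property"
  attribute = "${node.unique.id}"
  value     = "2"  # at most 2 allocs of this group per node
}
```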
Or simply put the resource requirements high enough (e.g. 51% of the instance) so Nomad never co-locates them :)
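For example, on a node with roughly 5 GHz of CPU, something like this (the numbers are illustrative) makes two instances impossible to co-locate:

```hcl
resources {
  cpu    = 2600  # MHz; more than half the node, so two allocs never fit together
  memory = 1024  # MB
}
```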
Hey @jippi, I think that could help; we'll look into it. Also, re: resource requirements, nodes always run hot anyway, so it probably won't help. Thanks
In my experience, Nomad will do a pretty good job of spreading allocs of the same type across multiple nodes out of the box; I would test with the config above. If you have a count of 5, and all 5 allocs fit on one instance, Nomad will not put all 5 on that instance anyway, but will try to ensure a spread within the same job, so that losing one node won't take down all allocs of a job. There is some anti-entropy going on as well. I've got jobs similar to yours, and with the above config example, I've never seen Nomad put all its eggs in one basket :)
This is mostly about making optimal use of the nodes, not so much about reducing the likelihood of service disruption from, say, placing multiple instances on a node, the node having issues, and losing whatever was served by it (though that's definitely important). Thank you though. We will try that configuration and see how far it takes us :)
It would be nice to choose a spread algorithm vs a pack. distinct doesn't help when you have more jobs than nodes. Imagine some CPU-intensive batch job or process that you want to spread out rather than have it take down a single node, or simply that you don't want multiple different jobs scheduled on the same nodes, i.e. you want to spread the blast radius / impact of a node failure, or spare one node from constant deployment churn. You also won't be subject to a Docker engine failure on the one node Nomad keeps choosing, where all jobs go into a black hole because the node is bad and the jobs fail.
Anti-affinity / spread-algo is definitely something we intend to support in the future. Doubt it will be in the next (0.8) release though. We've had interest from other users as well who are currently using constraints like distinct_property as a workaround.
It would be nice to choose a spread algorithm vs a pack.
Oh. Indeed, that would be really nice, especially if you're using soft limits.
@jippi that's still job- (and lower-) level, right? "Spreading" as the opposite of "bin packing".
It is per job, yep, but if all your jobs spread, it would basically be the same thing, or at least unblock your requirements until Nomad supports it at the cluster level. Generally, in my experience, Nomad is pretty good out of the box about not dumping 40 of the same alloc on the same box, so it's not something I've personally suffered issues from.
Not the same alloc. But if you have 150+ jobs with soft limits, you WILL have overloaded and almost-free nodes, because of bin-packing.
@ramm Spread at the cluster level is on the future roadmap; no ETA on that yet. We would need to introduce configuring it at the node-class level, so that you could have a cluster where a set of nodes uses spread rather than bin-packing for scoring placement.
We're very interested in this feature as well. We thought the new spread parameter would do exactly this and are very disappointed that it doesn't. Here's a graph of several dispatched jobs running on a 4-node cluster, each node represented with a different color. The job is spread by node.unique.name, which as you can tell has no effect whatsoever: starting at ~9:30 all jobs are scheduled on the same node, leaving the other 3 nodes totally idle.
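For reference, the stanza in question would have looked something like this (the weight is illustrative):

```hcl
spread {
  attribute = "${node.unique.name}"
  weight    = 100
}
```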
Hi @preetapan, still no ETA on this?
@alexiri, we just spent close to a week struggling to understand exactly this behavior. A cluster-wide spread is required for efficient oversubscription of resources for unary task specs in our use case. Any ideas for exploiting existing stanzas (the datacenter string, perhaps) to get a spread effect instead of bin-packing?
Spreading works when exploiting node_class affinity. E.g., if you have ten clients with node_class set to 0,1,2,...,9, then you can set an affinity for each job/group/task to int(rand(9)) for a better spread. Obviously this is not ideal, because your job submission mechanism now needs to be aware of the number of existing clients and their valid node_classes, but it works for our use case for now.
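Concretely, the submission tooling renders something like this into each job (the value shown is one roll of the dice; the weight is illustrative):

```hcl
affinity {
  attribute = "${node.class}"
  value     = "7"  # int(rand(9)) computed at submission time
  weight    = 100
}
```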
Hi @preetapan, has there been any progress on this?
In Nomad 0.11.2 we released the new spread scheduling option. See #7810 and the scheduler configuration documentation.
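A minimal sketch of switching the algorithm via the operator API, assuming a local agent on the default port (the update replaces the whole scheduler configuration, so in practice you would carry over existing settings such as preemption too):

```sh
curl -X PUT \
  --data '{"SchedulerAlgorithm": "spread"}' \
  http://localhost:4646/v1/operator/scheduler/configuration
```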
I guess this feature was released in v0.12.0.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
We are currently in the process of adopting Nomad and Consul, and other than the following use case, which as far as we know is hard or impossible to address with the current Nomad semantics and the available constraints/ranker implementation, we think Nomad is "perfect" for us.
Specifically, we have various classes of services that we currently spread across a fleet of physical nodes (say, 50) based on which node runs the least number of instances. We use a combination of home-grown tools and scripts to accomplish this today, but we would like to transition this (like everything else) to Nomad. Currently that's not possible, because the bin-packing-based ranker cannot accommodate that design.
One of those services is a multi-threaded application server. It dynamically resizes its thread pool and usually runs very hot (i.e. it keeps the CPU busy, and memory and I/O pressure are also high), which is what it was designed to do, given that each instance potentially handles thousands of requests. By spreading instances across the fleet of nodes, we get to utilise them optimally.
Bin-packing would place, say, 5 instances on the first node, then fill the next node up to 5, and so on. Which is to say, some nodes would be extremely busy and saturated while all the other nodes sit idle. Instead, the scheduler could select, among the 50 nodes that pass the constraint checks, the one that runs the fewest app server instances and run the job group there, which would pretty much solve the problem. (We have a few more such use cases; this is not specific to the app server.)
So, ideally, for us, there would be a(nother) constraint for the maximum number of instances of a specific job allowed on a node (in case we need to reserve capacity on some of those nodes for something else and only want to allow, say, up to 2 instances of the "application server" job group on them), and a new ranker stanza for selecting the spread ranker for that job (group?).
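Purely as a hypothetical illustration (none of this syntax exists in Nomad today; max_per_node and the ranker stanza are made-up names):

```hcl
group "application-server" {
  count = 50

  # hypothetical: cap instances of this group per node
  constraint {
    attribute    = "${node.unique.id}"
    max_per_node = 2
  }

  # hypothetical: opt this group out of bin-packing
  ranker "spread" {}
}
```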
I understand a new ranker is currently in the works that would make this possible, and we are really looking forward to it.
Here's some pseudocode in C++ for how it could probably work:
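(Roughly along these lines; a minimal sketch, with a hypothetical Node type standing in for the scheduler's view of a client:)

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the scheduler's view of a client node.
struct Node {
    std::string id;
    int running_instances;  // instances of this job group already placed here
};

// Spread ranker: among nodes that passed all constraint checks, pick the one
// running the fewest instances of the job group, honouring a per-node cap.
const Node* rank_spread(const std::vector<Node>& eligible, int max_per_node) {
    const Node* best = nullptr;
    for (const Node& n : eligible) {
        if (n.running_instances >= max_per_node)
            continue;  // the cap acts as a constraint, not a score
        if (best == nullptr || n.running_instances < best->running_instances)
            best = &n;
    }
    return best;  // nullptr means no node can take another instance
}
```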