Oversubscription support #606

Closed
c4milo opened this issue Dec 18, 2015 · 41 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/core type/enhancement

Comments

@c4milo
Contributor

c4milo commented Dec 18, 2015

I noticed scheduling decisions are being made based on what jobs are reporting as desired capacity. However, some tasks involved might not really use what they originally intended to or will become idle thus holding back resources that could be used by other tasks. Are there plans to improve this in the short to mid term?

@dadgar
Contributor

dadgar commented Dec 18, 2015

It is definitely something we are aware of and will be doing with Nomad. However, there are so many more pressing improvements that it is not something we will be focusing on in the near term.

@c4milo c4milo changed the title Are there any plans to measure real resource usage, optimizing bin packing to allow over provisioning? Are there any plans to measure real resource usage, optimizing bin packing to allow oversubscription? Dec 18, 2015
@c4milo c4milo changed the title Are there any plans to measure real resource usage, optimizing bin packing to allow oversubscription? Oversubscription support Dec 18, 2015
@cbednarski
Contributor

Thanks for the link. There are some tricky aspects to the implementation, not least of which is that applications running on the platform need to be aware of resource reclamation (for example, by observing memory pressure). In practice this is a complicated thing to implement.

In light of that, there are probably a few different approaches to this, in no particular order (and I'm not sure how these work at all on non-Linux kernels):

  1. "Reclaim" memory that is never used by an application. For example, if a job asks for 2 GB but only uses 512 MB, automatically resize the job.
  2. Provide a dynamic metadata-like endpoint so jobs can query their updated resource quotas.
  3. Mark jobs as "resize aware" to indicate they will respond to memory pressure.

Memory is probably the most complicated because it is a hard resource limit. For soft limits like CPU and network it's fairly easy to over-provision without crashing processes, but it's more difficult to provide QoS or consistent / knowable behavior across the cluster.

In general this problem is not easy to implement or reason about without a lot of extra data that is currently out of Nomad's scope. For example, we would need to look at historical resource usage for a particular application in order to resize it appropriately and differentiate between real usage and memory leaks and such, monitor the impact of resizing on application health (e.g. does resizing it cause it to crash more frequently), etc.
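To make the historical-usage idea above concrete, here is a minimal sketch that suggests a reservation from past memory samples by taking a high percentile and adding headroom. The helper name, the percentile, and the headroom factor are assumptions made purely for illustration; nothing here reflects how Nomad works.

```python
import math

def suggest_reservation_mb(samples_mb, percentile=0.95, headroom=1.2):
    """Suggest a memory reservation (MB) from historical usage samples.

    Hypothetical helper: takes the nearest-rank percentile of observed
    usage and multiplies it by a safety headroom factor.
    """
    if not samples_mb:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mb)
    # Nearest-rank percentile: smallest sample that covers `percentile` of the data.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)

# A job that mostly uses about 512 MB, with one much larger sample.
history = [480, 500, 512, 530, 505, 498, 520, 510, 515, 900]
print(suggest_reservation_mb(history))
print(suggest_reservation_mb(history, percentile=0.5, headroom=1.0))
```

The single large sample shows why this is hard: a percentile alone cannot tell whether that value is a legitimate spike, a leak, or noise, which is the kind of extra data the comment above refers to.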

So while this is something we'd like to do, and we're aware of the value it represents, this is likely not something we will be able to get to in the short / medium term.

@c4milo
Contributor Author

c4milo commented Dec 21, 2015

@cbednarski, indeed, it is a complex feature. I believe it can be implemented to some extent, though. This is the rough list of tasks I made while going through the paper; it is by no means complete:

  • It should allow defining Service Level Objectives (SLOs) for a given task in a job definition.
  • It should allow launching a task using a Best Effort (BE) policy, meaning there is some way to tag the task as BE in the job definition. This also means the task is interruptible.
  • It should continuously monitor latency and latency slack of scheduled Latency Critical (LC) tasks, in order to determine whether or not a node is suitable for oversubscription.
  • It should be able to isolate Latency Critical tasks from Best Effort ones by pinning LC tasks to specific CPU cores and specific CPU cache partitions.
  • It should monitor memory bandwidth using performance counters, making sure LC tasks receive sufficient bandwidth.
  • It should scale down the number of CPU cores assigned to a BE task if memory bandwidth for co-located LC tasks is not sufficient.
  • It should limit outgoing network traffic for BE tasks
  • It should not limit LC tasks network traffic
  • It should guarantee that BE tasks' power consumption does not cause CPU frequencies for LC tasks to scale down. In other words, it has to guarantee that LC tasks' desired CPU frequencies are honored at all times.

Finally, some personal remarks:

  • I don't think BE tasks need to declare desired hardware resources as they will be run as that, best effort tasks. If a node has available resources and latency of its current LC tasks is fine, it will be suitable for running BE tasks.
  • There won't be need to "reclaim" allocated resources as long as BE tasks are allowed to be scheduled on nodes suitable for oversubscription.
  • The above is a rough list of tasks as I mentioned, more details and hints about the implementation can be found directly in the paper.
  • I agree with you that for non-Linux nodes we will have to find out how to get performance counter information, limit network traffic, pin tasks to specific CPU cores, etc.
  • Perhaps the most difficult part of all this work may be its evaluation and testing.
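As a rough sketch of the latency-slack check described in the list above: a node is considered suitable for Best Effort tasks only while every Latency Critical task on it still has some margin below its SLO. The function names and the 10% threshold are hypothetical, chosen only to illustrate the idea.

```python
def latency_slack(slo_ms, observed_ms):
    """Fraction of the latency SLO still unused; negative means the SLO is violated."""
    return (slo_ms - observed_ms) / slo_ms

def node_accepts_best_effort(lc_tasks, min_slack=0.10):
    """Decide whether a node may take Best Effort (BE) work.

    lc_tasks: list of (slo_ms, observed_ms) pairs, one per Latency Critical task.
    min_slack: assumed threshold; every LC task must keep at least this much slack.
    """
    return all(latency_slack(slo, obs) >= min_slack for slo, obs in lc_tasks)

# Two LC tasks with comfortable slack, then one that is close to its SLO.
print(node_accepts_best_effort([(100, 60), (50, 20)]))
print(node_accepts_best_effort([(100, 60), (100, 95)]))
```

In this sketch a node with no LC tasks accepts BE work, which matches the remark that BE tasks need no declared resources of their own.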

@diptanu
Contributor

diptanu commented Dec 21, 2015

My thoughts around oversubscription are -

a. We need QoS guarantees for a Task. Tasks which are guaranteed certain resources should get them whenever they need them. Trying to add and take away resources for the same task doesn't work well in all cases, especially for memory. CPU and I/O can probably be tuned up and down unless the task has very bursty resource usage.

b. What works well, however, is the concept that certain jobs which are revocable are oversubscribed alongside jobs which are not revocable. We estimate how much we can oversubscribe, run some revocable jobs when capacity is available, and revoke them when the jobs which have been guaranteed the capacity need it.
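The revocable-job idea in (b) can be sketched as two small steps: estimate how much memory can be lent out, then pick revocable jobs to stop when guaranteed jobs need their capacity back. Both helpers, the safety margin, and the largest-first revocation order are assumptions for illustration, not a description of any existing scheduler.

```python
def revocable_capacity_mb(node_total_mb, guaranteed_usage_mb, safety_margin=0.1):
    """Estimate memory that could be lent to revocable jobs.

    Uses the actual usage of guaranteed jobs (not their reservation) and
    keeps an assumed safety margin of the node free.
    """
    reserve = node_total_mb * safety_margin
    return max(0, int(node_total_mb - guaranteed_usage_mb - reserve))

def jobs_to_revoke(revocable_jobs, available_mb):
    """Pick revocable jobs to stop, largest first, until usage fits available_mb.

    revocable_jobs: dict mapping job name to current memory usage in MB.
    """
    used = sum(revocable_jobs.values())
    victims = []
    for name, mb in sorted(revocable_jobs.items(), key=lambda kv: kv[1], reverse=True):
        if used <= available_mb:
            break
        victims.append(name)
        used -= mb
    return victims

running = {"batch-a": 3000, "batch-b": 2000, "batch-c": 1000}
print(revocable_capacity_mb(16000, 6000, safety_margin=0.25))
print(jobs_to_revoke(running, available_mb=2500))
```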

@doherty

doherty commented Jul 24, 2016

You may be interested in these videos about Google's Borg, which include a discussion of how oversubscription is handled with SLOs:

@jippi
Contributor

jippi commented Oct 25, 2016

Any ETA or implementation details on this one? :)

@dadgar
Contributor

dadgar commented Oct 25, 2016

No, this feature is very far down the line

@catamphetamine

catamphetamine commented Jul 1, 2017

I don't understand: if I just use Docker to run containers, it doesn't impose any restrictions or reservations on CPU or memory. This software, in contrast, requires the user to do so. If that's the case then I guess I'll just use Docker. And I don't need the clumsy "driver" concept just to support all those "vagrant" things floating around that no one really needs in modern microservice architectures.
This is called a Procrustean bed.

@dadgar
Contributor

dadgar commented Jul 1, 2017

@halt-hammerzeit There are very different requirements for a cluster scheduler and a local container runtime. Nomad imposes resource constraints so it can bin-pack nodes and provide a certain quality of service for tasks it places.

When you don't place any resource constraints using Docker locally, you get no guarantee that a container won't utilize all the CPU/memory on your system, and that is not suitable for a cluster scheduler! Hope that helps!
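To illustrate why a scheduler wants declared resources, here is a minimal first-fit-decreasing bin-packing sketch over memory reservations. It is a textbook heuristic used only as an example; it is not Nomad's placement algorithm, and the function name and data shapes are hypothetical.

```python
def first_fit_decreasing(task_mb, node_capacity_mb):
    """Pack tasks onto identical nodes using their declared memory.

    task_mb: dict mapping task name to reserved memory in MB.
    Returns a list of nodes, each a list of the task names placed on it.
    """
    nodes = []  # each entry is [free_mb, [task names]]
    for name, mb in sorted(task_mb.items(), key=lambda kv: kv[1], reverse=True):
        if mb > node_capacity_mb:
            raise ValueError(f"{name} does not fit on any node")
        for node in nodes:
            if node[0] >= mb:
                node[0] -= mb
                node[1].append(name)
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append([node_capacity_mb - mb, [name]])
    return [names for _, names in nodes]

tasks = {"a": 2048, "b": 1024, "c": 1024, "d": 512}
print(first_fit_decreasing(tasks, node_capacity_mb=4096))
```

Without a declared size per task, the packing step has nothing to compare against node capacity, which is the point made in the comment above.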

@jzvelc

jzvelc commented Jul 5, 2017

@dadgar Certainty isn't required in some cases. For example, we are currently running around 30 QA environments, and for that we need a lot of servers (each environment (job) needs around 2 GB of memory in order to cover memory spikes). Utilization of those servers is very low and we can't work around those memory spikes (e.g. PHP Symfony2 apps require cache warm-up at startup, which consumes 3 times the memory that is actually needed at runtime). I should be able to schedule those QA environments on a single server, and I don't care if QoS is degraded since it's for testing purposes only. The scheduler should still make decisions based on the task memory limit provided, but we should be able to define a soft limit and disable the hard memory limit on Docker containers. Something like #2771 would be great.
Other container platforms such as ECS and Kubernetes handle this just fine.

@tshak

tshak commented Nov 10, 2017

I think it is difficult to ask developers to define resource allocations for services, especially for new services or when a service is running in a wide range of environments. I understand that Nomad's approach greatly simplifies bin packing, but this does us little good if we aren't good at predicting the required resources. One reason this is particularly challenging for us is that we have hundreds of different production environments (multiple tiers of multi-tenancy plus lots of single tenants with a wide range of hardware and usage requirements). Even if we can generalize some of these configurations, I believe that explicit resource allocation could be an undesirable challenge for us to take on for each service.

Clearly there was a lot of thought put into the current design. The relatively low priority of this issue also indicates to me a strong opinion that the current offering is at least acceptable, if not desirable, for a wide range of use cases. Maybe some guidance on configuring resource allocation would be helpful, especially in cases where we lack a priori knowledge of the load requirements of a service.

Ultimately my goal is to provide developers a nearly "Cloud Foundry" like experience. "Here's my service, make it run, I don't care how". I really like Nomad's simplicity compared to other solutions like Kubernetes, but this particular issue could be an adoption blocker. I'm happy to discuss further or provide more detail about my particular scenario here or on Gitter.

@dadgar
Contributor

dadgar commented Nov 10, 2017

@tshak I would recommend you over-allocate resources for new jobs, then monitor the actual resource usage and adjust the resource ask to be in line with your actual requirement.

@catamphetamine

@tshak Or look into Kubernetes which seems to claim to support such a feature

@tshak

tshak commented Nov 11, 2017

Thanks @dadgar. Unfortunately this is a high burden since we have many environments of varying sizes and workloads that these services run on. We've got a good discussion going on Gitter if you're interested in more. As @catamphetamine said, we may have to use Kubernetes. The concept of "Guaranteed" vs. "Burstable" vs. "BestEffort" seems to better fit our use case. I was hoping for Nomad to have (or plan for the near future) something similar, since Nomad is otherwise preferable to me!
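For readers unfamiliar with the three Kubernetes QoS classes named above, the sketch below classifies a single-container pod from its requests and limits. It is a simplification: real pods can have several containers, and this version compares quantity strings literally rather than parsing them. The function name is hypothetical.

```python
def qos_class(requests, limits):
    """Classify a single-container pod into a Kubernetes QoS class.

    requests, limits: dicts with optional "cpu" and "memory" keys.
    """
    if not requests and not limits:
        return "BestEffort"
    resources = ("cpu", "memory")
    # Kubernetes defaults a missing request to the limit when a limit is set.
    effective = {r: requests.get(r, limits.get(r)) for r in resources}
    if all(r in limits and effective[r] == limits[r] for r in resources):
        return "Guaranteed"
    return "Burstable"

print(qos_class({}, {}))
print(qos_class({"cpu": "500m", "memory": "1Gi"}, {"cpu": "500m", "memory": "1Gi"}))
print(qos_class({"memory": "750Mi"}, {"memory": "4Gi"}))
```

The "Burstable" case is the one most of this thread asks for: a small request used for placement and a larger limit that is only enforced at runtime.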

@catamphetamine

I too was giving it a second thought yesterday, after abandoning Nomad in the summer because it lacked this feature.
Containers are meant to be "stateless" and "ephemeral", so if a container crashes due to an Out Of Memory error then ideally it would make no difference, as the calling code should automatically retry the API query.
In the real world, though, there's no "auto retry" feature in any code, so if an API request fails the whole transaction may be left in an inconsistent state, possibly corrupting the application's data.

@DanielFallon

I think what Kubernetes calls "burstable" is the most intractable use case here. Many services that we run use a large heap during startup and then have relatively low memory usage afterwards.

One of the services I've been monitoring requires as much as 4 GB during startup to warm up its cache, and then it typically uses around 750 MB of RAM during normal operation. With Nomad I must allocate 4 GB of RAM for each of these microservices, which is really expensive.
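A quick worked example of the cost described above, using the 4 GB startup and 750 MB steady-state figures from the comment. The 16 GiB node size is an assumption picked for the arithmetic, and the third case assumes instances are started one at a time so that only one is warming up at any moment.

```python
def instances_per_node(node_mb, reserved_mb):
    """How many instances fit if each reserves reserved_mb."""
    return node_mb // reserved_mb

def instances_with_one_warming(node_mb, peak_mb, steady_mb):
    """Instances that fit if at most one is at its startup peak at a time."""
    if peak_mb > node_mb:
        return 0
    return (node_mb - peak_mb) // steady_mb + 1

NODE_MB = 16 * 1024   # hypothetical 16 GiB node
PEAK_MB = 4 * 1024    # startup peak from the comment above
STEADY_MB = 750       # steady-state usage from the comment above

print(instances_per_node(NODE_MB, PEAK_MB))                      # reserve the peak
print(instances_per_node(NODE_MB, STEADY_MB))                    # reserve steady state only
print(instances_with_one_warming(NODE_MB, PEAK_MB, STEADY_MB))   # staggered startup
```

Reserving the peak fits 4 instances on this node, while the other two cases fit 21 and 17; that difference is the capacity a burstable model would recover.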

@CumpsD

CumpsD commented Apr 16, 2018

Is it possible with Nomad at this time to support this burstable feature?

I have a job which consumes a lot of CPU when launching and then falls back to almost no CPU.

However, Nomad (using Docker driver) kills off this job before it can get over its initial peak.

I cannot allocate enough CPU to get this started, or else it doesn't find an allocation target.

@schmichael
Member

However, Nomad (using Docker driver) kills off this job before it can get over its initial peak.
-- @CumpsD

Nomad should not be killing a job due to its CPU usage. Just so you know: by default Docker/exec/java/rkt CPU limits are soft limits meaning they allow bursting CPU usage today. If the count is >1 you may want to use the distinct_hosts constraint on the initial run to make sure multiple instances aren't contending for resources on the same host, but beyond the initial run the deployments feature can prevent instances from starting at the same time during their warmup period.

While we still plan on first class oversubscription someday, CPU bursting should be beneficial for your use case today. Please open a new issue if you think there's a bug.
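The soft-limit behavior described above can be approximated with a proportional-share model: when the node is idle a task may burst past its reservation, and under contention CPU is split according to each task's shares. This is a simplified sketch, not the kernel scheduler's actual algorithm; the function name is hypothetical and it assumes every task has a positive share value.

```python
def share_cpu(capacity_mhz, tasks):
    """Split CPU among tasks by shares, capping each task at its demand.

    tasks: dict mapping name to (shares, demand_mhz). Shares must be positive.
    Returns a dict mapping name to granted MHz.
    """
    granted = {name: 0.0 for name in tasks}
    active = dict(tasks)
    remaining = float(capacity_mhz)
    while active and remaining > 1e-9:
        total_shares = sum(shares for shares, _ in active.values())
        # Tasks whose remaining demand fits inside their fair slice are fully served.
        satisfied = [
            name for name, (shares, demand) in active.items()
            if demand - granted[name] <= remaining * shares / total_shares
        ]
        if not satisfied:
            # Everyone wants more than their slice: split what is left by shares.
            for name, (shares, _) in active.items():
                granted[name] += remaining * shares / total_shares
            break
        for name in satisfied:
            need = tasks[name][1] - granted[name]
            granted[name] += need
            remaining -= need
            del active[name]
    return granted

# A warming-up task bursts far past its 1000-share reservation on an idle node.
print(share_cpu(4000, {"warming": (1000, 3500), "idle": (1000, 100)}))
# Under contention, equal shares split the node evenly.
print(share_cpu(4000, {"a": (1000, 3500), "b": (1000, 3500)}))
```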

@CumpsD

CumpsD commented Apr 17, 2018

@schmichael it was a wrong assumption: the CPU spike at start and the stopping were unrelated. After debugging for a day, it turned out the inner process had stopped, causing the job to stop and restart.

Sorry for the confusion. After fixing it, it is obvious the limit is indeed a soft one :)

@shantanugadgil
Contributor

Can this feature be prioritized?
I really (really really) need 'memory over-subscription' for Docker tasks.
I could do with a "-1" flag which basically means "don't care" and launches the containers with no memory constraints.

This could be an agent config setting, and thus not get enabled for all agents.

I have a few dockers (tasks) which get allocated to a single node (based on the node name constraint at job level) and I cannot assign small memory values to them as then they don't start at all.
Ref: https://www.nomadproject.io/docs/drivers/docker.html#memory

I thus end up launching a larger machine than really needed to satisfy the resource requirements.

I get the SLA bit and the need for specifying the resource requirements upfront, but the "-1" flag along with allowing swap is what would really solve this in an elegant way.

I have been contemplating launching the tasks on that single node using a docker-compose (yech!) as a raw_exec task.

At this point, I am even okay with flags like:
i_really_need_swap_enabled and ignore_memory_constraints_during_launch.

@jippi
Contributor

jippi commented Dec 12, 2018

With Nomad 0.9 we can do plugins - I certainly plan to add a docker driver with exactly those features :)

@academiqnsu

Can you tell us the status of this ticket?

@honnibal

honnibal commented Sep 27, 2019

@shantanugadgil @academiqnsu

I think we have a workaround for this. You can tell Nomad how much memory to expect for each node in the agent config: https://www.nomadproject.io/docs/configuration/client.html . If you set this to some factor higher (e.g. 3x higher), you can then set all the memory limits for your jobs 3x higher as well. The placements will go through, because Nomad checks against the resources described in the config. But the tasks won't be killed, because the physical memory use won't run into the 3x higher limit you've set.

I have to say, I think Nomad should probably rethink the policies around over-subscription, and maybe build an easier affordance for this.

Most programmers work with memory-managed languages these days, which makes it almost impossible to stick to a memory budget. There are tonnes of situations where you'll have a short period where you're even 5x over the normal memory use. Predicting these short memory bursts is very difficult just from reading the code, so you'll get a lot of eventual job failures if there's no over-subscription.

Serialization is a particularly common cause of memory spikes. If you've got most of your memory in some data structure and then you write that data structure to disk, you might create a byte string for the whole object. That's at least 2x usage right there. While constructing the string, maybe the serialization library creates some intermediate representation --- now you're at 3x. Etc. Dying at serialization like this is particularly nasty, because it means you'll very often hit a pattern where you run for a while, do a bunch of work, and die just as you're trying to save it.

Those serialization issues could happen in any language, but in languages like Python the total size that the process is using is particularly opaque. Python doesn't really give you guarantees about when objects will be freed, and it's really normal not to worry much about the fact that two copies might be temporarily around if you write something like thing = func(thing).
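A small measurement makes the serialization spike visible. The sketch below uses the standard library's `tracemalloc` to record the peak of new allocations made while `pickle.dumps` runs; that peak comes on top of the memory already held by the data structure itself. The helper name and the sample data are illustrative assumptions.

```python
import pickle
import tracemalloc

def serialization_peak_bytes(obj):
    """Return (serialized size, peak bytes newly allocated while pickling)."""
    tracemalloc.start()
    try:
        blob = pickle.dumps(obj)
        # Only allocations made since start() are counted, so the existing
        # object is excluded and the peak reflects the extra memory needed.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return len(blob), peak

data = [float(i) for i in range(200_000)]
size, peak = serialization_peak_bytes(data)
print(f"serialized bytes: {size}, extra peak during dumps: {peak}")
```

Because the whole byte string has to exist in memory at once, the extra peak is at least the serialized size, which is the "2x right there" effect described above.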

@shantanugadgil
Contributor

@honnibal I had also reached a similar conclusion to fix the first part of the problem, i.e. to allow multiple tasks to actually start.

This works great for raw_exec, as memory constraints are considered only during allocation and not the actual run.

But for Docker, using swap was disabled in some ancient version (0.4, I think).

My ask for swap ensures that the Docker tasks won't die if all of them really start allocating large amounts of memory.

@honnibal

Just saw this issue: #6085

It sounds like docker does swap, currently, but unintentionally? This might complicate the analysis...But in the meantime if you install 0.9.4 it might work?

@shantanugadgil
Contributor

Awesome find @honnibal !!!

@zackkain

How have other people kept an eye on jobs running overallocated? Do you use the HTTP API? I can't quite get the information I want from the telemetry data.

@LordFPL

LordFPL commented Mar 25, 2020

A little bump for this need: I have many jobs that need CPU at start but almost nothing afterwards. Being able to overallocate CPU/memory would be really useful.
I will try the workaround with the agent config, but it's not really a solution.

@dvusboy

dvusboy commented Mar 25, 2020

How would oversubscription help in this scenario?

@LordFPL

LordFPL commented Mar 26, 2020

(Sorry for the late answer.) In my case, I have this node, for example:

Allocated Resources
CPU              Memory          Disk
26980/27600 MHz  67 GiB/157 GiB  6.4 GiB/1.6 TiB

Allocation Resource Utilization
CPU            Memory
674/27600 MHz  19 GiB/157 GiB

Host Resource Utilization
CPU             Memory          Disk
3752/27600 MHz  44 GiB/157 GiB  81 GiB/1.6 TiB

My host is full but unused, due to several allocations that are "too high":

for i in $(nomad node status -self -short | grep running | awk '{print $1}'); do nomad alloc status $i | grep MHz; done
17/4000 MHz  8.8 GiB/14 GiB  300 MiB  http: 10.2.200.138:23314
24/4000 MHz  1.3 GiB/18 GiB  300 MiB  http: 10.2.200.138:22872
27/4000 MHz  772 MiB/1.2 GiB  300 MiB  http: 10.2.200.138:22120
19/1000 MHz  819 MiB/1000 MiB  300 MiB  http: 10.2.200.138:23039
0/1000 MHz  1.1 MiB/600 MiB  300 MiB  ssh: 10.2.200.138:24484
5/2000 MHz  1.5 GiB/2.4 GiB  300 MiB  http: 10.2.200.138:30431
11/1000 MHz  791 MiB/1000 MiB  300 MiB  http: 10.2.200.138:27430
30/1000 MHz  632 MiB/4.0 GiB  300 MiB  http: 10.2.200.138:26262
10/1000 MHz  661 MiB/4.0 GiB  300 MiB  http: 10.2.200.138:22422
99/1000 MHz  903 MiB/1.0 GiB  300 MiB  http: 10.2.200.138:28412
0/200 MHz  10 MiB/1.5 GiB  300 MiB  http: 10.2.200.138:29148
0/1000 MHz  6.8 MiB/4.0 GiB  300 MiB  http: 10.2.200.138:25068
0/200 MHz  24 MiB/1.5 GiB  300 MiB  http: 10.2.200.138:27136
125/80 MHz  2.4 MiB/500 MiB  300 MiB  flexlm: 10.2.200.138:6200
26/1000 MHz  918 MiB/4.0 GiB  300 MiB  http: 10.2.200.138:27596
4/1000 MHz  85 MiB/4.0 GiB  300 MiB  http: 10.2.200.138:24475
11/1000 MHz  1.6 GiB/2.0 GiB  300 MiB  http: 10.2.200.138:30787
0/1000 MHz  1.0 MiB/600 MiB  300 MiB  ssh: 10.2.200.138:29025
0/200 MHz  1.4 MiB/1.5 GiB  300 MiB  http: 10.2.200.138:21228
0/1000 MHz  1.0 MiB/600 MiB  300 MiB  ssh: 10.2.200.138:37625
1/100 MHz  1.1 MiB/200 MiB  300 MiB
68/200 MHz  39 MiB/500 MiB  300 MiB

By default, I assign 1000 MHz, but with many jobs per node I reach the limit.
With oversubscription, I could always run new jobs and keep an eye on real usage (all the 1000 MHz values above will be lowered in the future, but we always need time to watch the real usage of each job).
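The gap between allocated and used CPU in the listing above can be summarized with a short script. This is a sketch that assumes each line starts with `used/allocated MHz`, as in the pasted output; the helper names are hypothetical and the three sample lines are copied from the listing.

```python
import re

def parse_cpu_usage(line):
    """Parse the leading 'used/allocated MHz' pair from an alloc status line."""
    match = re.match(r"\s*(\d+)/(\d+) MHz", line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return int(match.group(1)), int(match.group(2))

def cpu_summary(lines):
    """Return (total used MHz, total allocated MHz, used/allocated ratio)."""
    used = allocated = 0
    for line in lines:
        u, a = parse_cpu_usage(line)
        used += u
        allocated += a
    return used, allocated, used / allocated

sample = [
    "17/4000 MHz  8.8 GiB/14 GiB  300 MiB  http: 10.2.200.138:23314",
    "24/4000 MHz  1.3 GiB/18 GiB  300 MiB  http: 10.2.200.138:22872",
    "125/80 MHz  2.4 MiB/500 MiB  300 MiB  flexlm: 10.2.200.138:6200",
]
used, allocated, ratio = cpu_summary(sample)
print(f"{used}/{allocated} MHz in use ({ratio:.1%})")
```

Note the third sample line, where usage (125 MHz) exceeds the 80 MHz allocation: the CPU value acts as a soft limit, so a task can use more than it reserved when the node has spare cycles.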

@dvusboy

dvusboy commented Mar 26, 2020

You can always lowball the CPU resource requirement on your tasks to jam more tasks onto a node. That's essentially oversubscription for your use case. AFAIK, the resource > CPU value does not explicitly place cgroup limitations on the tasks. You'd just have more processes fighting for CPU slices.

@shantanugadgil
Contributor

Reducing CPU for Docker doesn't work well in the common case where the container startup uses that value for further internal assignments and allocations. Example: a java command line inside a Docker task that sets the -X memory parameters.

@dvusboy

dvusboy commented Mar 26, 2020

Sorry, @shantanugadgil, therein lies my beef with Java, or anything on the JVM: it is such a pain for resource management.
That aside, I agree, it may not work for all scenarios, but it's also important in this discussion to point out these holes so that anyone who does take up the cause will take these cases into account.
It's probably just me, but I mistrust oversubscription, and I don't need another suspicion to lurk around when I need to troubleshoot things.

@shantanugadgil
Contributor

In my opinion, the entire premise of 'no swap' and 'up front reservation' of CPU/memory is more of a typical PROD requirement, whereas dev/validation testing (non-stress) can get away with slower speeds.
In the spirit of keeping the job definition the same for dev/prod, tweaking the agent values to allow more tasks to be scheduled on a node makes sense to me.
If it were up to me, I would even have a "factor" of how much to tweak it per environment: x1 for PROD (i.e. no tweaking), x2 for QA and x3 for DEV, given that there would be some config management system to create the agent config files. (Detect memory and multiply by 3 😀 )

@jippi
Contributor

jippi commented Mar 31, 2020

What we're working towards as a workaround is to use the client -> memory_total_mb setting to artificially increase the memory on the instances to fake "over-subscription" :)

@shantanugadgil
Contributor

What we're working towards as a workaround is to use the client -> memory_total_mb setting to artificially increase the memory on the instances to fake "over-subscription" :)

Yes, that is the "factor" I am talking about. Better to lie in the agent config than to starve the task of resources 😆
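The per-environment "factor" workaround discussed above could be generated by a config management step along these lines. This is a sketch under assumptions: the factor table mirrors the x1/x2/x3 values suggested in the thread, the helper names are hypothetical, and the detected memory is passed in rather than probed from the host.

```python
# Assumed per-environment multipliers, following the x1/x2/x3 suggestion above.
FACTORS = {"prod": 1, "qa": 2, "dev": 3}

def advertised_memory_mb(detected_mb, environment):
    """Memory the agent should advertise: detected memory times the factor."""
    return detected_mb * FACTORS[environment]

def client_config_fragment(detected_mb, environment):
    """Render a client block that overrides the node's total memory."""
    total = advertised_memory_mb(detected_mb, environment)
    return f"client {{\n  memory_total_mb = {total}\n}}\n"

print(client_config_fragment(16000, "dev"))
```

As noted earlier in the thread, job memory limits then have to be raised by the same factor, so tasks are still only protected from each other by the real physical memory of the node.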

@shoenig shoenig self-assigned this May 28, 2020
@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline commitment though. label Aug 24, 2020
@yishan-lin
Contributor

On the roadmap - coming soon.

@kmohageri-blacksky

Glad to hear this is on the roadmap. Any ETA available?

@tgross
Member

tgross commented May 4, 2021

Memory oversubscription has shipped in Nomad 1.1.0-beta. See #10247

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022