Oversubscription support #606
Comments
It is definitely something we are aware of and will be doing with Nomad. However, there are many more pressing improvements, so it is not something we will be focusing on in the near term.
Relevant: http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf
About memory bandwidth isolation:
CPU cache allocation and memory bandwidth monitoring:
Thanks for the link. There are some tricky aspects to the implementation, not least of which is that applications running on the platform need to be aware of resource reclamation (for example, by observing memory pressure). In practice this is a complicated thing to implement. In light of that, there are probably a few different approaches, in no particular order (and I'm not sure how these work at all on non-Linux kernels):
Memory is probably the most complicated because it is a hard resource limit. For soft limits like CPU and network it's fairly easy to over-provision without crashing processes, but it's more difficult to provide QoS or consistent, knowable behavior across the cluster. In general this problem is not easy to implement or reason about without a lot of extra data that is currently out of Nomad's scope. For example, we would need to look at historical resource usage for a particular application in order to resize it appropriately and differentiate between real usage and memory leaks, monitor the impact of resizing on application health (e.g. does resizing cause it to crash more frequently?), etc. So while this is something we'd like to do, and we're aware of the value it represents, it is likely not something we will be able to get to in the short / medium term.
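For context, this is roughly what the per-task resource declaration being discussed looks like in a Nomad job file. A minimal sketch with placeholder values and image name; exact enforcement varies by driver, but memory is enforced as a hard limit while CPU is a soft, share-based limit for most drivers:

```hcl
job "example" {
  datacenters = ["dc1"]

  group "web" {
    task "server" {
      driver = "docker"

      config {
        image = "nginx:alpine" # placeholder image
      }

      resources {
        cpu    = 500 # MHz; a soft limit for most drivers, so brief bursts are tolerated
        memory = 256 # MB; a hard limit, so exceeding it gets the task OOM-killed
      }
    }
  }
}
```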
@cbednarski, indeed, it is a complex feature. I believe it can be implemented to some extent, though. This is the list of rough tasks I made while going through the paper; it is not by any means complete:
Finally, some personal remarks:
My thoughts around oversubscription are:
a. We need QoS guarantees for a task. Tasks which are guaranteed certain resources should get them whenever they need them. Trying to add and take away resources for the same task doesn't work well in all cases, especially for memory. CPU and I/O can probably be tuned up and down unless the task has very bursty resource usage.
b. What works well, however, is the concept that certain jobs which are revocable are oversubscribed alongside jobs which are not revocable. We estimate how much we can oversubscribe, run some revocable jobs when capacity is available, and revoke them when the jobs which have been guaranteed the capacity need it.
You may be interested in these videos about Google's Borg, which include a discussion of how oversubscription is handled with SLOs:
Any ETA or implementation details on this one? :)
No, this feature is very far down the line.
I don't understand: if I just use Docker to run containers, it doesn't impose any restrictions or reservations on CPU or memory. This software, in contrast, requires the user to do so. If that's the case then I guess I'll just use Docker. And I don't need the clumsy "driver" concept just to support all those "vagrant" things floating around that no one really needs in modern microservice architectures.
@halt-hammerzeit A cluster scheduler and a local container runtime have very different requirements. Nomad imposes resource constraints so it can bin-pack nodes and provide a certain quality of service for the tasks it places. When you don't set any resource constraints with Docker locally, you get no guarantee that a container won't use all the CPU/memory on your system, and that is not suitable for a cluster scheduler! Hope that helps!
@dadgar Certainty isn't required in some cases. For example, we are currently running around 30 QA environments, and for that we need a lot of servers (each environment (job) needs around 2 GB of memory in order to cover memory spikes). Utilization of those servers is very low and we can't work around those memory spikes (e.g. PHP Symfony2 apps require cache warm-up at startup, which consumes three times the memory that is actually needed at runtime). I should be able to schedule those QA environments on a single server, and I don't care if QoS is degraded since it's for testing purposes only. The scheduler should still make decisions based on the task memory limit provided, but we should be able to define a soft limit and disable the hard memory limit on Docker containers. Something like #2771 would be great.
I think it is difficult to ask developers to define resource allocations for services, especially for new services or when a service runs in a wide range of environments. I understand that Nomad's approach greatly simplifies bin packing, but this does us little good if we aren't good at predicting the required resources. One reason this is particularly challenging for us is that we have hundreds of different production environments (multiple tiers of multi-tenancy plus lots of single tenants with a wide range of hardware and usage requirements). Even if we can generalize some of these configurations, I believe that explicit resource allocation could be an undesirable challenge for us to take on for each service.

Clearly there was a lot of thought put behind the current design. The relatively low priority on this issue also indicates to me a strong opinion that the current offering is at least acceptable, if not desirable, for a wide range of use cases. Maybe some guidance on configuring resource allocation, especially in cases where we lack a priori knowledge of the load requirements of a service, would be helpful.

Ultimately my goal is to provide developers a nearly "Cloud Foundry" like experience: "Here's my service, make it run, I don't care how." I really like Nomad's simplicity compared to other solutions like Kubernetes, but this particular issue could be an adoption blocker. I'm happy to discuss further or provide more detail about my particular scenario here or on Gitter.
@tshak I would recommend you over-allocate resources for new jobs, then monitor the actual resource usage and adjust the resource ask to be in line with your actual requirement.
@tshak Or look into Kubernetes, which seems to claim to support such a feature.
Thanks @dadgar. Unfortunately this is a high burden since we have many environments of varying sizes and workloads that these services run on. We've got a good discussion going on Gitter if you're interested in more. As @catamphetamine said, we may have to use Kubernetes. The concept of "Guaranteed" vs. "Burstable" vs. "BestEffort" seems to better fit our use case. I was hoping for Nomad to have (or plan for the near future) something similar, since Nomad is otherwise preferable to me!
I too was giving it a second thought yesterday, after abandoning Nomad in the summer because it lacked this feature.
I think what Kubernetes calls "burstable" is the most intractable use case here. Many services that we run use a large heap during startup and then have relatively low memory usage afterwards. One of the services I've been monitoring requires as much as 4 GB during startup to warm up its cache and then typically runs at around 750 MB of RAM during normal operation. With Nomad I must allocate 4 GB of RAM for each of these microservices, which is really expensive.
Is it possible with Nomad at this time to support this burstable feature? I have a job which consumes a lot of CPU when launching and then falls back to almost no CPU. However, Nomad (using the Docker driver) kills off this job before it can get over its initial peak. I cannot allocate enough CPU to get this started, or else it doesn't find an allocation target.
Nomad should not be killing a job due to its CPU usage. Just so you know: by default, Docker/exec/java/rkt CPU limits are soft limits, meaning they allow bursting CPU usage today. If the count is >1 you may want to use the

While we still plan on first-class oversubscription someday, CPU bursting should be beneficial for your use case today. Please open a new issue if you think there's a bug.
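To illustrate the soft CPU limit mentioned above: with the Docker driver the `cpu` value is translated into CPU shares, so a task reserved at a small value can still burst when the node has idle cycles. A rough sketch (image name and values are placeholders, not from this thread):

```hcl
task "bursty-start" {
  driver = "docker"

  config {
    image = "myorg/startup-heavy:latest" # placeholder image
  }

  resources {
    cpu    = 200 # MHz reserved for scheduling; CPU shares let the task burst above this at startup
    memory = 512 # MB; still a hard cap at this point in the thread
  }
}
```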
@schmichael It was a wrong assumption; the CPU spike at start and the stopping were unrelated. After debugging for a day, it turned out the inner process stopped, causing the job to stop and restart. Sorry for the confusion. After fixing it, it is obvious the limit is indeed a soft one :)
Can this feature be prioritized? This could be an agent config setting, and thus not get enabled for all agents.

I have a few Docker tasks which get allocated to a single node (based on the

I thus end up launching a larger machine than really needed to satisfy the resource requirements. I get the SLA bit and the need for specifying the resource requirements upfront, but the "-1" flag along with allowing swap is what would really solve this in an elegant way.

I have been contemplating launching the tasks on that single node using a

At this point, I am even okay with flags like:
With Nomad 0.9 we can do plugins - I certainly plan to add a Docker driver with exactly those features :)
Can you tell us the status of this ticket?
I think we have a workaround for this. You can tell Nomad how much memory to expect for each node in the agent config: https://www.nomadproject.io/docs/configuration/client.html. If you set this to some factor higher (e.g. 3x higher), you can then set all the memory limits for your jobs 3x higher as well. The placements will go through, because placement checks against the resources described in the config... but the tasks won't be killed, because the physical memory use won't run into the 3x higher limit you've set.

I have to say, I think Nomad should probably rethink the policies around over-subscription, and maybe build an easier affordance for this. Most programmers work with memory-managed languages these days, which makes it almost impossible to stick to a memory budget. There are tonnes of situations where you'll have a short period where you're even 5x over the normal memory use. Predicting these short memory bursts is very difficult just from reading the code, so you'll get a lot of eventual job failures if there's no over-subscription.

Serialization is a particularly common cause of memory spikes. If you've got most of your memory in some data structure and then you write that data structure to disk, you might create a byte string for the whole object. That's at least 2x usage right there. While constructing the string, maybe the serialization library creates some intermediate representation --- now you're at 3x. Etc. Dying at serialization like this is particularly nasty, because it means you'll very often hit a pattern where you run for a while, do a bunch of work, and die just as you're trying to save it.

Those serialization issues could happen in any language, but in languages like Python the total size the process is using is particularly opaque. Python doesn't really give you guarantees about when objects will be freed, and it's really normal not to worry much about the fact that two copies might be temporarily around if you write something like
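As a concrete illustration of this workaround, the client stanza of the agent config can advertise more memory than the machine actually has. The 3x factor here mirrors the example above and the numbers are purely illustrative:

```hcl
client {
  enabled = true

  # Machine physically has ~8 GB; advertise ~24 GB (3x) so placements go through.
  # Job memory limits can then also be set ~3x higher without tasks being killed,
  # as long as actual usage stays within physical memory.
  memory_total_mb = 24576
}
```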
@honnibal I had also reached a similar conclusion to fix the first part of the problem, i.e. to allow multiple tasks to actually start. This works great for

But, for Docker, using swap has been disabled since some ancient version (0.4, I think).

My ask for swap is to ensure that the Docker tasks won't die if all of them really start allocating large amounts of memory.
Just saw this issue: #6085. It sounds like Docker does swap currently, but unintentionally? This might complicate the analysis... But in the meantime, if you install 0.9.4 it might work?
Awesome find @honnibal !!!
How have other people kept an eye on jobs running overallocated? Do you use the HTTP API? I can't quite get the information I want from the telemetry data.
A small bump for this need: I have many jobs that need CPU at start... but almost nothing after. Having the possibility of overallocating CPU/memory would be really useful.
How would oversubscription help in this scenario?
(Sorry for the late answer.) In my case, I have this node, for example:
My host is full... but unused... due to several "too high" allocations:
By default, I assign 1000 MHz... but with many jobs per node... I reach the limit.
You can always lowball the CPU resource requirement on your tasks to jam more tasks onto a node. That's essentially oversubscription for your use case. AFAIK, the
Reducing CPU for Docker doesn't work well in the common case where the container's startup uses that value for further internal assignments and allocations. Example: a Java command line inside a Docker task using it to set the -X memory parameters.
Sorry, @shantanugadgil, there lies my beef with Java, or anything JVM: such a pain in resource management.
In my opinion, the entire premise of 'no swap' and 'up-front reservation' of CPU/memory is more of a typical PROD requirement, whereas dev/validation testing (non-stress) can get away with slower speeds.
What we're working towards as a workaround is to use the client -> memory_total_mb stanza to artificially increase the memory on the instances to fake "over-subscription" :)
Yes, that is the "factor" I am talking about. Better to lie in the agent config than to starve the task of resources 😆
On the roadmap - coming soon.
Glad to hear this is on the roadmap. Any ETA available?
Memory oversubscription has shipped in Nomad 1.1.0-beta. See #10247
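For anyone landing here later: as I understand it, the 1.1 feature adds a `memory_max` field to the task resources block, so the scheduler places by the lower reserved value while the task may burst up to the higher cap. A rough sketch with placeholder values; see #10247 and the 1.1 docs for the authoritative shape:

```hcl
resources {
  cpu        = 500
  memory     = 300  # MB reserved; used for scheduling / bin-packing
  memory_max = 1024 # MB hard cap the task may burst to when the node has headroom
}
```

If I recall correctly, memory oversubscription also needs to be enabled in the cluster's scheduler configuration before `memory_max` takes effect.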
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
I noticed scheduling decisions are being made based on what jobs report as desired capacity. However, some of the tasks involved might not really use what they originally requested, or will become idle, thus holding back resources that could be used by other tasks. Are there plans to improve this in the short to mid term?