[Agreement needed]: add docs about the notebook node pool default choices #3304
Conversation
- n2-highmem-16
- n2-highmem-64
- [EKS](https://aws.amazon.com/ec2/instance-types/r5/)
- r5.xlarge
I think having this be a set number is a very good idea!
I made this dashboard just now to look at memory utilization % across our clusters: https://grafana.pilot.2i2c.cloud/d/ed8d55b8-54c7-4658-bea0-f9659a9b7c33/global-resource-usage?orgId=1&from=now-30d&to=now
This shows the actual amount of memory being used on all nodes. Utilization is consistently pretty low, with some small exceptions. The second row only shows nodes that have had 50+% utilization at any point - and you can see that it's almost empty. And this costs a lot of money. Based on that dashboard, plus the cost reduction we saw in openscapes when we moved to a smaller set of node sizes, my suggestion here is to instead use:
- r5.xlarge
- r5.2xlarge
- r5.8xlarge
(and similar equivalent for the other cloud providers).
I think based on actual usage, these are a better fit. r5.8xlarge is already 256GB, and I think that's quite a lot. There are going to be a few communities (like JMTE) that will want more, and they can be handled separately. In addition, on AWS the maximum number of pods per node is also smaller (a spot check showed me 58, although it probably varies) compared to GKE (at about 100).
So,
- Strong yes for just picking a set of node sizes
- Strong yes for those to be the 'memory intense' node types (r5, n2-highmem)
- Based on existing data, we should pick smaller sizes.
Thanks for working on these!
Ideally, we could also use a graph that is a histogram of 'how much memory did people actually request?' and use that along with max pods to make a more informed decision. However, I spent a bunch of time trying to get that to work in Grafana and couldn't. Given that, plus the openscapes experience and the current dataset, I suggest we go with the size reduction I have proposed here and reconsider in the future if needed.
Choice of 4 / 16 / 64 (set A) or 4 / 8 / 32 (set B), referring to the CPU counts of the highmem machine types.
Observations:
- Set A is evenly distributed and has a wider span, while set B has higher resolution at the smaller sizes.
- Set A is the current default in our terraform templates and many clusters already have set A configured
"Based on existing data, we should pick smaller sizes." doesn't seem to follow when I think about it. I reason like this:
- Set A and B alike have the 4 CPU / 32 GB node as the smallest size and can handle most resource allocation requests. Larger sizes are mostly something I see as a way to optimize a) startup times, and b) node utilization when CPU or memory requests are lower than the limits, since that enables higher utilization with less risk of running out of the resource.
- A 4x multiplier between instance sizes seems like sufficient resolution when choosing how many users we allow to schedule per node on average.
- The very large 64 CPU (and 32 CPU) nodes are mostly going to be used during events:
  - There have been events where ~8 GB or ~16 GB of memory is requested per user; a 64 CPU node would fit 64 or 32 users, which could be reasonable "users per node" choices. 32 / 16 users per node is probably fine as well in many situations.
  - Anyone pre-starting a 64 CPU node would pre-start / pre-image-pull for double the users, minimizing delays without having us pre-start + pre-image-pull manually.
I'm currently favoring set A the most, and I favor set C (4 8 16 64) over set B. Set C would retain set A's wider 4-64 span and provide the highest resolution for the smallest instance sizes, but increases the number of choices from 3 to 4. With a history of adopting set A for some new clusters, set C also overlaps better with set A.
@2i2c-org/engineering how do you rank these instance size combinations that we are considering to make available across all our clusters?
- A: 4 16 64
- B: 4 8 32
- C: 4 8 16 64
To make things clearer, I made a third row in https://grafana.pilot.2i2c.cloud/d/ed8d55b8-54c7-4658-bea0-f9659a9b7c33/global-resource-usage?orgId=1&var-PROMETHEUS_DS=All, which shows only nodes that have had 256GB+ memory usage even once. You'll see that it's completely empty over the last 30 days - we have had 0 nodes that have ever used more than 256GB of RAM. Only a single cluster used that much in the last 90 days, for one day, across 3 nodes.
So yes, I'm advocating for much smaller instances, because our current instances are way too big and costing our users a lot of money.
Ok, I made another row, now for nodes that have used over 128GiB. Over 90 days, the only cluster where that has happened is the same one that went over 256GB of RAM for a day. This makes me conclude that even 256GB is too big, and we should go even smaller. I propose the equivalent of:
- r5.xlarge
- r5.2xlarge
- r5.4xlarge
- r5.8xlarge (optionally always present, but no profileList actually deploys to it by default. We can offer it if needed. By being present by default, engineers don't have to touch eksctl / terraform when such a request comes in.)
Anything else should be considered 'large nodes' and be deployed on user requests, not provided by default.
I also explored memory requests, so there's another row in Grafana now showing nodes with memory requests that exceeded 256GiB.
Over the last 6 months, other than in two clusters that consistently have terabyte-sized requests (probably JMTE and carbonplan with their x instances?), I count 6 (or 9) separate instances where a node has had 256+ GB of memory requests. There's also one that had about 7TB of memory requests, which I'm sure is some kind of error.
I added another row with nodes that have more than 128GiB of requests. There is definitely more here, but it's very periodic and has occurred even once in only 9 of our 28 clusters. To me, this continues to present a strong case for provisioning 4 8 16, with 32 provisioned but not used by default (only used during events). I'm alright with also provisioning 64 but not using it by default, and only using it for events under specific conditions.
So I guess my concrete proposal is:
On Terraform / eksctl
Provisioning node groups here has no cost, and helps engineers adjust the profileList more easily with less toil. So it's OK to provision node groups here that aren't actively in use via the profileList (see the rough terraform sketch after the lists below). We provision:
On GCP
- n2-highmem-4
- n2-highmem-8
- n2-highmem-16
- n2-highmem-32
On AWS
- r5.xlarge
- r5.2xlarge
- r5.4xlarge
- r5.8xlarge
On Azure
(equivalent memory sizes, as their website is far too confusing)
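For illustration, a rough sketch of what the GCP list could look like in a cluster's terraform values, using the existing `notebook_nodes` variable; the field names (`min`, `max`, `machine_type`) and value choices here are assumptions for the sketch, not necessarily the variable's exact schema:

```terraform
# Hypothetical .tfvars sketch: all four highmem sizes are provisioned, with
# min = 0 so node groups that no profileList uses scale to zero and cost nothing.
notebook_nodes = {
  "n2-highmem-4"  = { min = 0, max = 100, machine_type = "n2-highmem-4" },
  "n2-highmem-8"  = { min = 0, max = 100, machine_type = "n2-highmem-8" },
  "n2-highmem-16" = { min = 0, max = 100, machine_type = "n2-highmem-16" },
  "n2-highmem-32" = { min = 0, max = 100, machine_type = "n2-highmem-32" },
}
```

The AWS and Azure lists would translate the same way into their respective terraform / eksctl configs.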
Default ProfileList
This is more important for actual usage and UX, and from what I can tell, this PR doesn't actually say anything about profileList at all. Which is perfectly fine! I can provide similar feedback whenever that is being worked on, and we needn't concern ourselves with that in this PR.
I don't feel heard for what I've said, @yuvipanda, and maybe you don't either. I don't think either of us will if this discussion continues async. Can we chat about this sync for a bit?
I'm sorry you don't feel heard, @consideRatio. I don't either :( I'll reach out on slack.
We had a nice conversation on slack and #3304 (review) is a good way forward. We can come back to #3307 later.
@@ -89,6 +91,28 @@ that `prometheus-server` may require more memory than is available.
On EKS we always use the `r5.xlarge` nodes to avoid running low on allocatable
pods.

#### For notebook node pool
I think the use of "notebook" is less clear than "user server" or "jupyter server", and is mostly motivated by legacy reasons from when https://github.com/jupyter/notebook was the main jupyter server around.
Can we rework this to "user server"?
While I totally agree with you regarding this language in user-facing docs, these are engineering-facing docs referring to node pools, and the node pool is literally called "notebook" right now.
So we should either have language that is consistent with our code (i.e., revert back to saying "notebook node pool") or update our code to deploy node pools called "user-server" (which would be a lot of work, though that doesn't necessarily mean we shouldn't do it). A third option, as an interim solution: keep the heading as "user server node pool" and add a callout explaining that in our infrastructure, these are called "notebook node pools".
infrastructure/terraform/gcp/cluster.tf, line 243 in d4224ce:
resource "google_container_node_pool" "notebook" {
I am trying to avoid a situation where an engineer is confused because they are looking for a node pool called "user server" in a cloud console or our terraform config, when it is actually called "notebook". I wouldn't oppose a proposal where we update our terraform to use "user server" instead. But if we merge docs now, they should be consistent with what exists now, since we wouldn't get around to doing that renaming work for a while.
I think if we stick to speaking only about notebook node pools, we still need to explain that these are for user servers in the end. So going with "user server" straight up is a plus, but if we retain the use of "notebook" somewhere, we still need to clarify it no matter what.
To change `resource "google_container_node_pool" "notebook" {` in terraform, we need to recreate things - right? I'll check... If we need to re-create things in order to rename that, I think we should just stick with "notebook" behind the scenes and comment that it refers to "user server".
Changing `resource "google_container_node_pool" "notebook" {` prompted a recreation of all nodes, while changing variable names like `notebook_nodes` didn't - but it caused some node version upgrades.
I'm open to settling for anything, but I don't think we can find capacity to re-create all node pools for a naming update in the terraform config. With that in mind, I lean towards compromises where we still steer towards speaking about "user servers" primarily, and add callouts or comments noting that they are sometimes referred to as "notebook nodes" behind the scenes.
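For illustration, if we did decide to rename it at some point: Terraform >= 1.1 has `moved` blocks, which should let us change the resource label without destroying and recreating the underlying node pools, as long as the node pool `name` argument and other arguments stay the same. A minimal sketch (the `user_server` label is hypothetical):

```terraform
# Hypothetical rename sketch: after changing the resource label from
# "notebook" to "user_server" in cluster.tf, this block tells Terraform to
# treat it as a state move rather than a destroy + create, leaving the real
# GCP node pools untouched.
moved {
  from = google_container_node_pool.notebook
  to   = google_container_node_pool.user_server
}
```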
@sgibson91, @consideRatio, I just added a new commit that tries to clarify the naming of the actual k8s resources vs what they mean as a concept. Hope this solves both your concerns. Let me know what you think.
Given there are differing opinions in https://github.com/2i2c-org/infrastructure/pull/3304/files#r1368816014, I think the way forward is for me to just approve this given my time constraints, as it's a strict improvement over what exists. We can discuss reducing the node sizes separately, and perhaps that can happen more organically when profileLists are discussed as well. As such, I'm approving these changes so I don't block this work.
I generally have a high level of trust in @consideRatio and @GeorgianaElena to get this right, even if it won't be exactly what I would do :) So I'm happy to come back to this later and not stop their momentum.
Co-authored-by: Erik Sundell <[email protected]>
Note that I plan to merge this at the end of my workday, which is in ~2h, if there aren't any objections 🚀
Co-authored-by: Sarah Gibson <[email protected]>
Thank you for catching all those typos @sgibson91 🚀 Merging this in a sec
For #3256; that issue has more info about the motivation behind this.
Action points
Future work planned
Once this PR is merged, the plan is to: