
[Agreement needed]: add docs about the notebook node pool default choices #3304

Merged · 6 commits merged into 2i2c-org:master from document-instance-type on Oct 25, 2023

Conversation

@GeorgianaElena (Member) commented Oct 23, 2023:

For #3256, which has more info about the motivation behind this.

Action points

Future work planned

Once this PR is merged, the plan is to:

  • Update the remaining clusters to make these three choices available to them, but without migrating the ones using a different type. This is out of scope for now, but planned to happen as part of node sharing adoption.
  • I would also like to create some user-facing documentation about this topic and others under https://infrastructure.2i2c.org/topic/infrastructure/cluster-design/

- n2-highmem-16
- n2-highmem-64
- [EKS](https://aws.amazon.com/ec2/instance-types/r5/)
- r5.xlarge
Member commented:

I think having this be a set number is a very good idea!

I made this dashboard just now to look at memory utilization % across our clusters: https://grafana.pilot.2i2c.cloud/d/ed8d55b8-54c7-4658-bea0-f9659a9b7c33/global-resource-usage?orgId=1&from=now-30d&to=now

This is the actual amount of memory being used on all nodes. Utilization is consistently pretty low, with some small exceptions. The second row only shows nodes that have had 50+% utilization at any point - and you can see that it's almost empty. And this costs a lot of money. Based on that dashboard, plus the cost reduction we saw in openscapes when we moved to smaller node sizes, my suggestion here is to instead use:

  • r5.xlarge
  • r5.2xlarge
  • r5.8xlarge

(and their equivalents on the other cloud providers).

I think based on actual usage, these are a better fit. r5.8xlarge is already 256GB, and I think that's quite a lot. There are going to be a few communities (like JMTE) that will want more, and they can be handled separately. In addition, on AWS the maximum number of pods on a node is also smaller (a spot check showed me 58, although it probably varies) compared to GKE (at about 100).
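
As a side note, the 58 figure is consistent with the default AWS VPC CNI pod limit for r5.xlarge; assuming that is where the spot-check value comes from, it follows from the standard ENI-based formula:

$$
\text{max pods} = \text{ENIs} \times (\text{IPv4 per ENI} - 1) + 2 = 4 \times (15 - 1) + 2 = 58
$$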

So,

  1. Strong yes for just picking a set of node sizes
  2. Strong yes for those to be the 'memory intensive' node types (r5, n2-highmem)
  3. Based on existing data, we should pick smaller sizes.

Thanks for working on these!

Member commented:

Ideally, we could also use a graph that is a histogram of how much memory people actually requested, and use that along with max pods to make a more informed decision. However, I spent a bunch of time trying to get that to work in Grafana and couldn't. Given that, plus the openscapes experience and the current dataset, I suggest we go with the size reduction I have proposed here and reconsider in the future if needed.

@consideRatio (Contributor) commented Oct 23, 2023:

Choice of 4 16 64 (set A) or 4 8 32 (set B)

Observations:

  • Set A is evenly distributed and has a wider span, while set B has higher resolution at the smaller sizes.
  • Set A is the current default in our terraform templates and many clusters already have set A configured

"Based on existing data, we should pick smaller sizes." doesn't seem to follow when I think about it. I reason like this:

  1. Set A and B alike have the 4 CPU / 32 GB node as the smallest size, and that can handle most resource allocation requests. Larger sizes are mostly something I see as a way to optimize a) startup times, and b) node utilization for CPU or memory resources whose requests are lower than their limits, since larger nodes enable higher utilization with less risk of running out of the resource.
  2. A 4x multiplier between instance sizes seems sufficient for the resolution when choosing how many users we enable to schedule per node on average.
  3. The very large 64 CPU nodes (and 32 CPU) are mostly going to be used during events:
    1. There have been events where ~8 GB or ~16 GB of memory is requested per user; a 64 node (512 GB) would fit 64 or 32 users, which could be reasonable "users per node" choices. 32 / 16 users per node is probably fine as well in many situations.
    2. Anyone pre-starting a 64 node would pre-start / pre-image-pull for double the users, minimizing delays without having us pre-start + pre-image-pull manually.

I'm currently favoring set A the most, and favoring set C of 4 8 16 64 over set B, I think. Set C would retain the wider 4-64 span of set A and provide the highest resolution for the smallest instance sizes, but increase the number of choices from 3 to 4. With a history of adopting set A for some new clusters, set C also overlaps better with set A.

@2i2c-org/engineering how do you rank these instance size combinations that we are considering ensuring are available across all our clusters?

  • A: 4 16 64
  • B: 4 8 32
  • C: 4 8 16 64

Member commented:

To make things clearer, I made a third row in https://grafana.pilot.2i2c.cloud/d/ed8d55b8-54c7-4658-bea0-f9659a9b7c33/global-resource-usage?orgId=1&var-PROMETHEUS_DS=All, which shows only nodes that have had 256GB+ memory usage even once. You'll see that it's completely empty over the last 30 days - we have had 0 nodes that ever used more than 256GB of RAM. Only one single cluster used that much in the last 90 days, for one day, across 3 nodes.

So yes, I'm advocating for much smaller instances, because our current instances are way too big and costing our users a lot of money.

Member commented:

Ok, I made another row, now for nodes that have used over 128GiB. The only cluster where that has happened is the same cluster that went over 256GB of RAM for a day, and nowhere else (over 90 days). This makes me conclude that even 256GB is too big, and we should go even smaller. I propose the equivalent of:

  • r5.xlarge
  • r5.2xlarge
  • r5.4xlarge
  • r5.8xlarge (optionally always present, but no profileList actually deploys to it by default. We can offer it if needed; by it being present by default, engineers don't have to touch eksctl / terraform when such a request comes in.)

Anything else should be considered a 'large node' and be deployed on user request, not provided by default.

Member commented:

I also explored memory requests, so there's another row in Grafana now showing nodes with memory requests that exceeded 256GiB.

Over the last 6 months, other than in two clusters that consistently have terabyte-sized requests (probably JMTE and carbonplan with their x instances?), I count 6 (or 9) separate instances where a node has existed with 256+ GB of memory requests. There's also one that had about 7T of memory requests, which I'm sure is some kind of error.

I added another row with nodes that have more than 128GiB of requests. There is definitely more here, but it's very periodic and has occurred even once in only 9 of our 28 clusters. To me, this continues to present a strong case for provisioning 4 8 16, with 32 provisioned but not used by default (only used during events). I'm alright with also provisioning 64 but not using it by default, and only using it for events under specific conditions.

Member commented:

So I guess my concrete proposal is:

On Terraform / eksctl

Provisioning node groups here has no cost, and helps engineers make adjustments to profileList more easily with less toil, so it's ok to provision node groups here that aren't actively in use via profileList. So we provision the following (a rough terraform sketch for the GCP case follows after the lists below):

On GCP

  1. n2-highmem-4
  2. n2-highmem-8
  3. n2-highmem-16
  4. n2-highmem-32

On AWS

  1. r5.xlarge
  2. r5.2xlarge
  3. r5.4xlarge
  4. r5.8xlarge

On Azure
(equivalent memory sizes, as their website is far too confusing)
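
To make the provisioning side concrete, here is a minimal terraform sketch of the GCP case, assuming a 2i2c-style `notebook_nodes` map in a cluster's `.tfvars`; the exact field names and min/max values are illustrative assumptions, not the real config:

```terraform
# Hypothetical cluster .tfvars fragment: one node group per proposed size.
# With min at 0, provisioning these costs nothing until a profileList
# entry actually schedules user servers onto them.
notebook_nodes = {
  "n2-highmem-4" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-4",
  },
  "n2-highmem-8" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-8",
  },
  "n2-highmem-16" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-16",
  },
  "n2-highmem-32" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-32",
  },
}
```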

Default ProfileList

This is more important for actual usage and UX, and from what I can tell, this PR doesn't actually say anything about profileList at all. Which is perfectly fine! I can provide similar feedback whenever that is being worked on, and we needn't concern ourselves with that in this PR.

Contributor commented:

I don't feel heard for what I've said, @yuvipanda, and maybe you don't either. I don't think we will be if this discussion is continued async. Can we chat about this sync a bit?

Member commented:

I'm sorry you don't feel heard, @consideRatio. I don't either :( I'll reach out on slack.

Member commented:

We had a nice conversation on Slack, and #3304 (review) is a good way forward. We can come back to #3307 later.

@@ -89,6 +91,28 @@ that `prometheus-server` may require more memory than is available.
On EKS we always use the `r5.xlarge` nodes to avoid running low on allocatable
pods.

#### For notebook node pool
Contributor commented:

I think the use of "notebook" is less clear than "user server" or "jupyter server", and is mostly motivated by legacy reasons from when https://github.com/jupyter/notebook was the main jupyter server around.

Can we rework this to "user server"?

@sgibson91 (Member) commented Oct 24, 2023:

While I totally agree with you regarding this language in user-facing docs, these are engineering-facing docs referring to node pools, and the node pool is literally called "notebook" right now.

So we should either have language that is consistent with our code (i.e., revert back to saying "notebook node pool") or update our code to deploy node pools called "user-server" (which will be a lot of work, but doesn't mean we shouldn't do it necessarily). Third option as interim solution: keep the heading as "user server node pool" and add a callout explaining that in our infrastructure, these are called "notebook node pools".

resource "google_container_node_pool" "notebook" {

Member commented:

I am trying to avoid a situation where an engineer is confused by looking for a node pool in a cloud console or our terraform config called "user server" and it's because it is called "notebook" instead. I wouldn't oppose a proposal where we update our terraform to use "user server" instead. But if we merge docs now, they should be consistent with what exists now since we wouldn't get around to doing that renaming work for a while.

Contributor commented:

I think if we stick to speaking only about notebook node pools, we still need to explain in the end that these run user servers anyhow. So going with "user server" straight up is a plus, but if we retain use of "notebook" somewhere, we still need to clarify that in the end no matter what.

To change `resource "google_container_node_pool" "notebook" {` in terraform, we need to recreate things - right? I'll check... If we need to re-create things in order to rename it, I think we should just stick with the current name behind the scenes and add a comment that "notebook" refers to "user server".

@consideRatio (Contributor) commented Oct 24, 2023:

Changing `resource "google_container_node_pool" "notebook" {` prompted a recreation of all nodes, while changing variable names like `notebook_nodes` didn't - though it caused some node version upgrades.

I'm open to settling for anything, but I don't think we can find the capacity to re-create all node pools for a naming update in terraform config. With that in mind, I lean towards compromises where we still steer towards speaking about "user servers" primarily, and add callouts or comments noting that they are sometimes referred to as "notebook nodes" behind the scenes - something like the sketch below.
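
A rough sketch of that compromise (the field values and references here are illustrative, and the real resource has many more arguments) would be to keep the legacy resource name and document it in place:

```terraform
# NOTE: "notebook" is a legacy name kept because renaming the resource would
# force terraform to recreate the node pool; these nodes run user servers,
# and the docs refer to them as the "user server node pool".
resource "google_container_node_pool" "notebook" {
  name    = "nb-user"                            # illustrative value
  cluster = google_container_cluster.cluster.id  # illustrative reference
  # ... autoscaling, node_config, etc. omitted for brevity
}
```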

@GeorgianaElena (Member, Author) commented Oct 24, 2023:

@sgibson91, @consideRatio, I just added a new commit that tries to clarify the naming of the actual k8s resources vs. what they mean as a concept. Hope this solves both your concerns. LMK what you think.


@yuvipanda (Member) left a comment:

Given that there are differing opinions in https://github.com/2i2c-org/infrastructure/pull/3304/files#r1368816014, I think the way forward, given my time constraints, is for me to just approve this, since it's a strict improvement over what exists as is. We can discuss reducing the node size separately, and perhaps that can happen more organically when profileLists are discussed as well. As such, I'm approving these changes so I don't block this work.

I generally have a high level of trust in @consideRatio and @GeorgianaElena to get this right, even if it won't be exactly what I would do :) So I'm happy to come back to this later and not stop their momentum.

@GeorgianaElena (Member, Author) commented:

Note that I plan to merge this at the end of my workday, which is in ~2h if there aren't any objections 🚀

GeorgianaElena and others added 2 commits October 25, 2023 19:04
Co-authored-by: Sarah Gibson <[email protected]>
Co-authored-by: Sarah Gibson <[email protected]>
@GeorgianaElena (Member, Author) commented:

Thank you for catching all those typos, @sgibson91 🚀 Merging this in a sec.

@GeorgianaElena GeorgianaElena merged commit ca9af73 into 2i2c-org:master Oct 25, 2023
@GeorgianaElena GeorgianaElena deleted the document-instance-type branch October 25, 2023 16:09