
[Agreement needed]: add docs about the notebook node pool default choices #3304

Merged · 6 commits merged into 2i2c-org:master from document-instance-type on Oct 25, 2023

Conversation

@GeorgianaElena (Member) commented Oct 23, 2023:

For #3256, which has more info about the motivation behind this.

Action points

Future work planned

Once this PR is merged, the plan is to:

  • Update the remaining clusters to make these three choices available to them, but without migrating the ones using a different type. This is out of scope for now, but planned to happen as part of node sharing adoption.
  • I would also like to create some user-facing documentation about this topic and others under https://infrastructure.2i2c.org/topic/infrastructure/cluster-design/

- n2-highmem-16
- n2-highmem-64
- [EKS](https://aws.amazon.com/ec2/instance-types/r5/)
- r5.xlarge
Member commented:

I think having this be a set number is a very good idea!

I made this dashboard just now to look at memory utilization % across our clusters: https://grafana.pilot.2i2c.cloud/d/ed8d55b8-54c7-4658-bea0-f9659a9b7c33/global-resource-usage?orgId=1&from=now-30d&to=now

This is the actual amount of memory being used on all nodes. Utilization is consistently pretty low, with some small exceptions. The second row only shows nodes that have had 50+% utilization at any point - and you can see that it's almost empty. And this costs a lot of money. Based on that dashboard, plus the cost reduction we saw in openscapes when we moved to smaller node sizes, my suggestion here is to instead use:

  • r5.xlarge
  • r5.2xlarge
  • r5.8xlarge

(and their equivalents on the other cloud providers).

I think based on actual usage, these are a better fit. r5.8xlarge is already 256GB, and I think that's quite a lot. There are going to be a few communities (like JMTE) that will want more, and they can be handled separately. In addition, on AWS the maximum number of pods on a node is also smaller (a spot check showed me 58, although it probably varies) compared to GKE (at about 100).
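
As a side note, the 58 figure is consistent with the default AWS VPC CNI pod limit for r5.xlarge; assuming that is where the spot-check value comes from, it follows from the standard ENI-based formula:

$$
\text{max pods} = \text{ENIs} \times (\text{IPv4 per ENI} - 1) + 2 = 4 \times (15 - 1) + 2 = 58
$$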

So,

  1. Strong yes for just picking a set of node sizes
  2. Strong yes for those to be the 'memory intensive' node types (r5, n2-highmem)
  3. Based on existing data, we should pick smaller sizes.

Thanks for working on these!

Member commented:

Ideally, we could also use a graph that is a histogram of how much memory people actually requested, and use that along with max pods to make a more informed decision. However, I spent a bunch of time trying to get that to work in Grafana and couldn't. Given that, plus the openscapes experience and the current dataset, I suggest we go with the size reduction I have proposed here and reconsider in the future if needed.

@consideRatio (Contributor) commented Oct 23, 2023:

Choice of 4 16 64 (set A) or 4 8 32 (set B)

Observations:

  • Set A is evenly distributed and has a wider span, while set B has higher resolution at the smaller sizes.
  • Set A is the current default in our terraform templates and many clusters already have set A configured

"Based on existing data, we should pick smaller sizes." doesn't seem to follow when I think about it. I reason like this:

  1. Set A and B alike have the 4 CPU / 32 GB node as the smallest size, and that can handle most resource allocation requests. Larger sizes are mostly something I see as a way to optimize a) startup times, and b) node utilization for CPU or memory resources whose requests are lower than their limits, since larger nodes enable higher utilization with less risk of running out of the resource.
  2. A 4x multiplier between instance sizes seems sufficient for the resolution when choosing how many users we enable to schedule per node on average.
  3. The very large 64 CPU nodes (and 32 CPU) are mostly going to be used during events:
    1. There have been events where ~8 GB or ~16 GB of memory is requested per user; a 64 node (512 GB) would fit 64 or 32 users, which could be reasonable "users per node" choices. 32 / 16 users per node is probably fine as well in many situations.
    2. Anyone pre-starting a 64 node would pre-start / pre-image-pull for double the users, minimizing delays without having us pre-start + pre-image-pull manually.

I'm currently favoring set A the most, and favoring set C of 4 8 16 64 over set B, I think. Set C would retain the wider 4-64 span of set A and provide the highest resolution for the smallest instance sizes, but increase the number of choices from 3 to 4. With a history of adopting set A for some new clusters, set C also overlaps better with set A.

@2i2c-org/engineering how do you rank these instance size combinations that we are considering ensuring are available across all our clusters?

  • A: 4 16 64
  • B: 4 8 32
  • C: 4 8 16 64

Member commented:

To make things clearer, I made a third row in https://grafana.pilot.2i2c.cloud/d/ed8d55b8-54c7-4658-bea0-f9659a9b7c33/global-resource-usage?orgId=1&var-PROMETHEUS_DS=All, which shows only nodes that have had 256GB+ memory usage even once. You'll see that it's completely empty over the last 30 days - we have had 0 nodes that ever used more than 256GB of RAM. Only one single cluster used that much in the last 90 days, for one day, across 3 nodes.

So yes, I'm advocating for much smaller instances, because our current instances are way too big and costing our users a lot of money.

Member commented:

Ok, I made another row, now for nodes that have used over 128GiB. The only cluster where that has happened is the same cluster that went over 256GB of RAM for a day, and nowhere else (over 90 days). This makes me conclude that even 256GB is too big, and we should go even smaller. I propose the equivalent of:

  • r5.xlarge
  • r5.2xlarge
  • r5.4xlarge
  • r5.8xlarge (optionally always present, but no profileList actually deploys to it by default. We can offer it if needed; by it being present by default, engineers don't have to touch eksctl / terraform when such a request comes in.)

Anything else should be considered a 'large node' and be deployed on user request, not provided by default.

Member commented:

I also explored memory requests, so there's another row in Grafana now showing nodes with memory requests that exceeded 256GiB.

Over the last 6 months, other than in two clusters that consistently have terabyte-sized requests (probably JMTE and carbonplan with their x instances?), I count 6 (or 9) separate instances where a node has existed with 256+ GB of memory requests. There's also one that had about 7T of memory requests, which I'm sure is some kind of error.

I added another row with nodes that have more than 128GiB of requests. There is definitely more here, but it's very periodic and has occurred even once in only 9 of our 28 clusters. To me, this continues to present a strong case for provisioning 4 8 16, with 32 provisioned but not used by default (only used during events). I'm alright with also provisioning 64 but not using it by default, and only using it for events under specific conditions.

Member commented:

So I guess my concrete proposal is:

On Terraform / eksctl

Provisioning node groups here has no cost, and helps engineers make adjustments to profileList more easily with less toil, so it's ok to provision node groups here that aren't actively in use via profileList. So we provision the following (a rough terraform sketch for the GCP case follows after the lists below):

On GCP

  1. n2-highmem-4
  2. n2-highmem-8
  3. n2-highmem-16
  4. n2-highmem-32

On AWS

  1. r5.xlarge
  2. r5.2xlarge
  3. r5.4xlarge
  4. r5.8xlarge

On Azure
(equivalent memory sizes, as their website is far too confusing)
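
To make the provisioning side concrete, here is a minimal terraform sketch of the GCP case, assuming a 2i2c-style `notebook_nodes` map in a cluster's `.tfvars`; the exact field names and min/max values are illustrative assumptions, not the real config:

```terraform
# Hypothetical cluster .tfvars fragment: one node group per proposed size.
# With min at 0, provisioning these costs nothing until a profileList
# entry actually schedules user servers onto them.
notebook_nodes = {
  "n2-highmem-4" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-4",
  },
  "n2-highmem-8" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-8",
  },
  "n2-highmem-16" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-16",
  },
  "n2-highmem-32" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-32",
  },
}
```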

Default ProfileList

This is more important for actual usage and UX, and from what I can tell, this PR doesn't actually say anything about profileList at all. Which is perfectly fine! I can provide similar feedback whenever that is being worked on, and we needn't concern ourselves with that in this PR.

Contributor commented:

I don't feel heard for what I've said, @yuvipanda, and maybe you don't either. I don't think we will be if this discussion is continued async. Can we chat about this sync a bit?

Member commented:

I'm sorry you don't feel heard, @consideRatio. I don't either :( I'll reach out on slack.

Member commented:

We had a nice conversation on Slack, and #3304 (review) is a good way forward. We can come back to #3307 later.

@@ -89,6 +91,28 @@ that `prometheus-server` may require more memory than is available.
On EKS we always use the `r5.xlarge` nodes to avoid running low on allocatable
pods.

#### For notebook node pool
Contributor commented:

I think the use of "notebook" is less clear than "user server" or "jupyter server", and is mostly motivated by legacy reasons from when https://github.com/jupyter/notebook was the main jupyter server around.

Can we rework this to "user server"?

@sgibson91 (Member) commented Oct 24, 2023:

While I totally agree with you regarding this language in user-facing docs, these are engineering-facing docs referring to node pools, and the node pool is literally called "notebook" right now.

So we should either have language that is consistent with our code (i.e., revert back to saying "notebook node pool") or update our code to deploy node pools called "user-server" (which will be a lot of work, but doesn't mean we shouldn't do it necessarily). Third option as interim solution: keep the heading as "user server node pool" and add a callout explaining that in our infrastructure, these are called "notebook node pools".

resource "google_container_node_pool" "notebook" {

Member commented:

I am trying to avoid a situation where an engineer is confused by looking for a node pool in a cloud console or our terraform config called "user server" and it's because it is called "notebook" instead. I wouldn't oppose a proposal where we update our terraform to use "user server" instead. But if we merge docs now, they should be consistent with what exists now since we wouldn't get around to doing that renaming work for a while.

Contributor commented:

I think if we stick to speaking only about notebook node pools, we still need to explain in the end that these run user servers anyhow. So going with "user server" straight up is a plus, but if we retain use of "notebook" somewhere, we still need to clarify that in the end no matter what.

To change `resource "google_container_node_pool" "notebook" {` in terraform, we need to recreate things - right? I'll check... If we need to re-create things in order to rename it, I think we should just stick with the current name behind the scenes and add a comment that "notebook" refers to "user server".

@consideRatio (Contributor) commented Oct 24, 2023:

Changing `resource "google_container_node_pool" "notebook" {` prompted a recreation of all nodes, while changing variable names like `notebook_nodes` didn't - though it caused some node version upgrades.

I'm open to settling for anything, but I don't think we can find the capacity to re-create all node pools for a naming update in terraform config. With that in mind, I lean towards compromises where we still steer towards speaking about "user servers" primarily, and add callouts or comments noting that they are sometimes referred to as "notebook nodes" behind the scenes - something like the sketch below.
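
A rough sketch of that compromise (the field values and references here are illustrative, and the real resource has many more arguments) would be to keep the legacy resource name and document it in place:

```terraform
# NOTE: "notebook" is a legacy name kept because renaming the resource would
# force terraform to recreate the node pool; these nodes run user servers,
# and the docs refer to them as the "user server node pool".
resource "google_container_node_pool" "notebook" {
  name    = "nb-user"                            # illustrative value
  cluster = google_container_cluster.cluster.id  # illustrative reference
  # ... autoscaling, node_config, etc. omitted for brevity
}
```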

@GeorgianaElena (Member, Author) commented Oct 24, 2023:

@sgibson91, @consideRatio, I just added a new commit that tries to clarify the naming of the actual k8s resources vs. what they mean as a concept. Hope this solves both your concerns. LMK what you think.


@yuvipanda (Member) left a comment:

Given that there are differing opinions in https://github.com/2i2c-org/infrastructure/pull/3304/files#r1368816014, I think the way forward, given my time constraints, is for me to just approve this, since it's a strict improvement over what exists as is. We can discuss reducing the node size separately, and perhaps that can happen more organically when profileLists are discussed as well. As such, I'm approving these changes so I don't block this work.

I generally have a high level of trust in @consideRatio and @GeorgianaElena to get this right, even if it won't be exactly what I would do :) So I'm happy to come back to this later and not stop their momentum.

@GeorgianaElena (Member, Author) commented:

Note that I plan to merge this at the end of my workday, which is in ~2h if there aren't any objections 🚀

GeorgianaElena and others added 2 commits October 25, 2023 19:04
Co-authored-by: Sarah Gibson <[email protected]>
Co-authored-by: Sarah Gibson <[email protected]>
@GeorgianaElena (Member, Author) commented:

Thank you for catching all those typos, @sgibson91 🚀 Merging this in a sec.

@GeorgianaElena GeorgianaElena merged commit ca9af73 into 2i2c-org:master Oct 25, 2023
@GeorgianaElena GeorgianaElena deleted the document-instance-type branch October 25, 2023 16:09