
Tweak some of the neurohackademy hub resources #1554

Merged: 4 commits merged into 2i2c-org:master on Jul 24, 2022

Conversation

@GeorgianaElena (Member) commented Jul 22, 2022

Related to #1532

  • According to some notes from past events (thanks for those, @consideRatio), 6 GB of memory should be enough, so I'm setting the guarantee to 6G (see the sketch after this list).
  • Not sure about CPU though. Is a limit of 2 too little?
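
For reference, here is a minimal sketch of what this proposal would look like in the hub's values file, assuming the standard zero-to-jupyterhub singleuser resource keys (the actual nesting inside neurohackademy.values.yaml may differ); the CPU limit is the value being asked about, not a settled choice:

    singleuser:
      memory:
        guarantee: 6G   # guarantee based on the notes from past events
        limit: 8G       # limit shown in the diff discussed below
      cpu:
        limit: 2        # the value being questioned: is 2 too little?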

@github-actions

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| gcp | 2i2c | No | | Yes | Following prod hubs require redeploy: neurohackademy |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| gcp | 2i2c | neurohackademy | Following helm chart values files were modified: neurohackademy.values.yaml |

@damianavila requested review from @yuvipanda and a team, and removed the request for a team, on Jul 22, 2022 at 14:36
Comment on lines 48 to 50
memory:
guarantee: 6G
limit: 8G
Contributor

I see that the current machine choice is n1-highmem-4, with 4 CPUs and 26 GB of memory. The machine type used historically was m1-ultramem-40, with 40 CPUs and 961 GB of memory.

We planned to support up to 24 GB of memory use, but that turned out to be far more than needed. The highest memory use by any single user was ~6 GB. Given that, I think it makes sense to plan for each user using 3 GB of memory on average and to limit them at 6 GB.

Currently, if 6 GB of memory is guaranteed and each machine has ~26 GB of memory, four users fit per machine. Autoscaling is capped at 10 machines, so a total of ~40 users would be supported in this configuration.

Something should change; exactly what is not obvious, but I'd suggest using bigger machines than 4-core ones, such as n1-highmem-16 with 16 CPUs and 104 GB of RAM. If users average 3 GB of RAM each, roughly 30 fit per machine, so ~4 such machines would cover 120 users, with room to spare and the possibility to scale up and down a bit.

Concrete suggestion (a config sketch follows this list):

  • Use n1-highmem-16 machines

  • Limit memory to, for example, 8 GB, and guarantee 3 GB of memory per user, which puts an n1-highmem-16 machine at ~104/3 ≈ 34 users per machine.

  • Use a very high CPU limit, perhaps 50%-100% of the machine's total CPU, to make sure the CPU gets used fully; I can't think of drawbacks to that. Guarantee something low, such as 0.1 CPU, since the CPU guarantee isn't important as long as the number of users per machine is already capped by the memory guarantee.
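
A minimal sketch of this suggestion in zero-to-jupyterhub style singleuser values, assuming the same keys as the diff above (the exact nesting in the 2i2c values file may differ, and the CPU limit of 8 is just one point in the suggested 50%-100% range):

    singleuser:
      memory:
        guarantee: 3G   # plan for ~3 GB average per user
        limit: 8G       # hard cap well above the ~6 GB peak seen historically
      cpu:
        guarantee: 0.1  # low guarantee; memory is what caps users per node
        limit: 8        # ~50% of an n1-highmem-16's 16 CPUs

With a 3G guarantee, the scheduler would pack roughly 104/3 ≈ 34 users onto each n1-highmem-16 node before the autoscaler adds another.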

I think that overall, using a few larger machines where many (30-100) users fit on a single machine makes better use of the machine's CPU per user than housing a smaller number of users on a smaller machine, because people only use the CPU intermittently. Less overhead is also needed, since the deviation from the mean RAM usage per user shrinks as the number of users grows.

It's tricky to come up with sensible estimates for all of this, but I think the above should be fine.

Member

Thank you for these suggestions, @consideRatio. I've implemented all of them, with some minor tweaks. I do think it's important to set non-trivial CPU guarantees; otherwise it only takes two users on the same node running up to their CPU limit to squeeze everyone else down to roughly their CPU guarantee. So I've set it to 0.5.
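
Combining this comment with the commit notes further down, a hedged sketch of roughly what the implemented values look like (the memory numbers come from the commit messages; the CPU limit shown here is illustrative, since the exact figure isn't quoted in this thread):

    singleuser:
      memory:
        guarantee: 4G    # "4GB requests" per the commit notes below
        limit: 8G
      cpu:
        guarantee: 0.5   # the non-trivial guarantee discussed above
        limit: 12        # illustrative "very high" limit; actual value not quoted here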

Contributor

Should we document that recommendation somewhere in our infra docs?

@yuvipanda
Member

terraform plan is:

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + max_node_count = 100
          + min_node_count = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-16"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

GeorgianaElena and others added 4 commits July 24, 2022 12:16
- Use n1-highmem-16 nodes
- Provide 8GB limit but 4GB requests, so everyone is guaranteed
  at least 4GB
- Set a high CPU limit and a low CPU request. The request will make
  sure that everyone gets at least that much CPU - it only takes
  two users on the same VM to eat up all CPU if we don't set these
  requests
@yuvipanda yuvipanda mentioned this pull request Jul 24, 2022
@yuvipanda yuvipanda merged commit 99a3d4e into 2i2c-org:master Jul 24, 2022
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/workflows/deploy-hubs.yaml?query=branch%3Amaster
