
Tweak some of the neurohackademy hub resources #1554

Merged: 4 commits merged into 2i2c-org:master on Jul 24, 2022

Conversation

@GeorgianaElena (Member) commented Jul 22, 2022

Related to #1532

  • According to some notes from past events (thanks for those, @consideRatio), 6 GB of memory should be enough, so I'm setting the guarantee to 6G (see the sketch after this list).
  • Not sure about CPU though. Is a limit of 2 too little?
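
For reference, here is a minimal sketch of what this proposal would look like in the hub's values file, assuming the standard zero-to-jupyterhub singleuser resource keys (the actual nesting inside neurohackademy.values.yaml may differ); the CPU limit is the value being asked about, not a settled choice:

    singleuser:
      memory:
        guarantee: 6G   # guarantee based on the notes from past events
        limit: 8G       # limit shown in the diff discussed below
      cpu:
        limit: 2        # the value being questioned: is 2 too little?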

@github-actions

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| gcp | 2i2c | No | | Yes | Following prod hubs require redeploy: neurohackademy |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| gcp | 2i2c | neurohackademy | Following helm chart values files were modified: neurohackademy.values.yaml |

@damianavila requested review from @yuvipanda and a team, and removed the request for a team, on Jul 22, 2022 at 14:36
Comment on lines 48 to 50
memory:
guarantee: 6G
limit: 8G
Contributor

I see that the current machine choice is n1-highmem-4, with 4 CPUs and 26 GB of memory. The machine type used historically was m1-ultramem-40, with 40 CPUs and 961 GB of memory.

We planned to support up to 24 GB of memory use, but that turned out to be far more than needed. The highest memory use by any single user was ~6 GB. Given that, I think it makes sense to plan for each user using 3 GB of memory on average and to limit them at 6 GB.

Currently, if 6 GB of memory is guaranteed and each machine has ~26 GB of memory, four users fit per machine. Autoscaling is capped at 10 machines, so a total of ~40 users would be supported in this configuration.

Something should change; exactly what is not obvious, but I'd suggest using bigger machines than 4-core ones, such as n1-highmem-16 with 16 CPUs and 104 GB of RAM. If users average 3 GB of RAM each, roughly 30 fit per machine, so ~4 such machines would cover 120 users, with room to spare and the possibility to scale up and down a bit.

Concrete suggestion (a config sketch follows this list):

  • Use n1-highmem-16 machines

  • Limit memory to, for example, 8 GB, and guarantee 3 GB of memory per user, which puts an n1-highmem-16 machine at ~104/3 ≈ 34 users per machine.

  • Use a very high CPU limit, perhaps 50%-100% of the machine's total CPU, to make sure the CPU gets used fully; I can't think of drawbacks to that. Guarantee something low, such as 0.1 CPU, since the CPU guarantee isn't important as long as the number of users per machine is already capped by the memory guarantee.
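
A minimal sketch of this suggestion in zero-to-jupyterhub style singleuser values, assuming the same keys as the diff above (the exact nesting in the 2i2c values file may differ, and the CPU limit of 8 is just one point in the suggested 50%-100% range):

    singleuser:
      memory:
        guarantee: 3G   # plan for ~3 GB average per user
        limit: 8G       # hard cap well above the ~6 GB peak seen historically
      cpu:
        guarantee: 0.1  # low guarantee; memory is what caps users per node
        limit: 8        # ~50% of an n1-highmem-16's 16 CPUs

With a 3G guarantee, the scheduler would pack roughly 104/3 ≈ 34 users onto each n1-highmem-16 node before the autoscaler adds another.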

I think that overall, using a few larger machines where many (30-100) users fit on a single machine makes better use of the machine's CPU per user than housing a smaller number of users on a smaller machine, because people only use the CPU intermittently. Less overhead is also needed, since the deviation from the mean RAM usage per user shrinks as the number of users grows.

It's tricky to come up with sensible estimates for all of this, but I think the above should be fine.

Member

Thank you for these suggestions, @consideRatio. I've implemented all of them, with some minor tweaks. I do think it's important to set non-trivial CPU guarantees; otherwise it only takes two users on the same node running up to their CPU limit to squeeze everyone else down to roughly their CPU guarantee. So I've set it to 0.5.
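
Combining this comment with the commit notes further down, a hedged sketch of roughly what the implemented values look like (the memory numbers come from the commit messages; the CPU limit shown here is illustrative, since the exact figure isn't quoted in this thread):

    singleuser:
      memory:
        guarantee: 4G    # "4GB requests" per the commit notes below
        limit: 8G
      cpu:
        guarantee: 0.5   # the non-trivial guarantee discussed above
        limit: 12        # illustrative "very high" limit; actual value not quoted here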

Contributor

Should we document that recommendation somewhere in our infra docs?

@yuvipanda
Member

terraform plan is:

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + max_node_count = 100
          + min_node_count = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + machine_type      = "n1-highmem-16"
          + metadata          = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + shielded_instance_config {
              + enable_integrity_monitoring = (known after apply)
              + enable_secure_boot          = (known after apply)
            }

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }

      + upgrade_settings {
          + max_surge       = (known after apply)
          + max_unavailable = (known after apply)
        }
    }

GeorgianaElena and others added 4 commits July 24, 2022 12:16
- Use n1-highmem-16 nodes
- Provide 8GB limit but 4GB requests, so everyone is guaranteed
  at least 4GB
- Set a high CPU limit and a low CPU request. The request will make
  sure that everyone gets at least that much CPU - it only takes
  two users on the same VM to eat up all CPU if we don't set these
  requests
@yuvipanda yuvipanda mentioned this pull request Jul 24, 2022
@yuvipanda yuvipanda merged commit 99a3d4e into 2i2c-org:master Jul 24, 2022
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/workflows/deploy-hubs.yaml?query=branch%3Amaster
