Setup nodepool for neurohackademy #2758

Merged: 16 commits merged into 2i2c-org:master from the neurohackademy-nodepool branch on Jul 6, 2023

Conversation

sgibson91
Member

Reconstruction/reversion of PR #1726

Working towards #2681

@sgibson91 sgibson91 requested a review from a team as a code owner July 4, 2023 11:45
@sgibson91
Member Author

terraform plan output under the fold. I seem to have picked up some climatematch-related changes, which could either be related to #2757 or just an out-of-date state. I will stash my changes and try a refresh-only plan (terraform plan -refresh-only) first.

tf plan output:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.dask_worker["worker"] will be updated in-place
  ~ resource "google_container_node_pool" "dask_worker" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/dask-worker"
        name                        = "dask-worker"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["climatematch"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-climatematch"
        name                        = "nb-climatematch"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = (known after apply)
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n1-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-user"
        name                        = "nb-user"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

Plan: 1 to add, 3 to change, 0 to destroy.

Changes to Outputs:
  ~ regular_channel_latest_k8s_versions = {
      ~ "1."    = "1.26.3-gke.1000" -> "1.27.2-gke.1200"
      ~ "1.22." = "1.22.17-gke.8000" -> "1.22.17-gke.12700"
      ~ "1.23." = "1.23.17-gke.2000" -> "1.23.17-gke.6800"
      ~ "1.24." = "1.24.13-gke.2500" -> "1.24.14-gke.1200"
      ~ "1.25." = "1.25.8-gke.1000" -> "1.25.10-gke.1200"
    }

Contributor

@pnasrat pnasrat left a comment

LGTM

Please add the requested comment referring to the issue tracking this event, then I will approve.
@@ -35,6 +35,15 @@ notebook_nodes = {
resource_labels : {
"community" : "climatematch"
}
},
"neurohackademy" : {
# We expect around 120 users
Contributor

Can you add a comment linking to the issue?

Member Author
Done!
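
For illustration only, here is a hedged sketch of the kind of tfvars entry the diff above adds. The key names mirror the climatematch entry visible in the excerpt, but the specific values are assumptions rather than the actual change:

    # Hypothetical sketch of the neurohackademy entry in notebook_nodes;
    # values are assumed, not copied from the real diff.
    "neurohackademy" : {
      # We expect around 120 users
      # Working towards #2681 (Neurohackademy event nodepool)
      min : 1,
      max : 100,
      machine_type : "n2-highmem-16",   # assumed; matches the plan output below
      resource_labels : {
        "community" : "neurohackademy"
      }
    },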

@sgibson91
Member Author

sgibson91 commented Jul 4, 2023

New tf plan output below. Still seeing some unrelated changes, but I think they are non-destructive.

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.dask_worker["worker"] will be updated in-place
  ~ resource "google_container_node_pool" "dask_worker" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/dask-worker"
        name                        = "dask-worker"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["climatematch"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-climatematch"
        name                        = "nb-climatematch"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = (known after apply)
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-user"
        name                        = "nb-user"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

Plan: 1 to add, 3 to change, 0 to destroy.

Contributor

@pnasrat pnasrat left a comment
LGTM

@sgibson91
Member Author

sgibson91 commented Jul 4, 2023

OK, the changes to the climatematch, user, and worker nodepools are causing issues for my changes:

google_container_node_pool.notebook["climatematch"]: Modifying... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-climatematch]
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0x470707e08556526a"
│   }
│ ]
│ , badRequest
│ 
│   with google_container_node_pool.notebook["climatematch"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵
╷
│ Error: Cannot determine zone: set in this resource, or set provider-level zone.
│ 
│   with google_container_node_pool.notebook["neurohackademy"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0xcb33b6fe2284f331"
│   }
│ ]
│ , badRequest
│ 
│   with google_container_node_pool.notebook["user"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0xbde1b90f3b71a878"
│   }
│ ]
│ , badRequest
│ 
│   with google_container_node_pool.dask_worker["worker"],
│   on cluster.tf line 339, in resource "google_container_node_pool" "dask_worker":
│  339: resource "google_container_node_pool" "dask_worker" {

@sgibson91
Member Author

I have now defined the expected zones for those nodepools; the new plan output is below and I suspect it will apply cleanly now.

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = (known after apply)
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

@sgibson91
Member Author

OK, it did not apply cleanly, and I will have to define zones for the neurohackademy nodepool too. I have opened #2759, as I suspect this is a bug in how we have constructed the Terraform config.

@sgibson91
Member Author

Nope, I am still seeing the error even when I have defined a zone for the nodepool:

google_container_node_pool.notebook["neurohackademy"]: Creating...
╷
│ Error: Cannot determine zone: set in this resource, or set provider-level zone.
│ 
│   with google_container_node_pool.notebook["neurohackademy"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵


@pnasrat
Contributor

pnasrat commented Jul 5, 2023

@sgibson91 I'm also seeing issues with node pools and terraform; I'll try to debug some this morning.

@yuvipanda
Member

@sgibson91 @pnasrat I've fixed this up with the following set of changes:

  1. 658ec6c, the primary issue here: terraform state show google_container_cluster.cluster showed me that google_container_cluster.cluster.node_locations is just always null, which is what was causing the issue we were facing (see the illustrative sketch after this comment).
  2. Moved back to n1-highmem, because we didn't have enough quota for n2!
  3. Added taints and resource labels for the dedicated nodepool, as per https://infrastructure.2i2c.org/howto/features/dedicated-nodepool/ (this is only a few weeks old). Without the taints, our dedicated nodepools weren't actually truly dedicated before.

I've applied this as well, so it's ready to merge.
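
For context, a minimal sketch (not the actual 2i2c module code) of how a zonal node pool can be pinned to an explicit zone when the parent cluster's node_locations attribute comes back null; var.zone is a hypothetical variable name standing in for however the zone is configured:

    # Sketch only: gives the node pool an explicit zone so the provider never
    # has to infer one from the (null) cluster attribute or provider config.
    resource "google_container_node_pool" "notebook" {
      name    = "nb-neurohackademy"
      cluster = google_container_cluster.cluster.name

      # An explicit zone here avoids "Cannot determine zone" when neither the
      # resource nor the provider defines one.
      location       = var.zone       # e.g. "us-central1-b"
      node_locations = [var.zone]     # pin the nodes to the same zone

      initial_node_count = 1

      node_config {
        machine_type = "n1-highmem-16"
      }
    }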

@yuvipanda
Member

yuvipanda commented Jul 5, 2023

The reason this was not caught as part of #2406 is that regional clusters do have google_container_cluster.cluster.node_locations defined correctly. I have validated that the leap tfvars still apply after this change.

@yuvipanda yuvipanda mentioned this pull request Jul 5, 2023
@yuvipanda
Member

Given that the event only starts on August 7th, I've set the minimum nodepool size to 0, not 1, as otherwise we'll be spending a lot of money on n1-highmem-16 in that time period. I'll document this shortly.
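
In terms of the google_container_node_pool resource shown in the plans above, the change described here is just the autoscaling floor; a minimal sketch (the other value is carried over from the plan output, not the final config):

    autoscaling {
      min_node_count = 0     # scale to zero until the event starts on August 7th
      max_node_count = 100   # as in the plan output above
    }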

@sgibson91
Member Author

Thank you @yuvipanda!

@sgibson91 sgibson91 merged commit dc207ab into 2i2c-org:master Jul 6, 2023
@sgibson91 sgibson91 deleted the neurohackademy-nodepool branch July 6, 2023 08:52