Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make all GCP clusters support the instance types 4, 16, and 64 CPU highmem nodes #3319

Merged

Conversation

GeorgianaElena
Copy link
Member

@GeorgianaElena GeorgianaElena commented Oct 24, 2023

Follow-up to #3304
Fixes #3256

TODO

Run terraform plan & apply for:

  • 2i2c-uk.tfvars
terraform plan -var-file=projects/callysto.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.notebook["n2-highmem-16"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "two-eye-two-see-uk-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "europe-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-16"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see-uk"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "two-eye-two-see-uk-cluster-sa@two-eye-two-see-uk.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "two-eye-two-see-uk-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "europe-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see-uk"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "two-eye-two-see-uk-cluster-sa@two-eye-two-see-uk.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see-uk/locations/europe-west2/clusters/two-eye-two-see-uk-cluster/nodePools/nb-user"
        name                        = "nb-user"
        # (9 unchanged attributes hidden)

      ~ autoscaling {
          ~ max_node_count       = 20 -> 100
            # (4 unchanged attributes hidden)
        }

        # (3 unchanged blocks hidden)
    }

Plan: 2 to add, 1 to change, 0 to destroy.
  • callysto.tfvars
terraform plan -var-file=projects/callysto.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.notebook["n2-highmem-16"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "callysto-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "northamerica-northeast1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-16"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "callysto-202316"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "callysto-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "northamerica-northeast1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "callysto-202316"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/callysto-202316/locations/northamerica-northeast1/clusters/callysto-cluster/nodePools/nb-user"
        name                        = "nb-user"
        # (9 unchanged attributes hidden)

      ~ autoscaling {
          ~ max_node_count       = 20 -> 100
            # (4 unchanged attributes hidden)
        }

        # (3 unchanged blocks hidden)
    }

Plan: 2 to add, 1 to change, 0 to destroy.
  • cloudbank.tfvars
terraform plan -var-file=projects/cloudbank.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.notebook["n2-highmem-16"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "cb-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-16"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "cb-1003-1696"
      + version                     = "1.26.4-gke.1400"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "cb-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "cb-1003-1696"
      + version                     = "1.26.4-gke.1400"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/cb-1003-1696/locations/us-central1-b/clusters/cb-cluster/nodePools/nb-user"
        name                        = "nb-user"
        # (9 unchanged attributes hidden)

      ~ autoscaling {
          ~ max_node_count       = 20 -> 100
            # (4 unchanged attributes hidden)
        }

        # (3 unchanged blocks hidden)
    }

Plan: 2 to add, 1 to change, 0 to destroy.
  • hhmi.tfvars
terraform plan -var-file=projects/hhmi.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_container_node_pool.notebook["n2-highmem-4"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "hhmi-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-4"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "hhmi-398911"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-4"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "hhmi-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-west2"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "hhmi-398911"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.
  • linked-earth.tfvars
terraform plan -var-file=projects/linked-earth.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "linked-earth-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "linked-earth-hubs"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "linked-earth-cluster-sa@linked-earth-hubs.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.
  • m2lines.tfvars
terraform plan -var-file=projects/m2lines.tfvars
  # google_container_node_pool.notebook["n2-highmem-16"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "m2lines-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-16"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "m2lines-hub"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-4"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "m2lines-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-4"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "m2lines-hub"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-4"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "m2lines-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "m2lines-hub"
      + version                     = "1.27.4-gke.900"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 3 to add, 0 to change, 0 to destroy.
  • pilot-hubs.tfvars
terraform plan -var-file=projects/pilot-hubs.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.notebook["n2-highmem-16"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-16"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = "1.26.4-gke.1400"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = "1.26.4-gke.1400"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-user"
        name                        = "nb-user"
        # (9 unchanged attributes hidden)

      ~ autoscaling {
          ~ max_node_count       = 20 -> 100
            # (4 unchanged attributes hidden)
        }

        # (3 unchanged blocks hidden)
    }

Plan: 2 to add, 1 to change, 0 to destroy.
  • leap.tfvars !!!
terraform plan -var-file=projects/leap.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_cluster.cluster will be updated in-place
  ~ resource "google_container_cluster" "cluster" {
        id                          = "projects/leap-pangeo/locations/us-central1/clusters/leap-cluster"
        name                        = "leap-cluster"
        # (27 unchanged attributes hidden)

      ~ cluster_autoscaling {
          ~ enabled             = true -> false
            # (1 unchanged attribute hidden)

          - resource_limits {
              - maximum       = 26112 -> null
              - minimum       = 1 -> null
              - resource_type = "memory" -> null
            }
          - resource_limits {
              - maximum       = 3264 -> null
              - minimum       = 1 -> null
              - resource_type = "cpu" -> null
            }
          - resource_limits {
              - maximum       = 1024 -> null
              - minimum       = 1 -> null
              - resource_type = "nvidia-tesla-a100" -> null
            }
          - resource_limits {
              - maximum       = 1024 -> null
              - minimum       = 1 -> null
              - resource_type = "nvidia-tesla-k80" -> null
            }
          - resource_limits {
              - maximum       = 1024 -> null
              - minimum       = 1 -> null
              - resource_type = "nvidia-tesla-p100" -> null
            }
          - resource_limits {
              - maximum       = 1024 -> null
              - minimum       = 1 -> null
              - resource_type = "nvidia-tesla-p4" -> null
            }
          - resource_limits {
              - maximum       = 1024 -> null
              - minimum       = 1 -> null
              - resource_type = "nvidia-tesla-t4" -> null
            }
          - resource_limits {
              - maximum       = 1024 -> null
              - minimum       = 1 -> null
              - resource_type = "nvidia-tesla-v100" -> null
            }

            # (1 unchanged block hidden)
        }

        # (23 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["n2-highmem-4"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "leap-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-4"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "leap-pangeo"
      + version                     = "1.25.6-gke.1000"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-4"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "leap-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "leap-pangeo"
      + version                     = "1.25.6-gke.1000"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 2 to add, 1 to change, 0 to destroy.
  • qcl.tfvars !!!!
terraform plan -var-file=projects/qcl.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_cluster.cluster will be updated in-place
  ~ resource "google_container_cluster" "cluster" {
        id                          = "projects/qcl-hub/locations/europe-west1/clusters/qcl-cluster"
      + min_master_version          = "1.25.10-gke.2700"
        name                        = "qcl-cluster"
        # (26 unchanged attributes hidden)

        # (26 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["huge"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/qcl-hub/locations/europe-west1/clusters/qcl-cluster/nodePools/nb-huge"
        name                        = "nb-huge"
      ~ version                     = "1.24.11-gke.1000" -> "1.24.9-gke.3200"
        # (8 unchanged attributes hidden)

        # (5 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["large"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/qcl-hub/locations/europe-west1/clusters/qcl-cluster/nodePools/nb-large"
        name                        = "nb-large"
      ~ version                     = "1.24.11-gke.1000" -> "1.24.9-gke.3200"
        # (8 unchanged attributes hidden)

        # (5 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "qcl-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "europe-west1"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "qcl-hub"
      + version                     = "1.24.9-gke.3200"

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 1 to add, 3 to change, 0 to destroy.

Changes to Outputs:
  ~ regular_channel_latest_k8s_versions = {
      ~ "1."    = "1.27.3-gke.1700" -> "1.27.4-gke.900"
      ~ "1.24." = "1.24.15-gke.1700" -> "1.24.16-gke.500"
      ~ "1.25." = "1.25.11-gke.1700" -> "1.25.12-gke.500"
      ~ "1.26." = "1.26.6-gke.1700" -> "1.26.7-gke.500"
      ~ "1.27." = "1.27.3-gke.1700" -> "1.27.4-gke.900"
    }
  • pangeo-hubs.tfvars
terraform plan -var-file=projects/pangeo-hubs.tfvars
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform planned the following actions, but then encountered a problem:

  # google_container_node_pool.notebook["n2-highmem-16"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pangeo-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-16"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "pangeo-integration-te-3eea"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-4"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pangeo-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-4"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "pangeo-integration-te-3eea"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-4"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["n2-highmem-64"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pangeo-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-n2-highmem-64"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "pangeo-integration-te-3eea"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-64"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 3 to add, 0 to change, 0 to destroy.
╷
│ Warning: Failed to decode resource from state
│ 
│ Error decoding "google_monitoring_alert_policy.disk_space_full_alert" from prior state: unsupported attribute "condition_prometheus_query_language"
╵
╷
│ Error: Failed to get the data key required to decrypt the SOPS file.
│ 
│ Group 0: FAILED
│   projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs: FAILED
│     - | Error decrypting key: googleapi: Error 403: Permission
│       | 'cloudkms.cryptoKeyVersions.useToDecrypt' denied on resource
│       | 'projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs'
│       | (or it may not exist)., forbidden
│ 
│ Recovery failed because no master key was able to decrypt the file. In
│ order for SOPS to recover the file, at least one key has to be successful,
│ but none were.
│ 
│   with data.sops_file.pagerduty_service_integration_keys,
│   on pagerduty.tf line 13, in data "sops_file" "pagerduty_service_integration_keys":
│   13: data "sops_file" "pagerduty_service_integration_keys" {
│ 

@GeorgianaElena GeorgianaElena changed the title Mak all clusters support the instance types 4, 16, and 64 CPU highmem nodes Make all clusters support the instance types 4, 16, and 64 CPU highmem nodes Oct 24, 2023
Contributor

@consideRatio consideRatio left a comment

We have quite a few references to what "large" etc means ;)

[screenshot]

Due to that, I generally favor the non-relative naming, like here:

[screenshot]

I don't mind not taking action on this now if changes have already been made, but it's a slight preference.

@GeorgianaElena
Member Author

@consideRatio, you're totally right. I initially went with more generic names like "n2-highmem-4", but then I saw that we call them "small", "medium" and "large" in the template file, so I figured that going with those names would minimize the amount of disruptive renaming we'd have to do in the future (assuming we might want all these machines available under those names).

Happy to change them back to generic names, including in the template. It did feel awkward to call that one "larger" 😅

@consideRatio
Contributor

Notes from sync chat:

  • Update template files to use absolute naming
  • Update some but not all to use absolute naming side by side with previous naming (see the sketch after this list)
  • (future, doesn't have to be by @GeorgianaElena in this PR) Create a follow-up issue to help us progressively work towards absolute naming of user nodes in GKE and AKS
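
A minimal sketch of the side-by-side approach, assuming a tfvars notebook_nodes map whose keys become the nb-* pool names seen in the plan output above (the field names are illustrative, not the exact schema):

notebook_nodes = {
  # Relative name kept so existing pools aren't disrupted
  "large" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-16",
  },
  # Absolute name added side by side with the one above
  "n2-highmem-16" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-16",
  },
}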

@GeorgianaElena GeorgianaElena changed the title Make all clusters support the instance types 4, 16, and 64 CPU highmem nodes Make all GCP clusters support the instance types 4, 16, and 64 CPU highmem nodes Oct 27, 2023
@GeorgianaElena GeorgianaElena marked this pull request as ready for review October 27, 2023 08:29
@GeorgianaElena GeorgianaElena requested a review from a team as a code owner October 27, 2023 08:29
Contributor

@consideRatio consideRatio left a comment

Wieee, thank you for working on this @GeorgianaElena!!

There were some style changes that dropped trailing commas, and I looked into this and concluded that terraform fmt autoformatting doesn't help with enforcing them. Since it's not enforced by autoformatting and requires manual thought, I don't think we should bother thinking about it further!
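
For reference, a hypothetical tfvars fragment showing both styles; HCL parses them identically, and terraform fmt won't normalize one into the other:

# With a trailing comma: appending an entry later touches only one line
machine_types = [
  "n2-highmem-4",
  "n2-highmem-16",
  "n2-highmem-64",
]

# Without a trailing comma: equally valid, just a different diff shape
other_machine_types = [
  "n2-highmem-4",
  "n2-highmem-16",
  "n2-highmem-64"
]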

Let's go for a merge!

@consideRatio
Contributor

consideRatio commented Oct 28, 2023

Out of scope for this PR, but related: I got an idea for how we can nudge a transition of things over time! Next to each thing we want to change in each terraform/eksctl file, we inline a comment like "FIXME: Update this to ... when given the chance".

That way, we leave an easy-to-resolve FIXME note that someone can pick up when they touch the file anyway, for example while doing k8s upgrade maintenance.
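
Something like this hypothetical fragment, with the FIXME sitting right next to the name it applies to:

notebook_nodes = {
  # FIXME: Rename this pool to its machine type ("n2-highmem-16")
  #        when given the chance, e.g. during a k8s upgrade.
  "large" : {
    min : 0,
    max : 100,
    machine_type : "n2-highmem-16",
  },
}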

@GeorgianaElena
Member Author

Thank you @consideRatio! I've just added a commit with the suggested FIXME comments and the trailing commas, and will now start terraform plan + apply for the changes.

Contributor

@consideRatio consideRatio left a comment

Wiee nice!

@GeorgianaElena
Member Author

Update

I have run terraform plan & apply for all but three clusters.

In the top comment, I pasted the terraform plan output for all of them.

The ones without a check are those I did not run terraform apply for, because of the changes it wanted to make to the current infra. Summary:

1. leap

Leap appears to have cluster_autoscaling enabled, which is not reflected in its terraform config

2. qcl

It seems it wants to make some updates to the existing node pools. I don't understand why.

3. pangeo-hubs

At the end of the plan output, it says:

│ Warning: Failed to decode resource from state
│ 
│ Error decoding "google_monitoring_alert_policy.disk_space_full_alert" from prior state: unsupported attribute "condition_prometheus_query_language"
╵
╷
│ Error: Failed to get the data key required to decrypt the SOPS file.
│ 
│ Group 0: FAILED
│   projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs: FAILED
│     - | Error decrypting key: googleapi: Error 403: Permission
│       | 'cloudkms.cryptoKeyVersions.useToDecrypt' denied on resource
│       | 'projects/two-eye-two-see/locations/global/keyRings/sops-keys/cryptoKeys/similar-hubs'
│       | (or it may not exist)., forbidden
│ 
│ Recovery failed because no master key was able to decrypt the file. In
│ order for SOPS to recover the file, at least one key has to be successful,
│ but none were.
│ 
│   with data.sops_file.pagerduty_service_integration_keys,
│   on pagerduty.tf line 13, in data "sops_file" "pagerduty_service_integration_keys":
│   13: data "sops_file" "pagerduty_service_integration_keys" {
│ 

@sgibson91
Member

@GeorgianaElena for pangeo hubs, you'll have to log into the gcloud cli using your Columbia email I think

@GeorgianaElena
Member Author

GeorgianaElena commented Oct 30, 2023

@GeorgianaElena for pangeo hubs, you'll have to log into the gcloud cli using your Columbia email I think

Thanks @sgibson91. I did that, but I think the issue is that my Columbia account doesn't have permissions to access the SOPS decryption key, which is stored in the two-eye-two-see gcloud project. I'll manually add myself there and make a note in the terraform file if this fixes it.

Update

Yes, granting my Columbia account KMS encrypter/decrypter permissions in the two-eye-two-see project fixed the issue for pangeo-hubs, and I was able to terraform apply.
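
For the note in the terraform file, the grant could look roughly like this sketch; the resource name and member email are hypothetical, while the key path comes from the error output above:

# Sketch: let the Columbia account decrypt the shared SOPS key
resource "google_kms_crypto_key_iam_member" "columbia_sops_decrypt" {
  crypto_key_id = "two-eye-two-see/global/sops-keys/similar-hubs"
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "user:deployer@columbia.edu" # hypothetical email
}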

Remaining clusters with terraform apply issues: leap and qcl

@consideRatio
Contributor

LEAP:

Leap appears to have cluster_autoscaling enabled, which is not reflected in its terraform config

I think node auto-provisioning has been enabled as part of Yuvi trialing things in #3287, and that relies on adjusting the GKE-managed cluster autoscaler, which doesn't run inside the k8s cluster the way it does on EKS.
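
If so, the drift would correspond to the cluster_autoscaling block on the cluster resource. A minimal sketch of what GKE node auto-provisioning looks like in the google provider, with hypothetical names and illustrative limits:

resource "google_container_cluster" "cluster" {
  name               = "leap-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  # Node auto-provisioning: GKE's managed autoscaler creates and deletes
  # node pools on demand within these cluster-wide resource limits.
  cluster_autoscaling {
    enabled = true

    resource_limits {
      resource_type = "cpu"
      minimum       = 0
      maximum       = 1000
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 0
      maximum       = 4000
    }
  }
}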

QCL:

It seems it wants to make some updates to the existing node pools. I don't understand why.

What kind of updates? Are they related to k8s node versions? If so, I suspect it's a remnant of a k8s cluster upgrade, where the node pools weren't updated as part of the k8s api-server upgrade.

@consideRatio
Contributor

consideRatio commented Oct 30, 2023

OMG amazing summary of your actions in the PR description @GeorgianaElena, looking now!

Hmmm, so node pools are being downgraded... "1.24.11-gke.1000" -> "1.24.9-gke.3200"

@consideRatio
Contributor

@GeorgianaElena I think it's fine that we update large and huge in place because they aren't currently in use, so they can be re-created/updated without issues to get the same pinned k8s version. Apparently they run a more modern k8s version than most other nodes.

This cluster was created without pinned k8s versions for the nodes, and hasn't been upgraded to align all node versions since the pinning was introduced in the terraform config.

[screenshot]
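
The pinning in question is the version argument on each node pool; a rough sketch using the versions from the plan diff above:

resource "google_container_node_pool" "notebook" {
  name    = "nb-n2-highmem-4"
  cluster = google_container_cluster.cluster.id

  # Pin the pool's k8s version explicitly. Pools created before the pin can
  # sit at a newer patch version (e.g. 1.24.11-gke.1000), which terraform
  # then wants to "align" back down to the pinned 1.24.9-gke.3200.
  version = "1.24.9-gke.3200"

  management {
    auto_repair  = true
    auto_upgrade = false
  }

  autoscaling {
    min_node_count = 0
    max_node_count = 100
  }

  node_config {
    machine_type = "n2-highmem-4"
  }
}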

@consideRatio
Contributor

consideRatio commented Oct 30, 2023

@GeorgianaElena I went ahead with the QCL node pool changes while they remained inactive - QCL complete!

@consideRatio
Contributor

@GeorgianaElena I'm quite confident that #3287 was causing the LEAP issues. I suggest we simply merge this as-is for now though, as I don't think it's in scope for this PR to resolve it.

@GeorgianaElena
Member Author

@GeorgianaElena I went ahead with the QCL node pool changes while they remained inactive - QCL complete!

Amazing! Thank you @consideRatio <3

@GeorgianaElena I'm quite confident that #3287 was causing the LEAP issues. I suggest we simply merge this as-is for now though, as I don't think it's in scope for this PR to resolve it.

Thanks @consideRatio! Then I will merge this now since almost everything was terraform applied.

@GeorgianaElena GeorgianaElena merged commit dd8760f into 2i2c-org:master Oct 31, 2023
1 check passed
@GeorgianaElena GeorgianaElena deleted the add-default-machine-types branch October 31, 2023 07:52