
gcp, dask-worker-nodes: pangeo-hubs to use single dask worker node type #3024

Merged

4 commits merged into 2i2c-org:master on Aug 25, 2023

Conversation

consideRatio (Contributor) commented Aug 24, 2023

pangeo-hubs is the last 2i2c cluster with multiple dask worker node types, so once this terraform change is applied and merged we can fix #2687.

If we get all clusters to use a single node type with 16 CPU and 128 GB of memory (r5.4xlarge on AWS / n2-highmem-16 on GCP), we can provide good defaults for dask-gateway users when they decide how powerful their workers should be. This is planned in #2687.
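
In tfvars terms, this PR boils down to collapsing the per-size dask node pool map into a single entry. A minimal way to sanity-check that, assuming the variable is named dask_nodes (the key, autoscaling limits, and machine type below are taken from the plan output further down):

grep -A 6 'dask_nodes' terraform/gcp/projects/pangeo-hubs.tfvars
# expected, roughly:
#   dask_nodes = {
#     "worker" : { min : 0, max : 100, machine_type : "n2-highmem-16" },
#   }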

I'm not able to get this all the way through myself though as I lack access to the infrastructure.

Action plan

  • Someone else approves this PR
  • I check from time to time whether there are dask worker nodes active, and ask for help when there aren't
  • Someone else applies this terraform change
  • I merge the PR

Current activity

gke-pangeo-hubs-cluster-dask-medium-552f8a1e-6ndl   Ready    <none>   16h     v1.26.4-gke.1400
gke-pangeo-hubs-cluster-dask-medium-552f8a1e-ssll   Ready    <none>   21h     v1.26.4-gke.1400
gke-pangeo-hubs-cluster-dask-medium-552f8a1e-z6wp   Ready    <none>   3h23m   v1.26.4-gke.1400
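
A listing like the above can be reproduced by filtering nodes on the dask worker label seen in the node pool config (a sketch; assumes kubectl is pointed at the pangeo-hubs cluster):

# list only dask worker nodes via their node-purpose label
kubectl get nodes -l k8s.dask.org/node-purpose=worker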

The Grafana dashboard at https://grafana.gcp.pangeo.2i2c.cloud is down because prometheus is crashing, so I can't tell whether there is a history of always having dask worker nodes active or similar. I can get a brief response before it crashes, but it indicates no data is available anyhow...

support-prometheus-server-7c4f454847-6h9h6          2/2     Running   21 (2m19s ago)   17d
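
A sketch of how the crash loop could be investigated further (the support namespace and the prometheus-server container name are assumptions based on the pod name):

# describe the pod and pull logs from the previous, crashed container run
kubectl -n support describe pod support-prometheus-server-7c4f454847-6h9h6
kubectl -n support logs support-prometheus-server-7c4f454847-6h9h6 \
  -c prometheus-server --previous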

consideRatio (Contributor, Author) commented Aug 24, 2023

The apply steps are something like the following, I think:

# A GCP account with permissions to the GCP project columbia at
# https://console.cloud.google.com/iam-admin/iam?project=columbia
# is required, and we don't have access with our @2i2c.org accounts
gcloud auth login --update-adc

gh pr checkout 3024
cd terraform/gcp
# clear any locally cached provider/backend state from other clusters
rm -rf .terraform

terraform init -backend-config backends/pangeo-backend.hcl
terraform workspace list
terraform workspace select pangeo-hubs

terraform apply --var-file projects/pangeo-hubs.tfvars
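
A plan-only dry run can be slotted in before the apply to review the diff without changing anything (a sketch, using the same var file):

terraform plan --var-file projects/pangeo-hubs.tfvars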

GeorgianaElena (Member)

terraform plan output below:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  - destroy
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # google_container_node_pool.core must be replaced
-/+ resource "google_container_node_pool" "core" {
      ~ id                          = "projects/pangeo-integration-te-3eea/locations/us-central1-b/clusters/pangeo-hubs-cluster/nodePools/core-pool" -> (known after apply)
      ~ instance_group_urls         = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroupManagers/gke-pangeo-hubs-cluster-core-pool-c8492309-grp",
        ] -> (known after apply)
      ~ managed_instance_group_urls = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-core-pool-c8492309-grp",
        ] -> (known after apply)
      ~ max_pods_per_node           = 110 -> (known after apply)
        name                        = "core-pool"
      + name_prefix                 = (known after apply)
      ~ node_count                  = 2 -> (known after apply)
      ~ node_locations              = [
          - "us-central1-b",
        ] -> (known after apply)
      + operation                   = (known after apply)
      ~ version                     = "1.26.4-gke.1400" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ autoscaling {
          ~ location_policy      = "BALANCED" -> (known after apply)
          - total_max_node_count = 0 -> null
          - total_min_node_count = 0 -> null
            # (2 unchanged attributes hidden)
        }

      - network_config {
          - create_pod_range     = false -> null
          - enable_private_nodes = false -> null
          - pod_ipv4_cidr_block  = "10.8.0.0/14" -> null
          - pod_range            = "gke-pangeo-hubs-cluster-pods-14554e9f" -> null
        }

      ~ node_config {
          ~ disk_type         = "pd-balanced" -> (known after apply)
          ~ guest_accelerator = [] -> (known after apply)
          ~ image_type        = "COS_CONTAINERD" -> (known after apply)
          ~ local_ssd_count   = 0 -> (known after apply)
          ~ machine_type      = "n2-highmem-8" -> "n2-highmem-4" # forces replacement
          ~ metadata          = {
              - "disable-legacy-endpoints" = "true"
            } -> (known after apply)
          + min_cpu_platform  = (known after apply)
          - resource_labels   = {} -> null
            tags              = []
          ~ taint             = [] -> (known after apply)
            # (7 unchanged attributes hidden)

          - shielded_instance_config {
              - enable_integrity_monitoring = true -> null
              - enable_secure_boot          = false -> null
            }

          - workload_metadata_config {
              - mode = "GKE_METADATA" -> null
            }
        }

      - upgrade_settings {
          - max_surge       = 1 -> null
          - max_unavailable = 0 -> null
          - strategy        = "SURGE" -> null
        }

        # (1 unchanged block hidden)
    }

  # google_container_node_pool.dask_worker["large"] will be destroyed
  # (because key ["large"] is not in for_each map)
  - resource "google_container_node_pool" "dask_worker" {
      - cluster                     = "pangeo-hubs-cluster" -> null
      - id                          = "projects/pangeo-integration-te-3eea/locations/us-central1-b/clusters/pangeo-hubs-cluster/nodePools/dask-large" -> null
      - initial_node_count          = 0 -> null
      - instance_group_urls         = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroupManagers/gke-pangeo-hubs-cluster-dask-large-0a156e10-grp",
        ] -> null
      - location                    = "us-central1-b" -> null
      - managed_instance_group_urls = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-dask-large-0a156e10-grp",
        ] -> null
      - max_pods_per_node           = 110 -> null
      - name                        = "dask-large" -> null
      - node_count                  = 0 -> null
      - node_locations              = [
          - "us-central1-b",
        ] -> null
      - project                     = "pangeo-integration-te-3eea" -> null
      - version                     = "1.26.4-gke.1400" -> null

      - autoscaling {
          - location_policy      = "ANY" -> null
          - max_node_count       = 100 -> null
          - min_node_count       = 0 -> null
          - total_max_node_count = 0 -> null
          - total_min_node_count = 0 -> null
        }

      - management {
          - auto_repair  = true -> null
          - auto_upgrade = false -> null
        }

      - network_config {
          - create_pod_range     = false -> null
          - enable_private_nodes = false -> null
          - pod_ipv4_cidr_block  = "10.8.0.0/14" -> null
          - pod_range            = "gke-pangeo-hubs-cluster-pods-14554e9f" -> null
        }

      - node_config {
          - disk_size_gb      = 100 -> null
          - disk_type         = "pd-balanced" -> null
          - guest_accelerator = [] -> null
          - image_type        = "COS_CONTAINERD" -> null
          - labels            = {
              - "k8s.dask.org/node-purpose" = "worker"
            } -> null
          - local_ssd_count   = 0 -> null
          - logging_variant   = "DEFAULT" -> null
          - machine_type      = "n1-standard-16" -> null
          - metadata          = {
              - "disable-legacy-endpoints" = "true"
            } -> null
          - oauth_scopes      = [
              - "https://www.googleapis.com/auth/cloud-platform",
            ] -> null
          - preemptible       = true -> null
          - resource_labels   = {} -> null
          - service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com" -> null
          - spot              = false -> null
          - tags              = [] -> null
          - taint             = [
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "k8s.dask.org_dedicated"
                  - value  = "worker"
                },
            ] -> null

          - shielded_instance_config {
              - enable_integrity_monitoring = true -> null
              - enable_secure_boot          = false -> null
            }

          - workload_metadata_config {
              - mode = "GKE_METADATA" -> null
            }
        }

      - upgrade_settings {
          - max_surge       = 1 -> null
          - max_unavailable = 0 -> null
          - strategy        = "SURGE" -> null
        }
    }

  # google_container_node_pool.dask_worker["medium"] will be destroyed
  # (because key ["medium"] is not in for_each map)
  - resource "google_container_node_pool" "dask_worker" {
      - cluster                     = "pangeo-hubs-cluster" -> null
      - id                          = "projects/pangeo-integration-te-3eea/locations/us-central1-b/clusters/pangeo-hubs-cluster/nodePools/dask-medium" -> null
      - initial_node_count          = 0 -> null
      - instance_group_urls         = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroupManagers/gke-pangeo-hubs-cluster-dask-medium-552f8a1e-grp",
        ] -> null
      - location                    = "us-central1-b" -> null
      - managed_instance_group_urls = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-dask-medium-552f8a1e-grp",
        ] -> null
      - max_pods_per_node           = 110 -> null
      - name                        = "dask-medium" -> null
      - node_count                  = 0 -> null
      - node_locations              = [
          - "us-central1-b",
        ] -> null
      - project                     = "pangeo-integration-te-3eea" -> null
      - version                     = "1.26.4-gke.1400" -> null

      - autoscaling {
          - location_policy      = "ANY" -> null
          - max_node_count       = 100 -> null
          - min_node_count       = 0 -> null
          - total_max_node_count = 0 -> null
          - total_min_node_count = 0 -> null
        }

      - management {
          - auto_repair  = true -> null
          - auto_upgrade = false -> null
        }

      - network_config {
          - create_pod_range     = false -> null
          - enable_private_nodes = false -> null
          - pod_ipv4_cidr_block  = "10.8.0.0/14" -> null
          - pod_range            = "gke-pangeo-hubs-cluster-pods-14554e9f" -> null
        }

      - node_config {
          - disk_size_gb      = 100 -> null
          - disk_type         = "pd-balanced" -> null
          - guest_accelerator = [] -> null
          - image_type        = "COS_CONTAINERD" -> null
          - labels            = {
              - "k8s.dask.org/node-purpose" = "worker"
            } -> null
          - local_ssd_count   = 0 -> null
          - logging_variant   = "DEFAULT" -> null
          - machine_type      = "n1-standard-8" -> null
          - metadata          = {
              - "disable-legacy-endpoints" = "true"
            } -> null
          - oauth_scopes      = [
              - "https://www.googleapis.com/auth/cloud-platform",
            ] -> null
          - preemptible       = true -> null
          - resource_labels   = {} -> null
          - service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com" -> null
          - spot              = false -> null
          - tags              = [] -> null
          - taint             = [
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "k8s.dask.org_dedicated"
                  - value  = "worker"
                },
            ] -> null

          - shielded_instance_config {
              - enable_integrity_monitoring = true -> null
              - enable_secure_boot          = false -> null
            }

          - workload_metadata_config {
              - mode = "GKE_METADATA" -> null
            }
        }

      - upgrade_settings {
          - max_surge       = 1 -> null
          - max_unavailable = 0 -> null
          - strategy        = "SURGE" -> null
        }
    }

  # google_container_node_pool.dask_worker["small"] will be destroyed
  # (because key ["small"] is not in for_each map)
  - resource "google_container_node_pool" "dask_worker" {
      - cluster                     = "pangeo-hubs-cluster" -> null
      - id                          = "projects/pangeo-integration-te-3eea/locations/us-central1-b/clusters/pangeo-hubs-cluster/nodePools/dask-small" -> null
      - initial_node_count          = 0 -> null
      - instance_group_urls         = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroupManagers/gke-pangeo-hubs-cluster-dask-small-ab203ba0-grp",
        ] -> null
      - location                    = "us-central1-b" -> null
      - managed_instance_group_urls = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-dask-small-ab203ba0-grp",
        ] -> null
      - max_pods_per_node           = 110 -> null
      - name                        = "dask-small" -> null
      - node_count                  = 0 -> null
      - node_locations              = [
          - "us-central1-b",
        ] -> null
      - project                     = "pangeo-integration-te-3eea" -> null
      - version                     = "1.26.4-gke.1400" -> null

      - autoscaling {
          - location_policy      = "ANY" -> null
          - max_node_count       = 100 -> null
          - min_node_count       = 0 -> null
          - total_max_node_count = 0 -> null
          - total_min_node_count = 0 -> null
        }

      - management {
          - auto_repair  = true -> null
          - auto_upgrade = false -> null
        }

      - network_config {
          - create_pod_range     = false -> null
          - enable_private_nodes = false -> null
          - pod_ipv4_cidr_block  = "10.8.0.0/14" -> null
          - pod_range            = "gke-pangeo-hubs-cluster-pods-14554e9f" -> null
        }

      - node_config {
          - disk_size_gb      = 100 -> null
          - disk_type         = "pd-balanced" -> null
          - guest_accelerator = [] -> null
          - image_type        = "COS_CONTAINERD" -> null
          - labels            = {
              - "k8s.dask.org/node-purpose" = "worker"
            } -> null
          - local_ssd_count   = 0 -> null
          - logging_variant   = "DEFAULT" -> null
          - machine_type      = "n1-standard-4" -> null
          - metadata          = {
              - "disable-legacy-endpoints" = "true"
            } -> null
          - oauth_scopes      = [
              - "https://www.googleapis.com/auth/cloud-platform",
            ] -> null
          - preemptible       = true -> null
          - resource_labels   = {} -> null
          - service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com" -> null
          - spot              = false -> null
          - tags              = [] -> null
          - taint             = [
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "k8s.dask.org_dedicated"
                  - value  = "worker"
                },
            ] -> null

          - shielded_instance_config {
              - enable_integrity_monitoring = true -> null
              - enable_secure_boot          = false -> null
            }

          - workload_metadata_config {
              - mode = "GKE_METADATA" -> null
            }
        }

      - upgrade_settings {
          - max_surge       = 1 -> null
          - max_unavailable = 0 -> null
          - strategy        = "SURGE" -> null
        }
    }

  # google_container_node_pool.dask_worker["worker"] will be created
  + resource "google_container_node_pool" "dask_worker" {
      + cluster                     = "pangeo-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "dask-worker"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "pangeo-integration-te-3eea"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "k8s.dask.org/node-purpose" = "worker"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = true
          + service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "k8s.dask.org_dedicated"
                  + value  = "worker"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 2 to add, 0 to change, 4 to destroy.

Changes to Outputs:
  ~ regular_channel_latest_k8s_versions = {
      ~ "1."    = "1.27.2-gke.1200" -> "1.27.3-gke.1700"
      - "1.22." = "1.22.17-gke.11400"
      - "1.23." = "1.23.17-gke.5600"
      ~ "1.24." = "1.24.13-gke.2500" -> "1.24.15-gke.1700"
      ~ "1.25." = "1.25.9-gke.2300" -> "1.25.11-gke.1700"
      + "1.26." = "1.26.6-gke.1700"
      + "1.27." = "1.27.3-gke.1700"
    }

consideRatio (Contributor, Author)

Thank you @GeorgianaElena for working on this!!!

Hmmm, I don't get why the core node pool is replaced. Only the dask worker node pools are meant to be. If you can get a plan where only the dask worker node pools are destroyed, this can be applied in my mind.

consideRatio (Contributor, Author) commented Aug 25, 2023

If you check out the master branch, does terraform apply cause a replacement of the core node pool as well? It could be that we have some state mismatch unrelated to this PR's change.

Ah... that is the case!

I see the core node pool's machine type is n2-highmem-8 but our config says n2-highmem-4. Maybe you can try setting it to n2-highmem-8 instead then?
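
One way to confirm what the live core pool actually runs before editing the config (a sketch; requires gcloud auth against the pangeo project, and cluster/zone/pool names are taken from the plan output above):

# read the live machine type of the core pool straight from GKE
gcloud container node-pools describe core-pool \
  --cluster=pangeo-hubs-cluster --zone=us-central1-b \
  --format='value(config.machineType)'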

GeorgianaElena (Member)

The plan looks to be updating only the dask node pools now, as expected:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  - destroy

Terraform will perform the following actions:

  # google_container_node_pool.dask_worker["large"] will be destroyed
  # (because key ["large"] is not in for_each map)
  - resource "google_container_node_pool" "dask_worker" {
      - cluster                     = "pangeo-hubs-cluster" -> null
      - id                          = "projects/pangeo-integration-te-3eea/locations/us-central1-b/clusters/pangeo-hubs-cluster/nodePools/dask-large" -> null
      - initial_node_count          = 0 -> null
      - instance_group_urls         = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroupManagers/gke-pangeo-hubs-cluster-dask-large-0a156e10-grp",
        ] -> null
      - location                    = "us-central1-b" -> null
      - managed_instance_group_urls = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-dask-large-0a156e10-grp",
        ] -> null
      - max_pods_per_node           = 110 -> null
      - name                        = "dask-large" -> null
      - node_count                  = 0 -> null
      - node_locations              = [
          - "us-central1-b",
        ] -> null
      - project                     = "pangeo-integration-te-3eea" -> null
      - version                     = "1.26.4-gke.1400" -> null

      - autoscaling {
          - location_policy      = "ANY" -> null
          - max_node_count       = 100 -> null
          - min_node_count       = 0 -> null
          - total_max_node_count = 0 -> null
          - total_min_node_count = 0 -> null
        }

      - management {
          - auto_repair  = true -> null
          - auto_upgrade = false -> null
        }

      - network_config {
          - create_pod_range     = false -> null
          - enable_private_nodes = false -> null
          - pod_ipv4_cidr_block  = "10.8.0.0/14" -> null
          - pod_range            = "gke-pangeo-hubs-cluster-pods-14554e9f" -> null
        }

      - node_config {
          - disk_size_gb      = 100 -> null
          - disk_type         = "pd-balanced" -> null
          - guest_accelerator = [] -> null
          - image_type        = "COS_CONTAINERD" -> null
          - labels            = {
              - "k8s.dask.org/node-purpose" = "worker"
            } -> null
          - local_ssd_count   = 0 -> null
          - logging_variant   = "DEFAULT" -> null
          - machine_type      = "n1-standard-16" -> null
          - metadata          = {
              - "disable-legacy-endpoints" = "true"
            } -> null
          - oauth_scopes      = [
              - "https://www.googleapis.com/auth/cloud-platform",
            ] -> null
          - preemptible       = true -> null
          - resource_labels   = {} -> null
          - service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com" -> null
          - spot              = false -> null
          - tags              = [] -> null
          - taint             = [
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "k8s.dask.org_dedicated"
                  - value  = "worker"
                },
            ] -> null

          - shielded_instance_config {
              - enable_integrity_monitoring = true -> null
              - enable_secure_boot          = false -> null
            }

          - workload_metadata_config {
              - mode = "GKE_METADATA" -> null
            }
        }

      - upgrade_settings {
          - max_surge       = 1 -> null
          - max_unavailable = 0 -> null
          - strategy        = "SURGE" -> null
        }
    }

  # google_container_node_pool.dask_worker["medium"] will be destroyed
  # (because key ["medium"] is not in for_each map)
  - resource "google_container_node_pool" "dask_worker" {
      - cluster                     = "pangeo-hubs-cluster" -> null
      - id                          = "projects/pangeo-integration-te-3eea/locations/us-central1-b/clusters/pangeo-hubs-cluster/nodePools/dask-medium" -> null
      - initial_node_count          = 0 -> null
      - instance_group_urls         = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroupManagers/gke-pangeo-hubs-cluster-dask-medium-552f8a1e-grp",
        ] -> null
      - location                    = "us-central1-b" -> null
      - managed_instance_group_urls = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-dask-medium-552f8a1e-grp",
        ] -> null
      - max_pods_per_node           = 110 -> null
      - name                        = "dask-medium" -> null
      - node_count                  = 0 -> null
      - node_locations              = [
          - "us-central1-b",
        ] -> null
      - project                     = "pangeo-integration-te-3eea" -> null
      - version                     = "1.26.4-gke.1400" -> null

      - autoscaling {
          - location_policy      = "ANY" -> null
          - max_node_count       = 100 -> null
          - min_node_count       = 0 -> null
          - total_max_node_count = 0 -> null
          - total_min_node_count = 0 -> null
        }

      - management {
          - auto_repair  = true -> null
          - auto_upgrade = false -> null
        }

      - network_config {
          - create_pod_range     = false -> null
          - enable_private_nodes = false -> null
          - pod_ipv4_cidr_block  = "10.8.0.0/14" -> null
          - pod_range            = "gke-pangeo-hubs-cluster-pods-14554e9f" -> null
        }

      - node_config {
          - disk_size_gb      = 100 -> null
          - disk_type         = "pd-balanced" -> null
          - guest_accelerator = [] -> null
          - image_type        = "COS_CONTAINERD" -> null
          - labels            = {
              - "k8s.dask.org/node-purpose" = "worker"
            } -> null
          - local_ssd_count   = 0 -> null
          - logging_variant   = "DEFAULT" -> null
          - machine_type      = "n1-standard-8" -> null
          - metadata          = {
              - "disable-legacy-endpoints" = "true"
            } -> null
          - oauth_scopes      = [
              - "https://www.googleapis.com/auth/cloud-platform",
            ] -> null
          - preemptible       = true -> null
          - resource_labels   = {} -> null
          - service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com" -> null
          - spot              = false -> null
          - tags              = [] -> null
          - taint             = [
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "k8s.dask.org_dedicated"
                  - value  = "worker"
                },
            ] -> null

          - shielded_instance_config {
              - enable_integrity_monitoring = true -> null
              - enable_secure_boot          = false -> null
            }

          - workload_metadata_config {
              - mode = "GKE_METADATA" -> null
            }
        }

      - upgrade_settings {
          - max_surge       = 1 -> null
          - max_unavailable = 0 -> null
          - strategy        = "SURGE" -> null
        }
    }

  # google_container_node_pool.dask_worker["small"] will be destroyed
  # (because key ["small"] is not in for_each map)
  - resource "google_container_node_pool" "dask_worker" {
      - cluster                     = "pangeo-hubs-cluster" -> null
      - id                          = "projects/pangeo-integration-te-3eea/locations/us-central1-b/clusters/pangeo-hubs-cluster/nodePools/dask-small" -> null
      - initial_node_count          = 0 -> null
      - instance_group_urls         = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroupManagers/gke-pangeo-hubs-cluster-dask-small-ab203ba0-grp",
        ] -> null
      - location                    = "us-central1-b" -> null
      - managed_instance_group_urls = [
          - "https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-dask-small-ab203ba0-grp",
        ] -> null
      - max_pods_per_node           = 110 -> null
      - name                        = "dask-small" -> null
      - node_count                  = 0 -> null
      - node_locations              = [
          - "us-central1-b",
        ] -> null
      - project                     = "pangeo-integration-te-3eea" -> null
      - version                     = "1.26.4-gke.1400" -> null

      - autoscaling {
          - location_policy      = "ANY" -> null
          - max_node_count       = 100 -> null
          - min_node_count       = 0 -> null
          - total_max_node_count = 0 -> null
          - total_min_node_count = 0 -> null
        }

      - management {
          - auto_repair  = true -> null
          - auto_upgrade = false -> null
        }

      - network_config {
          - create_pod_range     = false -> null
          - enable_private_nodes = false -> null
          - pod_ipv4_cidr_block  = "10.8.0.0/14" -> null
          - pod_range            = "gke-pangeo-hubs-cluster-pods-14554e9f" -> null
        }

      - node_config {
          - disk_size_gb      = 100 -> null
          - disk_type         = "pd-balanced" -> null
          - guest_accelerator = [] -> null
          - image_type        = "COS_CONTAINERD" -> null
          - labels            = {
              - "k8s.dask.org/node-purpose" = "worker"
            } -> null
          - local_ssd_count   = 0 -> null
          - logging_variant   = "DEFAULT" -> null
          - machine_type      = "n1-standard-4" -> null
          - metadata          = {
              - "disable-legacy-endpoints" = "true"
            } -> null
          - oauth_scopes      = [
              - "https://www.googleapis.com/auth/cloud-platform",
            ] -> null
          - preemptible       = true -> null
          - resource_labels   = {} -> null
          - service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com" -> null
          - spot              = false -> null
          - tags              = [] -> null
          - taint             = [
              - {
                  - effect = "NO_SCHEDULE"
                  - key    = "k8s.dask.org_dedicated"
                  - value  = "worker"
                },
            ] -> null

          - shielded_instance_config {
              - enable_integrity_monitoring = true -> null
              - enable_secure_boot          = false -> null
            }

          - workload_metadata_config {
              - mode = "GKE_METADATA" -> null
            }
        }

      - upgrade_settings {
          - max_surge       = 1 -> null
          - max_unavailable = 0 -> null
          - strategy        = "SURGE" -> null
        }
    }

  # google_container_node_pool.dask_worker["worker"] will be created
  + resource "google_container_node_pool" "dask_worker" {
      + cluster                     = "pangeo-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 0
      + instance_group_urls         = (known after apply)
      + location                    = "us-central1-b"
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "dask-worker"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "pangeo-integration-te-3eea"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 0
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "k8s.dask.org/node-purpose" = "worker"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = true
          + service_account   = "pangeo-hubs-cluster-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "k8s.dask.org_dedicated"
                  + value  = "worker"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 1 to add, 0 to change, 3 to destroy.

Changes to Outputs:
  ~ regular_channel_latest_k8s_versions = {
      ~ "1."    = "1.27.2-gke.1200" -> "1.27.3-gke.1700"
      - "1.22." = "1.22.17-gke.11400"
      - "1.23." = "1.23.17-gke.5600"
      ~ "1.24." = "1.24.13-gke.2500" -> "1.24.15-gke.1700"
      ~ "1.25." = "1.25.9-gke.2300" -> "1.25.11-gke.1700"
      + "1.26." = "1.26.6-gke.1700"
      + "1.27." = "1.27.3-gke.1700"
    }

github-actions (bot)

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| aws | catalystproject-africa | No | | Yes | Core infrastructure has been modified |
| aws | smithsonian | No | | Yes | Core infrastructure has been modified |
| kubeconfig | utoronto | No | | Yes | Core infrastructure has been modified |
| aws | openscapes | No | | Yes | Core infrastructure has been modified |
| aws | carbonplan | No | | Yes | Core infrastructure has been modified |
| aws | ubc-eoas | No | | Yes | Core infrastructure has been modified |
| gcp | qcl | No | | Yes | Core infrastructure has been modified |
| gcp | linked-earth | No | | Yes | Core infrastructure has been modified |
| aws | jupyter-meets-the-earth | No | | Yes | Core infrastructure has been modified |
| gcp | awi-ciroh | No | | Yes | Core infrastructure has been modified |
| aws | 2i2c-aws-us | No | | Yes | Core infrastructure has been modified |
| gcp | 2i2c-uk | No | | Yes | Core infrastructure has been modified |
| gcp | m2lines | No | | Yes | Core infrastructure has been modified |
| aws | gridsst | No | | Yes | Core infrastructure has been modified |
| gcp | 2i2c | No | | Yes | Core infrastructure has been modified |
| aws | nasa-cryo | No | | Yes | Core infrastructure has been modified |
| gcp | leap | No | | Yes | Core infrastructure has been modified |
| gcp | catalystproject-latam | No | | Yes | Core infrastructure has been modified |
| aws | victor | No | | Yes | Core infrastructure has been modified |
| gcp | callysto | No | | Yes | Core infrastructure has been modified |
| gcp | meom-ige | No | | Yes | Core infrastructure has been modified |
| gcp | pangeo-hubs | No | | Yes | Core infrastructure has been modified |
| aws | nasa-veda | No | | Yes | Core infrastructure has been modified |
| gcp | cloudbank | No | | Yes | Core infrastructure has been modified |
| aws | nasa-ghg | No | | Yes | Core infrastructure has been modified |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| aws | smithsonian | prod | Core infrastructure has been modified |
| kubeconfig | utoronto | prod | Core infrastructure has been modified |
| kubeconfig | utoronto | r-prod | Core infrastructure has been modified |
| aws | openscapes | prod | Core infrastructure has been modified |
| aws | carbonplan | prod | Core infrastructure has been modified |
| aws | ubc-eoas | prod | Core infrastructure has been modified |
| gcp | qcl | prod | Core infrastructure has been modified |
| gcp | linked-earth | prod | Core infrastructure has been modified |
| aws | jupyter-meets-the-earth | prod | Core infrastructure has been modified |
| gcp | awi-ciroh | prod | Core infrastructure has been modified |
| aws | 2i2c-aws-us | researchdelight | Core infrastructure has been modified |
| aws | 2i2c-aws-us | ncar-cisl | Core infrastructure has been modified |
| aws | 2i2c-aws-us | go-bgc | Core infrastructure has been modified |
| aws | 2i2c-aws-us | itcoocean | Core infrastructure has been modified |
| gcp | 2i2c-uk | lis | Core infrastructure has been modified |
| gcp | m2lines | prod | Core infrastructure has been modified |
| aws | gridsst | prod | Core infrastructure has been modified |
| gcp | 2i2c | hackanexoplanet | Core infrastructure has been modified |
| gcp | 2i2c | imagebuilding-demo | Core infrastructure has been modified |
| gcp | 2i2c | demo | Core infrastructure has been modified |
| gcp | 2i2c | ohw | Core infrastructure has been modified |
| gcp | 2i2c | pfw | Core infrastructure has been modified |
| gcp | 2i2c | aup | Core infrastructure has been modified |
| gcp | 2i2c | temple | Core infrastructure has been modified |
| gcp | 2i2c | ucmerced | Core infrastructure has been modified |
| gcp | 2i2c | cosmicds | Core infrastructure has been modified |
| gcp | 2i2c | climatematch | Core infrastructure has been modified |
| gcp | 2i2c | neurohackademy | Core infrastructure has been modified |
| gcp | 2i2c | mtu | Core infrastructure has been modified |
| aws | nasa-cryo | prod | Core infrastructure has been modified |
| gcp | leap | prod | Core infrastructure has been modified |
| gcp | catalystproject-latam | unitefa-conicet | Core infrastructure has been modified |
| aws | victor | prod | Core infrastructure has been modified |
| gcp | callysto | prod | Core infrastructure has been modified |
| gcp | meom-ige | prod | Core infrastructure has been modified |
| gcp | pangeo-hubs | prod | Core infrastructure has been modified |
| gcp | pangeo-hubs | coessing | Core infrastructure has been modified |
| aws | nasa-veda | prod | Core infrastructure has been modified |
| gcp | cloudbank | bcc | Core infrastructure has been modified |
| gcp | cloudbank | ccsf | Core infrastructure has been modified |
| gcp | cloudbank | csm | Core infrastructure has been modified |
| gcp | cloudbank | dvc | Core infrastructure has been modified |
| gcp | cloudbank | elcamino | Core infrastructure has been modified |
| gcp | cloudbank | evc | Core infrastructure has been modified |
| gcp | cloudbank | glendale | Core infrastructure has been modified |
| gcp | cloudbank | howard | Core infrastructure has been modified |
| gcp | cloudbank | miracosta | Core infrastructure has been modified |
| gcp | cloudbank | skyline | Core infrastructure has been modified |
| gcp | cloudbank | demo | Core infrastructure has been modified |
| gcp | cloudbank | fresno | Core infrastructure has been modified |
| gcp | cloudbank | humboldt | Core infrastructure has been modified |
| gcp | cloudbank | laney | Core infrastructure has been modified |
| gcp | cloudbank | sbcc | Core infrastructure has been modified |
| gcp | cloudbank | lacc | Core infrastructure has been modified |
| gcp | cloudbank | lamission | Core infrastructure has been modified |
| gcp | cloudbank | mills | Core infrastructure has been modified |
| gcp | cloudbank | mission | Core infrastructure has been modified |
| gcp | cloudbank | norco | Core infrastructure has been modified |
| gcp | cloudbank | palomar | Core infrastructure has been modified |
| gcp | cloudbank | pasadena | Core infrastructure has been modified |
| gcp | cloudbank | sjcc | Core infrastructure has been modified |
| gcp | cloudbank | sacramento | Core infrastructure has been modified |
| gcp | cloudbank | srjc | Core infrastructure has been modified |
| gcp | cloudbank | saddleback | Core infrastructure has been modified |
| gcp | cloudbank | santiago | Core infrastructure has been modified |
| gcp | cloudbank | sjsu | Core infrastructure has been modified |
| gcp | cloudbank | tuskegee | Core infrastructure has been modified |
| gcp | cloudbank | wlac | Core infrastructure has been modified |
| gcp | cloudbank | csulb | Core infrastructure has been modified |
| aws | nasa-ghg | prod | Core infrastructure has been modified |

GeorgianaElena (Member)

@consideRatio, I still don't see any dask nodes running. Ready to do a terraform apply?

consideRatio (Contributor, Author)

Wieee yes!!!

GeorgianaElena (Member)

🚀 🚀 🚀

(screenshot: 2023-08-25 at 12:00:13)

GeorgianaElena self-assigned this on Aug 25, 2023
GeorgianaElena (Member)

Feel free to merge whenever you are ready @consideRatio 🚀

consideRatio merged commit c0d2e5c into 2i2c-org:master on Aug 25, 2023
consideRatio (Contributor, Author)

Thank you @GeorgianaElena!!!!!! Massive help!

github-actions (bot)

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/5973999496

Linked issue: Transition to using only a single node pool for dask-gateway workers (16 CPU, highmem)