
Cluster architecture - core node pools in any zone, or in the user zones? #2769

Closed
consideRatio opened this issue Jul 7, 2023 · 4 comments · Fixed by #2777
@consideRatio
Contributor

In the terraform/gcp configuration, we provide node_locations for the user nodes but not for the core nodes. That means they will start in any zone of the regional cluster, I think. In practice, this means we can end up with core nodes in a different zone than the user nodes.

I suspect this is a bit inefficient, but perhaps not a big deal either. Are there zone-to-zone networking costs etc. that we want to avoid?

Changing this in the common terraform config may require re-creating the core nodes or similar, so I figure pinning the core nodes to the zone(s) of the user nodes would have to be done cluster by cluster with a cluster-specific override (sketched below) until we can do it systematically in the common config.
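
For concreteness, a minimal sketch of what such a cluster-specific override could look like. The variable name core_node_locations, the resource names, and the surrounding structure are illustrative assumptions, not the repo's actual schema:

# Hypothetical per-cluster override: pin the core pool to the same zone(s)
# as the user pools for one cluster only, falling back to the cluster
# default when the variable is left empty.
variable "core_node_locations" {
  type        = list(string)
  default     = []
  description = "Zones to pin the core node pool to; empty means inherit the cluster default."
}

resource "google_container_node_pool" "core" {
  name     = "core-pool"
  cluster  = google_container_cluster.cluster.name
  location = google_container_cluster.cluster.location

  # null lets GKE fall back to the cluster's node_locations
  node_locations = length(var.core_node_locations) > 0 ? var.core_node_locations : null

  # ... rest of the core pool config unchanged ...
}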

@yuvipanda
Member

That means they will start in any zone of the regional cluster, I think.

This should not be true, as they should instead inherit the default from the cluster's node_locations:

node_locations = var.regional_cluster ? [var.zone] : null
I'll investigate the recent issues again to make sure this is the case.
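
To illustrate that inheritance, here is a minimal sketch; it is not the repo's exact config, and the resource and variable names are assumptions:

resource "google_container_cluster" "cluster" {
  name     = var.cluster_name
  location = var.regional_cluster ? var.region : var.zone

  # Regional clusters pin their default zones here
  node_locations = var.regional_cluster ? [var.zone] : null
  # ...
}

resource "google_container_node_pool" "core" {
  name     = "core-pool"
  cluster  = google_container_cluster.cluster.name
  location = google_container_cluster.cluster.location
  # node_locations is omitted, so GKE places this pool's nodes in the
  # cluster's node_locations above (i.e. [var.zone] for regional clusters).
  # ...
}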

@consideRatio
Contributor Author

@yuvipanda ah, looking at a newly created cluster where node_locations isn't explicitly configured for the core node pool, I conclude you are right.

[image: node-zones]

I drew the wrong conclusion from seeing that they were explicitly configured for the user nodes. Perhaps they are explicit there because they were changed over time?

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jul 7, 2023
If zones are not explicitly set for a node pool, it will inherit
whatever is set for the cluster itself. This makes the code
clearer so that this behavior is more obvious.

Fixes 2i2c-org#2769
@yuvipanda
Member

In the latam cluster just created, I see:

$  terraform state show google_container_node_pool.core
# google_container_node_pool.core:
resource "google_container_node_pool" "core" {
    cluster                     = "latam-cluster"
    id                          = "projects/catalystproject-392106/locations/southamerica-east1/clusters/latam-cluster/nodePools/core-pool"
    initial_node_count          = 1
    instance_group_urls         = [
        "https://www.googleapis.com/compute/v1/projects/catalystproject-392106/zones/southamerica-east1-c/instanceGroupManagers/gke-latam-cluster-core-pool-01cfc23e-grp",
    ]
    location                    = "southamerica-east1"
    managed_instance_group_urls = [
        "https://www.googleapis.com/compute/v1/projects/catalystproject-392106/zones/southamerica-east1-c/instanceGroups/gke-latam-cluster-core-pool-01cfc23e-grp",
    ]
    name                        = "core-pool"
    node_count                  = 2
    node_locations              = [
        "southamerica-east1-c",
    ]
    project                     = "catalystproject-392106"
    version                     = "1.27.2-gke.1200"

    autoscaling {
        location_policy      = "BALANCED"
        max_node_count       = 5
        min_node_count       = 1
        total_max_node_count = 0
        total_min_node_count = 0
    }

    management {
        auto_repair  = true
        auto_upgrade = false
    }

    network_config {
        create_pod_range     = false
        enable_private_nodes = false
    }

    node_config {
        disk_size_gb      = 30
        disk_type         = "pd-balanced"
        guest_accelerator = []
        image_type        = "COS_CONTAINERD"
        labels            = {
            "hub.jupyter.org/node-purpose" = "core"
            "k8s.dask.org/node-purpose"    = "core"
        }
        local_ssd_count   = 0
        logging_variant   = "DEFAULT"
        machine_type      = "n2-highmem-2"
        metadata          = {
            "disable-legacy-endpoints" = "true"
        }
        oauth_scopes      = [
            "https://www.googleapis.com/auth/cloud-platform",
        ]
        preemptible       = false
        resource_labels   = {}
        service_account   = "[email protected]"
        spot              = false
        tags              = []
        taint             = []

        shielded_instance_config {
            enable_integrity_monitoring = true
            enable_secure_boot          = false
        }

        workload_metadata_config {
            mode = "GKE_METADATA"
        }
    }

    upgrade_settings {
        max_surge       = 1
        max_unavailable = 0
        strategy        = "SURGE"
    }
}

@yuvipanda
Member

@consideRatio they're set that way so we could override them when necessary, as we do for GPU nodes in LEAP, where the single zone was often running out of GPUs.

I opened #2777 to make this clearer in the code.
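
For illustration, a hedged sketch of the kind of per-pool override described above; the pool name and zones are assumptions, not the actual LEAP config:

resource "google_container_node_pool" "notebook_gpu" {
  name     = "nb-gpu-t4"  # hypothetical pool name
  cluster  = google_container_cluster.cluster.name
  location = google_container_cluster.cluster.location

  # Explicitly setting node_locations on the pool overrides the cluster
  # default, letting a GPU pool draw capacity from several zones when a
  # single zone keeps running out of GPUs.
  node_locations = [
    "us-central1-b",
    "us-central1-c",
    "us-central1-f",
  ]
  # ...
}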

github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Jul 8, 2023
damianavila moved this to Done 🎉 in Sprint Board Jul 11, 2023