Setup nodepool for neurohackademy #2758

Merged: 16 commits merged into 2i2c-org:master from the neurohackademy-nodepool branch on Jul 6, 2023

Conversation

sgibson91
Member

Reconstruction/reversion of PR #1726

Working towards #2681

@sgibson91 sgibson91 requested a review from a team as a code owner July 4, 2023 11:45
@sgibson91
Member Author

terraform plan output under the fold. I seem to have picked up some climatematch-related changes, which could either be related to #2757 or just an out-of-date state. I will stash my changes and try a refresh-only plan (terraform plan -refresh-only) first.

tf plan output:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.dask_worker["worker"] will be updated in-place
  ~ resource "google_container_node_pool" "dask_worker" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/dask-worker"
        name                        = "dask-worker"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["climatematch"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-climatematch"
        name                        = "nb-climatematch"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = (known after apply)
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n1-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-user"
        name                        = "nb-user"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

Plan: 1 to add, 3 to change, 0 to destroy.

Changes to Outputs:
  ~ regular_channel_latest_k8s_versions = {
      ~ "1."    = "1.26.3-gke.1000" -> "1.27.2-gke.1200"
      ~ "1.22." = "1.22.17-gke.8000" -> "1.22.17-gke.12700"
      ~ "1.23." = "1.23.17-gke.2000" -> "1.23.17-gke.6800"
      ~ "1.24." = "1.24.13-gke.2500" -> "1.24.14-gke.1200"
      ~ "1.25." = "1.25.8-gke.1000" -> "1.25.10-gke.1200"
    }

Contributor

@pnasrat pnasrat left a comment

LGTM

Please add the requested comment referring to the issue tracking this event, then I will approve.
@@ -35,6 +35,15 @@ notebook_nodes = {
resource_labels : {
"community" : "climatematch"
}
},
"neurohackademy" : {
# We expect around 120 users
Contributor

Can you add a comment linking to the issue?

Member Author
Done!
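
For illustration only, here is a hedged sketch of the kind of tfvars entry the diff above adds. The key names mirror the climatematch entry visible in the excerpt, but the specific values are assumptions rather than the actual change:

    # Hypothetical sketch of the neurohackademy entry in notebook_nodes;
    # values are assumed, not copied from the real diff.
    "neurohackademy" : {
      # We expect around 120 users
      # Working towards #2681 (Neurohackademy event nodepool)
      min : 1,
      max : 100,
      machine_type : "n2-highmem-16",   # assumed; matches the plan output below
      resource_labels : {
        "community" : "neurohackademy"
      }
    },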

@sgibson91
Member Author

sgibson91 commented Jul 4, 2023

New tf plan output below. Still seeing some unrelated changes, but I think they are non-destructive.

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # google_container_node_pool.dask_worker["worker"] will be updated in-place
  ~ resource "google_container_node_pool" "dask_worker" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/dask-worker"
        name                        = "dask-worker"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["climatematch"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-climatematch"
        name                        = "nb-climatematch"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = (known after apply)
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

  # google_container_node_pool.notebook["user"] will be updated in-place
  ~ resource "google_container_node_pool" "notebook" {
        id                          = "projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-user"
        name                        = "nb-user"
      ~ node_locations              = [
          - "us-central1-b",
        ]
        # (8 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

Plan: 1 to add, 3 to change, 0 to destroy.

Contributor

@pnasrat pnasrat left a comment
LGTM

@sgibson91
Member Author

sgibson91 commented Jul 4, 2023

OK, the changes to the climatematch, user, and worker nodepools are causing issues for my changes:

google_container_node_pool.notebook["climatematch"]: Modifying... [id=projects/two-eye-two-see/locations/us-central1-b/clusters/pilot-hubs-cluster/nodePools/nb-climatematch]
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0x470707e08556526a"
│   }
│ ]
│ , badRequest
│ 
│   with google_container_node_pool.notebook["climatematch"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵
╷
│ Error: Cannot determine zone: set in this resource, or set provider-level zone.
│ 
│   with google_container_node_pool.notebook["neurohackademy"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0xcb33b6fe2284f331"
│   }
│ ]
│ , badRequest
│ 
│   with google_container_node_pool.notebook["user"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0xbde1b90f3b71a878"
│   }
│ ]
│ , badRequest
│ 
│   with google_container_node_pool.dask_worker["worker"],
│   on cluster.tf line 339, in resource "google_container_node_pool" "dask_worker":
│  339: resource "google_container_node_pool" "dask_worker" {

@sgibson91
Member Author

I have now defined the expected zones for those nodepools; the new plan output is below and I suspect it will apply cleanly now.

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_container_node_pool.notebook["neurohackademy"] will be created
  + resource "google_container_node_pool" "notebook" {
      + cluster                     = "pilot-hubs-cluster"
      + id                          = (known after apply)
      + initial_node_count          = 1
      + instance_group_urls         = (known after apply)
      + location                    = (known after apply)
      + managed_instance_group_urls = (known after apply)
      + max_pods_per_node           = (known after apply)
      + name                        = "nb-neurohackademy"
      + name_prefix                 = (known after apply)
      + node_count                  = (known after apply)
      + node_locations              = (known after apply)
      + operation                   = (known after apply)
      + project                     = "two-eye-two-see"
      + version                     = (known after apply)

      + autoscaling {
          + location_policy = (known after apply)
          + max_node_count  = 100
          + min_node_count  = 1
        }

      + management {
          + auto_repair  = true
          + auto_upgrade = false
        }

      + node_config {
          + disk_size_gb      = (known after apply)
          + disk_type         = "pd-balanced"
          + guest_accelerator = (known after apply)
          + image_type        = (known after apply)
          + labels            = {
              + "2i2c.org/community"           = "neurohackademy"
              + "hub.jupyter.org/node-purpose" = "user"
              + "k8s.dask.org/node-purpose"    = "scheduler"
            }
          + local_ssd_count   = (known after apply)
          + logging_variant   = "DEFAULT"
          + machine_type      = "n2-highmem-16"
          + metadata          = (known after apply)
          + min_cpu_platform  = (known after apply)
          + oauth_scopes      = [
              + "https://www.googleapis.com/auth/cloud-platform",
            ]
          + preemptible       = false
          + service_account   = "[email protected]"
          + spot              = false
          + tags              = []
          + taint             = [
              + {
                  + effect = "NO_SCHEDULE"
                  + key    = "hub.jupyter.org_dedicated"
                  + value  = "user"
                },
            ]

          + workload_metadata_config {
              + mode = "GKE_METADATA"
            }
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

@sgibson91
Member Author

OK, it did not apply cleanly, and I will have to define zones for the neurohackademy nodepool too. I have opened #2759, as I suspect this is a bug in how we have constructed the Terraform config.

@sgibson91
Member Author

Nope, I am still seeing the error even when I have defined a zone for the nodepool:

google_container_node_pool.notebook["neurohackademy"]: Creating...
╷
│ Error: Cannot determine zone: set in this resource, or set provider-level zone.
│ 
│   with google_container_node_pool.notebook["neurohackademy"],
│   on cluster.tf line 238, in resource "google_container_node_pool" "notebook":
│  238: resource "google_container_node_pool" "notebook" {
│ 
╵


@pnasrat
Contributor

pnasrat commented Jul 5, 2023

@sgibson91 I'm also seeing issues with node pools and terraform; I'll try to debug some this morning.

@yuvipanda
Member

@sgibson91 @pnasrat I've fixed this up with the following set of changes:

  1. 658ec6c, the primary issue here: terraform state show google_container_cluster.cluster showed me that google_container_cluster.cluster.node_locations is just always null, which is what was causing the issue we were facing (see the illustrative sketch after this comment).
  2. Moved back to n1-highmem, because we didn't have enough quota for n2!
  3. Added taints and resource labels for the dedicated nodepool, as per https://infrastructure.2i2c.org/howto/features/dedicated-nodepool/ (this is only a few weeks old). Without the taints, our dedicated nodepools weren't actually truly dedicated before.

I've applied this as well, so it's ready to merge.
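
For context, a minimal sketch (not the actual 2i2c module code) of how a zonal node pool can be pinned to an explicit zone when the parent cluster's node_locations attribute comes back null; var.zone is a hypothetical variable name standing in for however the zone is configured:

    # Sketch only: gives the node pool an explicit zone so the provider never
    # has to infer one from the (null) cluster attribute or provider config.
    resource "google_container_node_pool" "notebook" {
      name    = "nb-neurohackademy"
      cluster = google_container_cluster.cluster.name

      # An explicit zone here avoids "Cannot determine zone" when neither the
      # resource nor the provider defines one.
      location       = var.zone       # e.g. "us-central1-b"
      node_locations = [var.zone]     # pin the nodes to the same zone

      initial_node_count = 1

      node_config {
        machine_type = "n1-highmem-16"
      }
    }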

@yuvipanda
Member

yuvipanda commented Jul 5, 2023

The reason this was not caught as part of #2406 is that regional clusters do have google_container_cluster.cluster.node_locations defined correctly. I have validated that the leap tfvars still apply after this change.

@yuvipanda yuvipanda mentioned this pull request Jul 5, 2023
@yuvipanda
Member

Given that the event only starts on August 7th, I've set the minimum nodepool size to 0, not 1, as otherwise we'll be spending a lot of money on n1-highmem-16 in that time period. I'll document this shortly.
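
In terms of the google_container_node_pool resource shown in the plans above, the change described here is just the autoscaling floor; a minimal sketch (the other value is carried over from the plan output, not the final config):

    autoscaling {
      min_node_count = 0     # scale to zero until the event starts on August 7th
      max_node_count = 100   # as in the plan output above
    }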

@sgibson91
Member Author

Thank you @yuvipanda!

@sgibson91 sgibson91 merged commit dc207ab into 2i2c-org:master Jul 6, 2023
@sgibson91 sgibson91 deleted the neurohackademy-nodepool branch July 6, 2023 08:52