Enabling private nodes for GKE clusters #538
I'm hoping this preserves kubectl access for our local machines and GitHub Actions
I have been trying this out on #489. First speed bump is that …
What happens internally, and can we copy that? We may be able to steal some defaults from this PR.
Other q's to consider:
Looks like we are Routes-based by default, per the terraform config: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#networking_mode If that is a switch we want, it will need to be behind a flag, as it will force-recreate our other GCP clusters.
@yuvipanda recommended watching https://www.youtube.com/watch?v=fI-5LkBDap8 We will definitely need a Cloud NAT.
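For reference, a minimal Cloud NAT setup in Terraform looks roughly like the sketch below. This is illustrative only: the resource names, region, and network are placeholders, not the actual values used in this PR.

```hcl
// Hypothetical sketch: a Cloud Router plus Cloud NAT so that private nodes
// (which have no public IPs) still get outbound internet access.
// Names, region, and network are placeholders.
resource "google_compute_router" "router" {
  name    = "nat-router"
  region  = "us-central1"
  network = "default"
}

resource "google_compute_router_nat" "nat" {
  name   = "nat-config"
  router = google_compute_router.router.name
  region = google_compute_router.router.region

  // Let GCP allocate the external NAT IPs automatically
  nat_ip_allocate_option = "AUTO_ONLY"

  // NAT all IP ranges in all subnetworks of the network
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```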
Ok, here's how I think I'm going to approach this:
- I have a simple terraform config that successfully deploys a private cluster with a VPC and Cloud NAT. It's based off much of the terraform in this tutorial: https://cloud.google.com/nat/docs/gke-example#terraform
- Using terraform v1.0 allows us to use the count parameter in module definitions
- For private clusters, we dynamically provision the following attribute blocks: `private_cluster_config` and `ip_allocation_policy`, and parse the network and subnetwork from the VPC
- Deploy a VPC and Cloud NAT with a router, and a firewall rule to allow SSH
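The approach above could be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the variable name `enable_private_cluster`, the cluster name/location, and the module path are all assumptions.

```hcl
// Hypothetical sketch of conditionally enabling private nodes.
// `var.enable_private_cluster` is an assumed flag name, not necessarily
// the variable this PR uses.
resource "google_container_cluster" "cluster" {
  name     = "example-cluster"
  location = "us-central1-b"

  // Only emitted when private clusters are enabled
  dynamic "private_cluster_config" {
    for_each = var.enable_private_cluster ? [1] : []
    content {
      enable_private_nodes    = true
      enable_private_endpoint = false
      master_ipv4_cidr_block  = "172.16.0.0/28"
    }
  }

  // An ip_allocation_policy block switches the cluster to VPC-native
  // networking; an empty block lets GKE pick default secondary ranges
  dynamic "ip_allocation_policy" {
    for_each = var.enable_private_cluster ? [1] : []
    content {}
  }
}

// Terraform 0.13+ (so certainly v1.0) allows count on module blocks,
// letting us provision supporting modules only for private clusters.
module "cloudnat" {
  count  = var.enable_private_cluster ? 1 : 0
  source = "./cloudnat" // assumed module path
}
```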
…to tf-private-nodes
The terraform config in this PR, coupled with that in #489, successfully deployed. But fetching credentials failed:

```
$ gcloud container clusters get-credentials pangeo-hubs-cluster
Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=403, message=Required "container.clusters.get" permission(s) for "projects/pangeo-integration-te-3eea/zones/us-central1-b/clusters/pangeo-hubs-cluster".
```
This was a combination of an unhelpful error message and me being unobservant: I was trying to pull credentials for a cluster with the wrong name. I redeployed this morning, paying attention to my prefixes, and now I've got credentials.
So it seems that changes are being made to the deployment, outside Terraform, after running `terraform apply`.
I've encountered this before too. It's always been things getting set to empty values.
I think we may as well get this reviewed / merged while I think about the sops issue for deploying a hub. This coupled with #489 has already been deployed to the Pangeo project. |
This looks great! I left some minor comments about style.
Each GCP project comes with a 'default network', and that's where our GKE clusters are launched right now. Can we continue to use it? That would also remove a lot of decisions from our hands, like what IP ranges to use: those can just be the defaults. We'd only need to add the Cloud NAT. Terraform data blocks might be needed here?
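If we do stick with the default network, referencing it via data blocks might look something like this sketch. It assumes the auto-created network is named `default` and uses a placeholder region; neither is confirmed by this PR.

```hcl
// Hypothetical sketch: look up the project's pre-existing default
// network and subnetwork instead of creating our own.
data "google_compute_network" "default" {
  name = "default"
}

data "google_compute_subnetwork" "default" {
  name   = "default"
  region = "us-central1" // placeholder region
}

// The cluster could then reference them, e.g.:
//   network    = data.google_compute_network.default.name
//   subnetwork = data.google_compute_subnetwork.default.name
```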
If we can't use the default network (and its subnetworks), we need to figure out:
- What are the sizes of the various IP ranges we pick for pod, node and service IPs? Picking the defaults seem ok, but let's make sure we know how many pods, nodes, and services we can support with those choices.
- If we change those values in the future, does it recreate the entire cluster?
Thanks a lot for working on this, @sgibson91!!!
```hcl
content {
  // Decide if this CIDR block is sensible or not
  master_ipv4_cidr_block = "172.16.0.0/28"
```
Can we leave this unset?
Alas, no :( #538 (comment)
Co-authored-by: Yuvi Panda <[email protected]>
Replaces creating our own network and subnetwork. This means we don't need to worry about address spaces for pods/services/etc
I think I have achieved this.
```hcl
  source_ranges = ["35.235.240.0/20"]
}
```
```hcl
resource "google_compute_router" "router" {
```
I tried to see if there is a default router we can use too, and based on reading the docs and running `gcloud compute routers list`, I've come to the conclusion we have to create our own. So this LGTM.
I've updated the docstring.
Amazing! 🎉
Summary
This PR enables private nodes for our GKE clusters in the terraform config, and also provisions a VPC and Cloud NAT as supporting network infrastructure. This is mostly to comply with organisational constraints discovered in #489, and to prepare for any future collaborator with similar requirements. I learned a lot from this tutorial when designing this config.
Acceptance Criteria

`kubectl` and `gcloud compute ssh` access must be preserved for engineers' local machines and our CI/CD workflows.

Tests

For each project, check:

- `kubeconfig` entry fetched successfully?
- `kubectl get nodes` successful?
- `gcloud compute ssh NODE_NAME` successful?

Projects tested: `two-eye-two-see-sandbox` (public config), `two-eye-two-see-sandbox` (private config), and `pangeo-integration-te-3eea` (private config).

*Note: public config cannot be tested on `pangeo-integration-te-3eea` as this violates the controls in place on the tenancy.*

Topics to Discuss
- The `private_cluster_config` block: I used values from the tutorial (linked above). Should we change them? If so, what to? See `private_cluster_config` here: https://github.com/2i2c-org/pilot-hubs/blob/c5bd53d5604eeb0ea131000b42065ffe44009395/terraform/cluster.tf#L23-L24
- Seems like a lot of stuff is getting changed after deployment by something (see this comment). We need to understand where these changes are coming from and what to do about them, as I believe it will cause a force-replacement of the cluster if applied without using the `ignore_changes` meta-argument. Update: I think I have fixed this.

What should a reviewer concentrate their feedback on?

The `network` and `cloudnat` terraform modules.