
Use nfs-server-provisioner subchart in support #613

Merged (17 commits) on Aug 30, 2021

Conversation

@sgibson91 (Member) commented Aug 17, 2021

This PR adds the nfs-server-provisioner chart as a dependency of support and allows PVCs to be provisioned as an nfs type. We would no longer need to manually provision an NFS server.
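For illustration, once the provisioner is enabled its StorageClass can be requested directly from a PVC. A minimal sketch, assuming the subchart's default class name of nfs and a hypothetical claim name:

  # Hypothetical PVC requesting the provisioner's StorageClass ("nfs" is the
  # subchart's default class name; assumed not to be overridden here)
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: home-nfs               # illustrative name
  spec:
    accessModes:
      - ReadWriteMany            # NFS lets many pods mount the same volume
    storageClassName: nfs
    resources:
      requests:
        storage: 10Gi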

TODOs

  • Figure out a cleaner install procedure and document it

New issues:

Fixes #50

@yuvipanda (Member) left a comment

This is great! Thank you, @sgibson91 and @consideRatio!

I've a few questions about how the data is stored longer term.

  1. What are the common ways data loss can occur? Does deleting the PV delete the backing disk?
  2. Can we resize them later if required? If so, how?
  3. Will we have one disk per hub or one per cluster?
  4. How can we do backups?

Losing user home dirs is always the worst case scenario, so I just want to make sure we understand what is happening here.

None of these need to block this PR, though.
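For reference, the storage-related knobs these questions concern sit under the subchart's persistence block. A sketch mirroring the values reviewed further down; the commented storageClass line is an assumption about a possible later tweak, not something this PR sets:

  nfs-server-provisioner:
    enabled: true
    persistence:
      enabled: true
      accessMode: ReadWriteOnce   # one backing disk per provisioner instance
      size: 100Gi
      # storageClass: <ssd-backed class>   # assumption: not set in this PR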

(Review threads on support/values.yaml and support/Chart.yaml, both resolved)
@yuvipanda (Member) left a comment

One suggestion, but please merge after that! Thank you :)

(Review thread on support/values.yaml, resolved)
@sgibson91 (Member Author)

I'm noticing some circular dependencies between running python deployer deploy and python deployer deploy-support. With NFS enabled, the support chart needs to be deployed first so the hub-dir-db can be granted an NFS PVC, but that leaves the grafana and prometheus pods pending and hence the install fails. Similarly, deploying the *hub chart first leaves the hub pod in a pending state. So I currently don't have a way of deploying either the support chart or the hub-template chart first without a failure.

Possible solutions:

  • Give all support chart dependencies a condition: <chart name>.enabled flag so we can control the order of deployment (see the sketch after this list)
  • Something else?
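A sketch of what the first option could look like in support/Chart.yaml; the versions and repository URLs below are illustrative, not the exact ones used in this PR:

  dependencies:
    - name: nfs-server-provisioner
      version: "1.1.1"
      repository: "https://kvaps.github.io/charts"
      condition: nfs-server-provisioner.enabled
    - name: ingress-nginx
      version: "3.33.0"
      repository: "https://kubernetes.github.io/ingress-nginx"
      condition: ingress-nginx.enabled
    - name: prometheus
      version: "14.6.0"
      repository: "https://prometheus-community.github.io/helm-charts"
      condition: prometheus.enabled
    - name: grafana
      version: "6.16.2"
      repository: "https://grafana.github.io/helm-charts"
      condition: grafana.enabled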

@yuvipanda (Member)

> but that leaves the grafana and prometheus pods pending and hence the install fails.

What are they waiting on, do you know? At first glance I don't see why they would be dependent on a hub being deployed.

Separately, only the user home directories should be on NFS. Putting our hub db on NFS will lead to poor performance, since sqlite (which is what we use with our hub) works poorly on NFS.

@choldgraf (Member)

I have a meta question - this PR is making this NFS behavior true for all of our hubs, not just for Pangeo, right? If so, is there a place in pilot-hubs where this new functionality could be documented? I worry that, 3 months from now, it will be hard to reverse-understand how our NFS stuff is being provisioned without some digging.

Does it make sense to have a reference section that simply describes the major components of our hub template? (e.g. "where does XXX come from?" "where is YYY configured?") or is it enough to just use lots of comments in our YAML?

@sgibson91 (Member Author)

> this PR is making this NFS behavior true for all of our hubs, not just for Pangeo, right?

Not quite true. It has to be enabled and is disabled by default (see line 37 in support/Chart.yaml - doing links to lines on the mobile app is not easy!). So other hubs shouldn't be impacted if/when this is merged.

@choldgraf (Member)

Ah that is good to know! But I think that is actually another good example of something that we'd ideally have reference documentation on. E.g. how would somebody unfamiliar with this infrastructure quickly learn how NFS is provisioned, what the default behavior is, etc? Since it is disabled by default, how does somebody know when to enable it?

@sgibson91 (Member Author)

All good questions that I'm trying to figure out myself :)

@choldgraf choldgraf requested a review from consideRatio August 19, 2021 19:23
@choldgraf (Member)

@consideRatio you mentioned that you had been doing a dive into NFS stuff lately. Would you mind taking a look at @yuvipanda's questions and my questions above? Would you be willing to help advise on what kind of documentation would best capture this? (Or whether there should be upstream documentation improvements for this?)

@sgibson91 (Member Author)

What do you think about tracking the "clean install" story in a new issue before this PR gets too unwieldy? It might also help distinguish between "technically blocked because something doesn't work" and "something needs more discussion/process/documenting".

@yuvipanda (Member)

@sgibson91 +1 on opening a new clean issue.

> I have basehub.jupyterhub.proxy.https.enabled = false, but I'm wondering if there's anything in daskhub that's trying to talk to ingress-nginx that needs disabling?

Anything from kubectl get ingress -A? Might show us if ingress objects are being added.

@sgibson91 (Member Author)

> Anything from kubectl get ingress -A? Might show us if ingress objects are being added.

$ kubectl get ingress -A
No resources found

@consideRatio (Contributor) commented Aug 23, 2021

Oh I had to allow 6443 before, but it seems now that they have changed the port.

  # The port that the webhook should listen on for requests.
  # In GKE private clusters, by default kubernetes apiservers are allowed to
  # talk to the cluster nodes only on 443 and 10250. so configuring
  # securePort: 10250, will work out of the box without needing to add firewall
  # rules or requiring NET_BIND_SERVICE capabilities to bind port numbers <1000
  securePort: 10250
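
If the webhook stays enabled, the same idea can be expressed through the cert-manager Helm values. A sketch, assuming the installed chart version exposes webhook.securePort:

  # Bind the webhook to a port GKE private clusters already allow from the
  # control plane, avoiding an extra firewall rule
  webhook:
    securePort: 10250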

@sgibson91 I believe that cert-manager may give you trouble if you try to disable the webhook in modern versions. What version are you installing? You pointed to documentation about version 0.9, which is very old at this point (from Jul 23, 2019).


From the error, I believe that the k8s api-server tries to communicate with the cert-manager webhook over HTTPS and requires the webhook to have a certificate the api-server knows about. I think this may be something that is set up via the creation of CertificateSigningRequest resources or similar, and that it may require some time to pass before it works.

When things fail with the webhook enabled, it may be because not enough time has elapsed for cert-manager to properly get up and running. I'm not at all sure, though.

@sgibson91 (Member Author)

Found the same info in v1.5: https://cert-manager.io/docs/concepts/webhook/#webhook-connection-problems-on-gke-private-cluster

Version being installed: https://github.com/2i2c-org/pilot-hubs/blob/386d9a0b351878d693d7fd8a500222e726a90792/deployer/hub.py#L84

I would re-enable it after I'd deployed the hub and set up a DNS record. The whole sticking point of this PR is that other stuff in the support chart is trying to do its thing before I'm ready for it, because I don't have a hub up yet.

What I want to do is the following (a values sketch for step 1 is below):

  1. Deploy nfs-server-provisioner
  2. Deploy a hub
  3. Deploy other support components, i.e. prometheus, grafana, ingress-nginx, set up the DNS, and then let cert-manager do its thing.

But at the minute, Helm seems to be complaining about not finding things that haven't been installed yet, and I'm not sure what part of basehub/daskhub/whatever is making those calls.
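
A sketch of what step 1 could look like in the support values, assuming each dependency gets the condition flag discussed above (the flag names are hypothetical):

  # Step 1: deploy only the NFS provisioner; everything else waits for the hub
  nfs-server-provisioner:
    enabled: true
  ingress-nginx:
    enabled: false
  prometheus:
    enabled: false
  grafana:
    enabled: false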

@sgibson91 (Member Author)

Deleting the cluster and starting again from scratch. The only requirement should be to run deploy-support and then deploy.

@sgibson91 (Member Author)

So on a fresh cluster:

The whole of the support chart installed fine (no disabling of any sub-charts, just nfs-server-provisioner enabled).

Then I tried to deploy the hub and got:

Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://support-ingress-nginx-controller-admission.support.svc:443/networking/v1beta1/ingresses?timeout=10s": context deadline exceeded

But all pods came up successfully.

@consideRatio (Contributor) commented Aug 23, 2021

Hmmmm? What is the support-ingress-nginx-controller-admission service in the support namespace? It seems like a k8s resource validation webhook is registered by the ingress-nginx controller as well, is that correct? I didn't know they did that.

Oh! I guess that is correct.

  ingress-nginx-admission   1   8d
  cert-manager-webhook      1   8d

  ports:
  - name: https-webhook
    port: 443
    protocol: TCP
    targetPort: webhook

It points to a pod's "webhook" port, which is defined to be 8443.

    - containerPort: 8443
      name: webhook
      protocol: TCP

So it seems you need to allow port 8443 between the peered GKE project where the k8s api-server runs and the actual GCP project we manage. Or perhaps configure that port to be different?

@sgibson91 (Member Author)

@consideRatio (Contributor) commented Aug 23, 2021

@sgibson91 perhaps you can update the config so this template renders to an already-allowed port? Or disable it as well.

          {{- if .Values.controller.admissionWebhooks.enabled }}
            - name: webhook
              containerPort: {{ .Values.controller.admissionWebhooks.port }}
              protocol: TCP
          {{- end }}

https://github.com/kubernetes/ingress-nginx/blob/605c243d7ae49e11202ea106bebc205b45b26333/charts/ingress-nginx/templates/controller-deployment.yaml#L169-L173
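
For example, a sketch of overriding that value from the support chart's values; the choice of 10250 is an assumption, reusing a port GKE private clusters already allow from the control plane (as in the cert-manager note above):

  ingress-nginx:
    controller:
      admissionWebhooks:
        # Render the admission webhook on an already-allowed port instead of 8443
        port: 10250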

@sgibson91 (Member Author)

I tried adding another firewall rule in Terraform as well, but that didn't seem to help:

resource "google_compute_firewall" "ingress_nginx_webhook_validation_ingress" {
  count = var.enable_private_cluster ? 1 : 0

  name    = "ingress-ngix-webhook-validation"
  project = var.project_id
  network = data.google_compute_network.default_network.name

  allow {
    protocol = "tcp"
    ports    = ["8443"]
  }

  // This range contains all IP addresses that IAP uses for TCP forwarding.
  // https://cloud.google.com/iap/docs/using-tcp-forwarding
  // Note: the admission webhook calls come from the GKE control plane's peered
  // master CIDR, not the IAP range, which is likely why this rule didn't help.
  source_ranges = ["35.235.240.0/20"]
}

@yuvipanda (Member)

@sgibson91 (Member Author) commented Aug 23, 2021

Disabling the admission webhook seemed to work. Got a hub in CrashLoopBackOff though:

  Warning  Unhealthy               61s (x12 over 83s)  kubelet                  Readiness probe failed: Get "http://10.8.0.32:8081/hub/health": dial tcp 10.8.0.32:8081: connect: connection refused

Probably need to allow traffic to/from port 8081 as well

Update: Literally as I typed this it resolved itself

@sgibson91 (Member Author)

I disabled the webhook over here in #597

@sgibson91 (Member Author)

*collapses* https://staging.pangeo.2i2c.cloud/

@choldgraf mentioned this pull request Aug 23, 2021
  accessMode: ReadWriteOnce
  size: 100Gi
  # Future option is to reference an xfs storage class. This will allow us to enable quotas.
  # https://github.com/pangeo-data/pangeo-cloud-federation/issues/654#issuecomment-861771398
@yuvipanda (Member) commented on this diff

On most cloud providers, the storageClass is a reference to the underlying physical storage only - spinning disks or SSDs. I think the nfs external provisioner already uses XFS, and so the storageClass wouldn't come up. I think the reference in the provided link is for bare metal instances only.

(Commit: "Since we also use out-of-cluster NFS in a few places, let's name this explicitly.")
@yuvipanda (Member)

I made some minor changes, this is good to go!

@yuvipanda yuvipanda merged commit c1d06be into 2i2c-org:master Aug 30, 2021
@yuvipanda (Member)

Great work, @consideRatio and @sgibson91!

@choldgraf (Member)

Amazing! Thanks everybody for collaborating on this and getting it through!

I've updated #629 to track documenting this configuration, so that we don't lose it. I think we should document it relatively soon while the information is still fresh.

Merging this pull request may close: Run NFS servers in-cluster (#50)