
Use nfs-server-provisioner subchart in support #613

Merged (17 commits) on Aug 30, 2021

Conversation

@sgibson91 (Member) commented Aug 17, 2021

This PR adds the nfs-server-provisioner chart as a dependency of support and allows PVCs to be provisioned as an nfs type. We would no longer need to manually provision an NFS server.
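For illustration, once the provisioner is enabled its StorageClass can be requested directly from a PVC. A minimal sketch, assuming the subchart's default class name of nfs and a hypothetical claim name:

  # Hypothetical PVC requesting the provisioner's StorageClass ("nfs" is the
  # subchart's default class name; assumed not to be overridden here)
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: home-nfs               # illustrative name
  spec:
    accessModes:
      - ReadWriteMany            # NFS lets many pods mount the same volume
    storageClassName: nfs
    resources:
      requests:
        storage: 10Gi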

TODOs

  • Figure out a cleaner install procedure and document it

New issues:

Fixes #50

@yuvipanda (Member) left a comment

This is great! Thank you, @sgibson91 and @consideRatio!

I've a few questions about how the data is stored longer term.

  1. What are the common ways data loss can occur? Does deleting the PV delete the backing disk?
  2. Can we resize them later if required? If so, how?
  3. Will we have one disk per hub or one per cluster?
  4. How can we do backups?

Losing user home dirs is always the worst case scenario, so I just want to make sure we understand what is happening here.

None of these need to block this PR, though.
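For reference, the storage-related knobs these questions concern sit under the subchart's persistence block. A sketch mirroring the values reviewed further down; the commented storageClass line is an assumption about a possible later tweak, not something this PR sets:

  nfs-server-provisioner:
    enabled: true
    persistence:
      enabled: true
      accessMode: ReadWriteOnce   # one backing disk per provisioner instance
      size: 100Gi
      # storageClass: <ssd-backed class>   # assumption: not set in this PR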

(Review threads on support/values.yaml and support/Chart.yaml, both resolved)
@yuvipanda (Member) left a comment

One suggestion, but please merge after that! Thank you :)

(Review thread on support/values.yaml, resolved)
@sgibson91 (Member Author)

I'm noticing some circular dependencies between running python deployer deploy and python deployer deploy-support. With NFS enabled, the support chart needs to be deployed first so the hub-dir-db can be granted an NFS PVC, but that leaves the grafana and prometheus pods pending and hence the install fails. Similarly, deploying the *hub chart first leaves the hub pod in a pending state. So I currently don't have a way of deploying either the support chart or the hub-template chart first without a failure.

Possible solutions:

  • Give all support chart dependencies a condition: <chart name>.enabled flag so we can control the order of deployment (see the sketch after this list)
  • Something else?
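A sketch of what the first option could look like in support/Chart.yaml; the versions and repository URLs below are illustrative, not the exact ones used in this PR:

  dependencies:
    - name: nfs-server-provisioner
      version: "1.1.1"
      repository: "https://kvaps.github.io/charts"
      condition: nfs-server-provisioner.enabled
    - name: ingress-nginx
      version: "3.33.0"
      repository: "https://kubernetes.github.io/ingress-nginx"
      condition: ingress-nginx.enabled
    - name: prometheus
      version: "14.6.0"
      repository: "https://prometheus-community.github.io/helm-charts"
      condition: prometheus.enabled
    - name: grafana
      version: "6.16.2"
      repository: "https://grafana.github.io/helm-charts"
      condition: grafana.enabled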

@yuvipanda (Member)

> but that leaves the grafana and prometheus pods pending and hence the install fails.

What are they waiting on, do you know? At first glance I don't see why they would be dependent on a hub being deployed.

Separately, only the user home directories should be on NFS. Putting our hub db on NFS will lead to poor performance, since sqlite (which is what we use with our hub) works poorly on NFS.

@choldgraf (Member)

I have a meta question - this PR is making this NFS behavior true for all of our hubs, not just for Pangeo, right? If so, is there a place in pilot-hubs where this new functionality could be documented? I worry that, 3 months from now, it will be hard to reverse-understand how our NFS stuff is being provisioned without some digging.

Does it make sense to have a reference section that simply describes the major components of our hub template? (e.g. "where does XXX come from?" "where is YYY configured?") or is it enough to just use lots of comments in our YAML?

@sgibson91 (Member Author)

> this PR is making this NFS behavior true for all of our hubs, not just for Pangeo, right?

Not quite true. It has to be enabled and is disabled by default (see line 37 in support/Chart.yaml - doing links to lines on the mobile app is not easy!). So other hubs shouldn't be impacted if/when this is merged.

@choldgraf (Member)

Ah that is good to know! But I think that is actually another good example of something that we'd ideally have reference documentation on. E.g. how would somebody unfamiliar with this infrastructure quickly learn how NFS is provisioned, what the default behavior is, etc? Since it is disabled by default, how does somebody know when to enable it?

@sgibson91 (Member Author)

All good questions that I'm trying to figure out myself :)

@choldgraf choldgraf requested a review from consideRatio August 19, 2021 19:23
@choldgraf (Member)

@consideRatio you mentioned that you had been doing a dive into NFS stuff lately. Would you mind taking a look at @yuvipanda's questions and my questions above? Would you be willing to help advise on what kind of documentation would best capture this? (Or whether there should be upstream documentation improvements for this?)

@sgibson91 (Member Author)

What do you think about tracking the "clean install" story in a new issue before this PR gets too unwieldy? It might also help distinguish between "technically blocked because something doesn't work" and "something needs more discussion/process/documenting".

@yuvipanda (Member)

@sgibson91 +1 on opening a new clean issue.

> I have basehub.jupyterhub.proxy.https.enabled = false, but I'm wondering if there's anything in daskhub that's trying to talk to ingress-nginx that needs disabling?

Anything from kubectl get ingress -A? Might show us if ingress objects are being added.

@sgibson91 (Member Author)

> Anything from kubectl get ingress -A? Might show us if ingress objects are being added.

$ kubectl get ingress -A
No resources found

@consideRatio (Contributor) commented Aug 23, 2021

Oh I had to allow 6443 before, but it seems now that they have changed the port.

  # The port that the webhook should listen on for requests.
  # In GKE private clusters, by default kubernetes apiservers are allowed to
  # talk to the cluster nodes only on 443 and 10250. so configuring
  # securePort: 10250, will work out of the box without needing to add firewall
  # rules or requiring NET_BIND_SERVICE capabilities to bind port numbers <1000
  securePort: 10250
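
If the webhook stays enabled, the same idea can be expressed through the cert-manager Helm values. A sketch, assuming the installed chart version exposes webhook.securePort:

  # Bind the webhook to a port GKE private clusters already allow from the
  # control plane, avoiding an extra firewall rule
  webhook:
    securePort: 10250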

@sgibson91 I believe that cert-manager may give you trouble if you try to disable the webhook in modern versions. What version are you installing? You pointed to documentation about version 0.9, which is very old at this point (from Jul 23, 2019).


From the error, I believe that the k8s api-server tries to communicate with the cert-manager webhook over HTTPS and requires the webhook to have a certificate the api-server knows about. I think this may be something that is set up via the creation of CertificateSigningRequest resources or similar, and that it may require some time to pass before it works.

When things fail with the webhook enabled, it may be because not enough time has elapsed for cert-manager to properly get up and running. I'm not at all sure, though.

@sgibson91 (Member Author)

Found the same info in v1.5: https://cert-manager.io/docs/concepts/webhook/#webhook-connection-problems-on-gke-private-cluster

Version being installed: https://github.com/2i2c-org/pilot-hubs/blob/386d9a0b351878d693d7fd8a500222e726a90792/deployer/hub.py#L84

I would re-enable it after I'd deployed the hub and set up a DNS record. The whole sticking point of this PR is that other stuff in the support chart is trying to do its thing before I'm ready for it, because I don't have a hub up yet.

What I want to do is the following (a values sketch for step 1 is below):

  1. Deploy nfs-server-provisioner
  2. Deploy a hub
  3. Deploy other support components, i.e. prometheus, grafana, ingress-nginx, set up the DNS, and then let cert-manager do its thing.

But at the minute, Helm seems to be complaining about not finding things that haven't been installed yet, and I'm not sure what part of basehub/daskhub/whatever is making those calls.
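
A sketch of what step 1 could look like in the support values, assuming each dependency gets the condition flag discussed above (the flag names are hypothetical):

  # Step 1: deploy only the NFS provisioner; everything else waits for the hub
  nfs-server-provisioner:
    enabled: true
  ingress-nginx:
    enabled: false
  prometheus:
    enabled: false
  grafana:
    enabled: false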

@sgibson91 (Member Author)

Deleting the cluster and starting again from scratch. The only requirement should be to run deploy-support and then deploy.

@sgibson91 (Member Author)

So on a fresh cluster:

The whole of the support chart installed fine (no disabling of any sub-charts, just nfs-server-provisioner enabled).

Then I tried to deploy the hub and got:

Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://support-ingress-nginx-controller-admission.support.svc:443/networking/v1beta1/ingresses?timeout=10s": context deadline exceeded

But all pods came up successfully.

@consideRatio (Contributor) commented Aug 23, 2021

Hmmmm? What is the support-ingress-nginx-controller-admission service in the support namespace? It seems like a k8s resource validation webhook is registered by the ingress-nginx controller as well, is that correct? I didn't know they did that.

Oh! I guess that is correct.

  ingress-nginx-admission   1   8d
  cert-manager-webhook      1   8d

  ports:
  - name: https-webhook
    port: 443
    protocol: TCP
    targetPort: webhook

It points to a pod's "webhook" port, which is defined to be 8443.

    - containerPort: 8443
      name: webhook
      protocol: TCP

So it seems you need to allow port 8443 between the peered GKE project where the k8s api-server runs and the actual GCP project we manage. Or perhaps configure that port to be different?

@sgibson91 (Member Author)

@consideRatio (Contributor) commented Aug 23, 2021

@sgibson91 perhaps you can update the config so this template renders to an already-allowed port? Or disable it as well.

          {{- if .Values.controller.admissionWebhooks.enabled }}
            - name: webhook
              containerPort: {{ .Values.controller.admissionWebhooks.port }}
              protocol: TCP
          {{- end }}

https://github.com/kubernetes/ingress-nginx/blob/605c243d7ae49e11202ea106bebc205b45b26333/charts/ingress-nginx/templates/controller-deployment.yaml#L169-L173
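
For example, a sketch of overriding that value from the support chart's values; the choice of 10250 is an assumption, reusing a port GKE private clusters already allow from the control plane (as in the cert-manager note above):

  ingress-nginx:
    controller:
      admissionWebhooks:
        # Render the admission webhook on an already-allowed port instead of 8443
        port: 10250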

@sgibson91 (Member Author)

I tried adding another firewall rule in Terraform as well, but that didn't seem to help:

resource "google_compute_firewall" "ingress_nginx_webhook_validation_ingress" {
  count = var.enable_private_cluster ? 1 : 0

  name    = "ingress-ngix-webhook-validation"
  project = var.project_id
  network = data.google_compute_network.default_network.name

  allow {
    protocol = "tcp"
    ports    = ["8443"]
  }

  // This range contains all IP addresses that IAP uses for TCP forwarding.
  // https://cloud.google.com/iap/docs/using-tcp-forwarding
  // Note: the admission webhook calls come from the GKE control plane's peered
  // master CIDR, not the IAP range, which is likely why this rule didn't help.
  source_ranges = ["35.235.240.0/20"]
}

@yuvipanda (Member)

@sgibson91 (Member Author) commented Aug 23, 2021

Disabling the admission webhook seemed to work. Got a hub in CrashLoopBackOff though:

  Warning  Unhealthy               61s (x12 over 83s)  kubelet                  Readiness probe failed: Get "http://10.8.0.32:8081/hub/health": dial tcp 10.8.0.32:8081: connect: connection refused

Probably need to allow traffic to/from port 8081 as well

Update: Literally as I typed this it resolved itself

@sgibson91 (Member Author)

I disabled the webhook over here in #597

@sgibson91 (Member Author)

*collapses* https://staging.pangeo.2i2c.cloud/

@choldgraf mentioned this pull request Aug 23, 2021
  accessMode: ReadWriteOnce
  size: 100Gi
  # Future option is to reference an xfs storage class. This will allow us to enable quotas.
  # https://github.com/pangeo-data/pangeo-cloud-federation/issues/654#issuecomment-861771398
@yuvipanda (Member) commented on this diff

On most cloud providers, the storageClass is a reference to the underlying physical storage only - spinning disks or SSDs. I think the nfs external provisioner already uses XFS, and so the storageClass wouldn't come up. I think the reference in the provided link is for bare metal instances only.

(Commit: "Since we also use out-of-cluster NFS in a few places, let's name this explicitly.")
@yuvipanda (Member)

I made some minor changes, this is good to go!

@yuvipanda yuvipanda merged commit c1d06be into 2i2c-org:master Aug 30, 2021
@yuvipanda (Member)

Great work, @consideRatio and @sgibson91!

@choldgraf (Member)

Amazing! Thanks everybody for collaborating on this and getting it through!

I've updated #629 to track documenting this configuration, so that we don't lose it. I think we should document it relatively soon while the information is still fresh.

Merging this pull request may close: Run NFS servers in-cluster (#50)