Use nfs-server-provisioner subchart in support #613
Conversation
Co-authored-by: Erik Sundell <[email protected]>
This is great! Thank you, @sgibson91 and @consideRatio!
I've a few questions about how the data is stored longer term.
- What are the common ways data loss can occur? Does deleting the PV delete the backing disk?
- Can we resize them later if required? If so, how?
- Will we have one disk per hub or one per cluster?
- How can we do backups?
Losing user home dirs is always the worst case scenario, so I just want to make sure we understand what is happening here.
None of these need to block this PR, though.
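For context on the first two questions above: whether deleting a PV also deletes the backing disk, and whether the disk can be grown later, is governed by the backing StorageClass. A minimal sketch of the relevant knobs on GKE (the class name and parameters are placeholders, not something defined in this PR):

```yaml
# Hypothetical StorageClass illustrating the two knobs in question; not part of this PR.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: balanced-retain          # placeholder name
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
reclaimPolicy: Retain            # "Delete" (the default) removes the backing GCE disk along with the PV
allowVolumeExpansion: true       # allows growing the disk later by editing the PVC's requested size
```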
One suggestion, but please merge after that! Thank you :)
I'm noticing some circular dependencies between running […]. Possible solutions: […]
What are they waiting on, do you know? At first glance I don't know what would depend on a hub being deployed. Separately, only the user home directories should be on NFS. Putting our hub db on NFS will lead to poor performance, since sqlite (which is what we use with our hub) works poorly on NFS.
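A sketch of what that separation looks like in Zero to JupyterHub config terms, assuming the NFS class ends up being called `nfs` (value paths are from the upstream z2jh chart, not this repo's config):

```yaml
# Sketch only: keep the hub's sqlite db on block storage, put user homes on NFS.
hub:
  db:
    type: sqlite-pvc
    pvc:
      storageClassName: standard   # block storage; sqlite performs poorly over NFS
singleuser:
  storage:
    dynamic:
      storageClass: nfs            # assumed name of the class created by nfs-server-provisioner
```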
I have a meta question: this PR is making this NFS behavior true for all of our hubs, not just for Pangeo, right? If so, is there a place in […]? Does it make sense to have a […]?
Not quite true. It has to be enabled and is disabled by default (see line 37 in […]).
Ah that is good to know! But I think that is actually another good example of something that we'd ideally have reference documentation on. E.g. how would somebody unfamiliar with this infrastructure quickly learn how NFS is provisioned, what the default behavior is, etc? Since it is disabled by default, how does somebody know when to enable it?
All good questions that I'm trying to figure out myself :)
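For anyone piecing this together later: the enable/disable behaviour being discussed is the standard Helm subchart-condition pattern. A rough sketch of its shape (file contents, version, and chart repository are illustrative assumptions, not the actual lines referenced above):

```yaml
# Chart.yaml of the support chart (sketch): the subchart is pulled in behind a condition flag.
dependencies:
  - name: nfs-server-provisioner
    version: "x.y.z"                              # placeholder version
    repository: https://kvaps.github.io/charts    # assumed upstream chart repository
    condition: nfs-server-provisioner.enabled
---
# values.yaml (sketch): the flag defaults to off; a cluster's config flips it on when needed.
nfs-server-provisioner:
  enabled: false
```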
@consideRatio you mentioned that you had been doing a dive into NFS stuff lately. Would you mind taking a look at @yuvipanda's questions and my questions above? Would you be willing to help advise on what kind of documentation would best capture this? (Or whether there should be upstream documentation improvements for this?)
What do you think about tracking the "clean install" story in a new issue before this PR gets too unwieldy? It might also help distinguish between "technically blocked because something doesn't work" and "something needs more discussion/process/documenting".
@sgibson91 +1 on opening a new clean issue.
Anything from […]?
Oh I had to allow 6443 before, but it seems now that they have changed the port.
@sgibson91 I believe that cert-manager may give you trouble if you try to disable the webhook in modern versions. What version are you installing? You pointed to documentation about version 0.9, which is very old at this point (from Jul 23, 2019).

From the error, I believe that the k8s api-server tries to communicate with the cert-manager webhook over HTTPS and requires the webhook to present a certificate it already trusts. I think this may be something that is set up via the creation of CertificateSigningRequest resources or similar, and that it may require some time to pass before it works. When things fail with the webhook enabled, it may be because not enough time has elapsed for cert-manager to properly get up and running. I'm not at all sure though.
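To illustrate the mechanism being described: the api-server calls the webhook Service over HTTPS, and the CA bundle it uses to trust the webhook's serving certificate is only injected once cert-manager has finished bootstrapping (in recent versions this is done by its cainjector component), which is why a freshly installed webhook can fail for a while. A rough, hedged sketch of the shape of that registration, not the actual manifest cert-manager installs:

```yaml
# Illustrative shape of cert-manager's webhook registration; not meant to be applied as-is.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: cert-manager-webhook
webhooks:
  - name: webhook.cert-manager.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: cert-manager-webhook   # the api-server dials this Service over HTTPS
        namespace: cert-manager
        port: 443
      caBundle: ""                   # injected by cert-manager's cainjector once it is up
```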
Found the same info in the v1.5 docs: https://cert-manager.io/docs/concepts/webhook/#webhook-connection-problems-on-gke-private-cluster

Version being installed: https://github.com/2i2c-org/pilot-hubs/blob/386d9a0b351878d693d7fd8a500222e726a90792/deployer/hub.py#L84

I would re-enable it after I'd deployed the hub and set up a DNS record. The whole sticking point of this PR is that other stuff in the […]. What I want to do is the following:
But at the minute, helm seems to be complaining about not finding things that haven't been installed yet, and I'm not sure what part of […].
Deleting the cluster and starting again from scratch. The only precondition should be to run […].
So, on a fresh cluster: the whole of the support chart installed fine (no disabling of any sub-charts, just enabled […]). Then I tried to deploy the hub and got: […]
But all pods came up successfully.
… by a flag too" This reverts commit ab9e77d.
Hmmmm? What is the […]?

Oh! I guess that is correct.

```yaml
ports:
  - name: https-webhook
    port: 443
    protocol: TCP
    targetPort: webhook
```

It points to a pod's "webhook" port, which is defined to be 8443:

```yaml
- containerPort: 8443
  name: webhook
  protocol: TCP
```

So, it seems you need to allow port 8443 between the peered GKE project where the k8s api-server runs and the actual GCP project we manage. Or perhaps configure that port to be different?
@sgibson91 perhaps you can update the config so this template renders to an already allowed port? Or disable it as well.
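If that port is indeed templated from chart values, the relevant knob is probably the ingress-nginx chart's admission-webhook port. A hedged sketch of what an override could look like in the support chart's values (nesting under an `ingress-nginx` subchart key is assumed; verify the value path against the chart version in use):

```yaml
# Sketch: either change the webhook's container port to one the firewall already allows,
# or keep 8443 and open 8443 from the GKE control-plane range instead.
ingress-nginx:
  controller:
    admissionWebhooks:
      port: 8443   # the "webhook" containerPort that the Service's targetPort resolves to
```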
I tried adding another firewall rule in Terraform as well, but that didn't seem to help.
Let's turn off the nginx-ingress validation webhook too and try? https://github.com/kubernetes/ingress-nginx/blob/605c243d7ae49e11202ea106bebc205b45b26333/charts/ingress-nginx/values.yaml#L525
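For the record, a sketch of what turning it off would look like in values form (again assuming the subchart is nested under an `ingress-nginx` key in the support chart):

```yaml
ingress-nginx:
  controller:
    admissionWebhooks:
      enabled: false   # skips creating the webhook Service and ValidatingWebhookConfiguration
```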
Disabling the admission webhook seemed to work. Got a hub in CrashLoopBackOff, though.
Probably need to allow traffic to/from port 8081 as well.

Update: Literally as I typed this, it resolved itself.
*collapses* https://staging.pangeo.2i2c.cloud/
support/values.yaml (outdated)
```yaml
accessMode: ReadWriteOnce
size: 100Gi
# Future option is to reference an xfs storage class. This will allow us to enable quotas.
# https://github.com/pangeo-data/pangeo-cloud-federation/issues/654#issuecomment-861771398
```
On most cloud providers, the storageClass is a reference to the underlying physical storage only - spinning disks or SSDs. I think the nfs external provisioner already uses XFS, and so the storageClass wouldn't come up. I think the reference in the provided link is for bare metal instances only.
Since we also use out-of-cluster NFS in a few places, let's name this explicitly.
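For reference, the settings under discussion sit in the provisioner's persistence block; a sketch with the backing class and an explicit name for the created class spelled out (key names per the upstream nfs-server-provisioner chart; both class names shown are assumptions, not this PR's values):

```yaml
nfs-server-provisioner:
  persistence:
    enabled: true
    storageClass: standard       # explicit block-storage class backing the provisioner's own disk
    accessMode: ReadWriteOnce
    size: 100Gi
  storageClass:
    name: in-cluster-nfs         # hypothetical explicit name, to avoid confusion with out-of-cluster NFS
```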
I made some minor changes, this is good to go!
Great work, @consideRatio and @sgibson91!
Amazing! Thanks, everybody, for collaborating on this and getting it through! I've updated #629 to track documenting this configuration, so that we don't lose it. I think we should document it relatively soon, while the information is still fresh.
This PR adds the `nfs-server-provisioner` chart as a dependency of `support` and allows PVCs to be provisioned as an nfs type. We would no longer need to manually provision an NFS server.

TODOs
New issues:
Fixes #50
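As a concrete illustration of what "PVCs provisioned as an nfs type" means for downstream charts, a hedged sketch of a claim against the new class (the claim name is hypothetical, and the storageClassName must match whatever the subchart's class actually ends up being called):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-nfs            # hypothetical claim for user home directories
spec:
  accessModes:
    - ReadWriteMany         # NFS-backed volumes can be mounted by many pods at once
  storageClassName: nfs     # assumed name of the class created by nfs-server-provisioner
  resources:
    requests:
      storage: 10Gi
```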