volume mode "Persistent" not supported by driver ebs.csi.aws.com (only supports []) #419
This seems to be an issue related to the volume lifecycle mode that was added in 1.16. In this case an existing
I suspect the problem is that the defaulting logic was only added in 1.16, so that a
What's the output of
And could you try reinstalling the CSI driver and see if it works? You might want to check the
I also see your master node is still on 1.15.3; you should upgrade the master to 1.16 before upgrading the worker nodes.
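For background on the error in the title: on a 1.16 kubelet, "only supports []" means the CSIDriver object for ebs.csi.aws.com reports no volumeLifecycleModes - a field a 1.15 API server does not know about, which is why upgrading the master first (and, if needed, re-creating the driver objects) is suggested above. A minimal sketch of what the object is expected to look like on 1.16, assuming the storage.k8s.io/v1beta1 API of that release (verify on your cluster with kubectl get csidriver ebs.csi.aws.com -o yaml):

```yaml
# Sketch only: how the CSIDriver object is expected to look once served by a 1.16
# API server. attachRequired/podInfoOnMount mirror the driver's own manifests;
# volumeLifecycleModes defaulting to Persistent is the 1.16 behaviour assumed here.
apiVersion: storage.k8s.io/v1beta1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  attachRequired: true
  podInfoOnMount: false
  volumeLifecycleModes:
    - Persistent
```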
Hi leakingtapan - thanks for looking into this subject and giving some insights as to what is going wrong. It took me a few days to get back to this and replicate the problem with a 4-node testing cluster. I can confirm that the same error message shows up when joining a 1.16.3 worker node to a functional 1.15.6 cluster (1 master, 2 workers) - so far, no news. Since I learned that, when upgrading a whole cluster, starting with the master nodes is usually the safer route - as a master might be "more" backwards-compatible than a client - I did just that this time. I updated the only master node to v1.16.3 and then joined the v1.16.3 worker node to the cluster again, giving me 2 workers on v1.15.6 plus 1 worker and 1 control-plane node on v1.16.3. Now trying to schedule a pod with storage (i.e. utilizing the aws-ebs-csi-driver) onto the v1.16.3 worker no longer shows the error message from the title, but errors out with the message below, and again the pod is stuck in
For a statically provisioned pod these events show up:
And this shows up for a dynamically provisioned pod:
Now I had an expectation (or more a gut feeling) that these pods would not run on the v1.16.3 worker node after the update - but what really puzzles me right now is: they won't even get scheduled onto one of the old 1.15.6 workers (of which we still have two in the cluster). Below you can see the kubelet journal from the v1.15.6 node that was not able to schedule the aws-ebs-csi-driver backed pods - pods that ran perfectly on the same node while the cluster was still at v1.15.6.
Up until now the v1.15.6 deployed
Applying the helm-generated aws-ebs-csi manifest again gives this (correct - as nothing changed!) output:
So the next step was to delete the whole
This is what
So, nothing really changed here - but the problem persists. It actually got worse, since no pod at all can be scheduled any more. The only thing I noticed is a difference in the event logs, depending on whether the pod is statically or dynamically provisioned. Events of the statically provisioned pod:
Events of the dynamically provisioned pod:
So - after all of this - we arrive again at the point where upgrading from v1.15.x to v1.16.x breaks the aws-ebs-csi-driver's functionality for our cluster. I also don't see which special, snowflakey component our setup would have built into it that would break the whole thing in such mysterious ways. If you need me to post more information or perform further troubleshooting steps, please let me know and I'd be more than happy to help out in any way I can. I will let this cluster sit as it is for now, as it was built purely for testing this weird behaviour. I'll probably try to bootstrap a v1.16.x cluster from scratch and see if I get lucky with the CSI storage for AWS when starting fresh on a green field.
Looks like the original problem is gone after upgrading the master node to 1.16 first. For the following error:
Feels like something is wrong with the volume attachment. What's the log of
And I'm assuming you are running the v0.4 driver on both the 1.15 and 1.16 worker nodes?
So the cluster is on v1.17.0 already, as I tried to resolve the problem by updating the setup a few weeks back. It is still the same 4-node, single-master setup I initially started with, as described above. We are running these components in the cluster:
Trying to deploy a dynamic volume-claim via this code does not work:
Trying to do the above, I get these logs from the ebs-csi-controller pod:
Looking at the dynamic volume claim with kubectl describe pvc aws-csi-dynamic-claim, I can see this output:
Then nothing more happens when trying dynamic provisioning. I will add the output for the static volume setup later.
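The claim manifest referenced above is not shown; for reference, a minimal dynamically provisioned claim against the EBS CSI driver might look like the sketch below. The StorageClass name, parameters, and size are illustrative - only the PVC name is taken from the kubectl describe above.

```yaml
# Illustrative only - not the manifest referenced above.
# A StorageClass backed by the EBS CSI driver plus a PVC that triggers dynamic provisioning.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp2            # hypothetical name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aws-csi-dynamic-claim  # name taken from the kubectl describe above
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-csi-gp2
  resources:
    requests:
      storage: 4Gi             # illustrative size
```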
@gini-schorsch
How did you set up the IAM permissions for the driver?
We are also experiencing this issue. We are only deploying the ebs-plugin on our controller nodes. Our IAM permissions for controller nodes look like this:
We have a slightly different setup than OP (Flannel CNI, using in-tree AWS provider instead of CCM, 3 controller nodes in a 6 node cluster), but otherwise, the details look about the same. All of our volumes are dynamically provisioned. One interesting thing is that one of our deployments (a Prometheus deployment) seems to be running just fine. It consists of 2 pods, each with 1 PVC, and the corresponding volumes for those PVCs are being mounted. However, the other deployments with PVCs we're attempting to deploy do not start up and display the behavior described in the original post:
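The exact policy used above is not shown; as a rough reference, the EC2 actions below correspond roughly to the example policy published with the driver around that time. This is a sketch in CloudFormation-style YAML and should be checked against the docs for the driver version actually in use.

```yaml
# Sketch of the EC2 permissions the EBS CSI controller typically needs
# (roughly the driver's published example policy; scope shown here is illustrative).
PolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Action:
        - ec2:AttachVolume
        - ec2:CreateSnapshot
        - ec2:CreateTags
        - ec2:CreateVolume
        - ec2:DeleteSnapshot
        - ec2:DeleteTags
        - ec2:DeleteVolume
        - ec2:DescribeInstances
        - ec2:DescribeSnapshots
        - ec2:DescribeTags
        - ec2:DescribeVolumes
        - ec2:DetachVolume
      Resource: "*"
```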
@bartelsb see some of the conversation above for context. What's your cluster version, and what does your upgrade process look like? The recommended process is: 1) upgrade the master nodes, 2) upgrade the worker nodes, 3) upgrade add-ons (e.g. the CSI driver).
We start with k8s 1.15.5. We deploy the CSI driver using the
We are considering pulling the CSI driver out entirely, so that if this is related to our Kubernetes version in some way, deploying the CSI driver doesn't block us from upgrading to v1.17.
So for us the issue was resolved after fixing a misconfigured CNI setup, which prevented inter-node communication and therefore meant that storage provisioning never got triggered. We have not tried upgrading our current working cluster (v1.15.x) to any newer version, but we can confirm that mounting volumes and provisioning storage works on v1.17.x when starting from scratch (i.e. building a new test cluster in our case). We are using the specs provided above by @gini-schorsch - but since opening this issue we have also moved to the external AWS cloud-controller-manager (aka. aws-cloud-controller-manager). We have been using the provided IAM profiles for both components (CSI and CCM), cut them down to the use cases we require for our operations, and have not seen any problems with that so far. @bartelsb can you check again whether your network setup is correct? Also check your security groups to make sure all needed ports are allowed to reach your endpoints.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind bug
What happened?
We want to upgrade our current kubernetes cluster running on debian-9-stretch (backports kernel 4.19.0) from v1.15.3 to v1.16.3
To test this - in a non-destructive way - we spun up an additional spot-worker-node (debian-10-buster based with backports kernel 5.3.0) running kubelet v1.16.3 and added that node without problems to the cluster.
As far as we understand version-skew, this should not be a problem.
But if we now schedule a deployment that needs statically provisioned storage via the
driver: ebs.csi.aws.com
onto that v1.16.3 node, we see the below messages appear in the event log of the pod description - kubelet is throwing these errors at us:
The pod gets stuck in ContainerCreating and is never spawned. As soon as the deployment gets scheduled onto a debian-stretch based v1.15.3 worker, disks get attached and mounted without a problem.
What you expected to happen?
Our expectation would be that, with a cluster update from v1.15.3 to v1.16.x, the aws-ebs-csi plugin would continue to function without problems - as was the case with earlier releases (here: v1.15.3).
Attaching and mounting disks was working for our setup without major problems. The process might take a while - we also had a few cases where a manual kubelet restart was needed (every other month, due to unplanned disruptions of services that had storage mounted).
In general, the driver was stable for us up until moving to v1.16.3.
How to reproduce it (as minimally and precisely as possible)?
Provision a small cluster with kubernetes v1.15.x (patchlevel 3 in our case - we haven't tried newer 1.15.x releases so far) - we use latest stable cilium-1.6.3 (CRD backed solution) as CNI.
Bootstrap the cluster with an InitConfiguration object that declares cloud-provider: external in kubeletExtraArgs: via the nodeRegistration: mechanism. Each worker and control-plane node gets the same cloud-provider: external declaration via the respective JoinConfiguration object (a sketch of such a config follows below). Spawn a daemon-set for the AWS cloud-controller-manager and register the aws-ebs-csi-driver with the cluster.
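A minimal sketch of the kubeadm configuration just described, assuming the kubeadm.k8s.io/v1beta2 config API of these releases; the API server endpoint and token are placeholders:

```yaml
# Sketch: passing cloud-provider=external to the kubelet via kubeadm's
# nodeRegistration block, for the initial control-plane node and for joining nodes.
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
discovery:
  bootstrapToken:
    apiServerEndpoint: "10.0.0.10:6443"   # placeholder endpoint
    token: "abcdef.0123456789abcdef"      # placeholder token
    unsafeSkipCAVerification: true        # for brevity in this sketch only
```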
Create a storage class and statically (pre-)provision a volume (we created EBS volumes from snapshots, for example). Make sure the volume object gets registered within the cluster.
Declare a deployment and a PVC to consume the storage created in the last step. You should now have a deployment that uses the PV, bound to it via the claiming PVC object (see the sketch after this paragraph). All this is running on v1.15.x worker nodes.
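A minimal sketch of such a statically provisioned PV/PVC pair for the EBS CSI driver; the names, size, storage class, and volume ID are placeholders rather than the objects actually used in this cluster:

```yaml
# Illustrative static provisioning: a pre-created EBS volume is registered as a PV
# handled by ebs.csi.aws.com, and a PVC binds to it by name and storage class.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: aws-csi-static-pv              # hypothetical name
spec:
  capacity:
    storage: 10Gi                      # must match the size of the pre-created volume
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ebs-csi-static
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0  # placeholder EBS volume ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aws-csi-static-claim           # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-csi-static
  volumeName: aws-csi-static-pv
  resources:
    requests:
      storage: 10Gi
```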
Watch the description of the pod (created by the storage-consuming deployment / replica-set) to see the pre-provisioned volume get attached and mounted successfully into the pod.
Now add a v1.16.3 worker to the cluster. Watch the kube-proxy / CNI / aws-ebs-csi daemonsets spawn pods on the node. After these are running, stateless deployments get scheduled onto the node and their workloads run without the slightest problem.
Things start to fall apart when we schedule a deployment which wants to consume storage via aws.ebs.csi.driver. While the pod gets scheduled onto the v1.16.x node and enters the ContainerCreating state, it gets stuck and sits there forever, complaining about not being able to mount the volume into the pod - as can be seen here:
Anything else we need to know?:
The current v1.15.3 cluster runs on plain EC2 instances using debian stretch.
The future v1.16.x cluster should run on plain EC2 instances using debian buster.
v1.16.x nodes run fine in another testing cluster where we have already switched to an all-buster setup - once you work around the rough edges of the OS (like iptables-legacy) it's just debian, and we are very confident that the problems are not related to the different OS versions.
We also noticed that only k8s.gcr.io/cloud-controller-manager:v1.15.x is provided - there seems to be no v1.16.x for cloud-controller-manager. Here we just stuck with the v1.15.3 version, as the new v1.16.3 node was registered with the CCM without problems upon joining the cluster and we did not notice any odd behaviour.
Environment
About 15-20 (medium to large sized) AWS EC2 instances running debian stretch / buster, using the distro-specific latest backports kernel.
On top of that we used kubeadm to bootstrap and manage the v1.15.3 cluster. CNI (i.e. networking) is implemented by the latest stable cilium 1.6.3 (running without a k/v store).
cloud-provider-aws (aka. the external cloud-controller-manager) is running at v1.15.3 - the aws-ebs-csi-driver is v0.4.0 - I just updated our cluster according to the docs here.
The helm chart (https://github.com/kubernetes-sigs/aws-ebs-csi-driver/releases/download/v0.4.0/helm-chart.tgz) was used to generate all the individual objects, as I had to delete some objects that would not allow changing immutable fields (i.e. the DaemonSet and StatefulSet objects) when trying to apply the manifest directly.
Kubernetes version (use kubectl version):