Upgrade from 1.5.9 to 1.6.0 breaks the EFS #1111
Comments
This is similar to an issue we are seeing, so I'll add some additional context. We have a Cilium network policy that allows the controller egress access to AWS but not to IMDS. The CSI node pods do not have egress access to anything. The controller is using IRSA. The controller logs indicate that it will use Kubernetes for metadata, but when trying to provision or delete a PV it reaches out to IMDS and times out.
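For context, a minimal sketch of the kind of policy being described here (the policy name, namespace, and pod labels are assumptions, not the actual policy from this cluster): it allows the controller egress to anywhere except the IMDS address.

```sh
# Hypothetical sketch of the policy shape described above; names and labels are invented.
kubectl apply -f - <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: efs-csi-controller-egress       # hypothetical name
  namespace: kube-system
spec:
  endpointSelector:
    matchLabels:
      app: efs-csi-controller           # assumed controller pod label
  egress:
    - toCIDRSet:
        - cidr: 0.0.0.0/0
          except:
            - 169.254.169.254/32        # exclude the IMDS endpoint
EOF
```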
@david-a-morgan and others experiencing the issue: how are you installing the driver? Through the Helm chart?

This recent commit removed hostNetwork from the manifests. A while ago, this commit was merged, which allows us to pull EC2 info from Kubernetes instead of IMDS when IMDS is unavailable. However, it requires something that is not currently set up by default, so I'll open a PR to add it in. This also brings up two additional points; I'll open issues on the project to track those two items.
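For anyone verifying that the Kubernetes-sourced metadata has what the driver needs, the region and zone are visible on the Node object itself (the node name below is a placeholder):

```sh
# The provider ID encodes the AZ and instance ID, e.g. aws:///us-east-1a/i-0123456789abcdef0
kubectl get node <node-name> -o jsonpath='{.spec.providerID}{"\n"}'

# Region and zone are also exposed as the well-known topology labels
kubectl get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone
```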
We install the driver using Helm, and I did notice the new changes in the chart. Here are more details as to what we are experiencing: […]

In both cases, Hubble shows repeated attempts to egress to IMDS even when the controller is using Kubernetes metadata.
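For reference, a Hubble query along these lines can confirm those IMDS attempts (the pod selector is an assumption about how the controller pod is named):

```sh
# Follow flows from the controller pod that are destined for the IMDS address
hubble observe --follow --to-ip 169.254.169.254 --pod kube-system/efs-csi-controller
```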
/reopen

@RyanStan: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@Ashley-wenyizha: Reopened this issue. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
OK, I think I figured out the issue. The CSI driver uses the region it pulls from Kubernetes metadata to build a client for the EFS API. This is working as expected. However, the utility that the CSI driver uses under the hood to perform mounts to EFS, efs-utils, requires IMDS to find the region, which it then uses to construct the DNS name of the mount target. The reason I didn't run into this issue when initially trying to recreate it is that my efs-utils configuration file had been hardcoded with the correct region, so IMDS was not needed.

The immediate solution here is to add the region to the efs-utils configuration file. As for the long-term solution: […] We will also need to update our testing infra to test against an IMDS-disabled cluster.
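As a stop-gap along those lines, the region can be hard-coded in the efs-utils configuration inside each node pod. This is only a sketch: the pod and container names come from the default manifests, the config path and the `[mount]` section key are assumed from efs-utils defaults (verify against your installed version), and the change does not survive a pod restart.

```sh
# Check whether a region is already set in the efs-utils config used by a node pod
kubectl -n kube-system exec <efs-csi-node-pod> -c efs-plugin -- \
  grep -n region /etc/amazon/efs/efs-utils.conf

# Hard-code the region under [mount] so mount.efs can build the
# fs-XXXXXXXX.efs.<region>.amazonaws.com mount-target DNS name without IMDS
kubectl -n kube-system exec <efs-csi-node-pod> -c efs-plugin -- \
  sh -c 'grep -q "^region" /etc/amazon/efs/efs-utils.conf || \
         sed -i "/^\[mount\]/a region = us-east-1" /etc/amazon/efs/efs-utils.conf'
```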
@david-a-morgan and others who experienced this issue: were you performing a cross-region mount? Also, were you using IRSA with your node DaemonSet pods (e.g. annotating them with an IAM role)? I assume the answer to this second question is no, because our current documentation doesn't list this as a requirement, but this will need to change. I was looking into this a bit more, and I found that the watchdog process should overwrite the region in the efs-utils configuration with the […]
It seems that the Region value is not wired into the config template.
I am still facing this issue; the details are below. I am using Helm to install the driver: […]

Error: Output: Error retrieving region. Please set the "region" parameter in the efs-utils configuration file.
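Until a release with the fix is available, one low-risk workaround (assuming the driver was installed as a Helm release named aws-efs-csi-driver in kube-system) is rolling back to the revision that ran 1.5.9:

```sh
# Find the last Helm revision that deployed 1.5.9, then roll back to it
helm history aws-efs-csi-driver -n kube-system
helm rollback aws-efs-csi-driver <REVISION> -n kube-system
```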
/kind bug
What happened?
After upgrading from 1.5.9 to 1.6.0, we started getting errors:
Output: Error retrieving region. Please set the "region" parameter in the efs-utils configuration file.
What you expected to happen?
EFS should get mounted
How to reproduce it (as minimally and precisely as possible)?
Upgrade the EFS CSI driver from 1.5.9 to 1.6.0.
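For example, assuming the driver was installed from the upstream chart into kube-system (release and repo names are assumptions), the upgrade and the resulting mount failure can be reproduced roughly like this:

```sh
# Upgrade the existing release to the chart that ships driver 1.6.0
helm repo update
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver -n kube-system

# Any pod mounting an EFS-backed PVC should then surface the region error in its events
kubectl describe pod <pod-using-efs-pvc> | grep -A3 FailedMount
```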
Anything else we need to know?:
I did verify the IAM policy; it does include "ec2:DescribeAvailabilityZones".
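One way to double-check that the permission is actually effective for the role the driver uses (the role ARN below is a placeholder):

```sh
# Simulate the role the driver assumes to confirm the permission evaluates to "allowed"
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/efs-csi-controller-role \
  --action-names ec2:DescribeAvailabilityZones \
  --query 'EvaluationResults[].{action:EvalActionName,decision:EvalDecision}'
```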
On a side note, we use Cilium, and we did see that hostNetwork was removed in the 1.6.0 Helm chart deployment.
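A quick way to see whether the node pods lost host networking after the upgrade (the label selector is an assumption about the chart's default labels):

```sh
# Show whether each node daemonset pod is running with host networking
kubectl get pods -n kube-system -l app=efs-csi-node \
  -o custom-columns=NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork
```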
Environment
Kubernetes version (use kubectl version): 1.24