
--delete-access-point-root-dir=true might delete wrong dir #1085

Open
MiGrandjean opened this issue Aug 1, 2023 · 9 comments
Labels
kind/bug

Comments

@MiGrandjean

/kind bug

What happened?

We deployed aws-efs-csi-driver with --delete-access-point-root-dir=true. At some point all contents of the EFS filesystem got deleted. No /pvc-.... folders were left. The monitoring looks like this:

(screenshot: EFS storage bytes metric)

Note the drop of storage bytes at 16:30.

What you expected to happen?

The driver should only delete the access point root directory belonging to the deleted volume, not all EFS Access Points / root directories.

How to reproduce it (as minimally and precisely as possible)?

I'm still trying to reproduce this behavior. Any hints about what might have caused this are appreciated.

Anything else we need to know?:

According to the logs, aws-efs-csi-driver tried to delete an existing PVC pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a minutes before the incident, but failed at first:

{"log":"I0726 16:25:04.800727       1 controller.go:1471] delete \"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\": started\n","stream":"stderr","time":"2023-07-26T16:25:04.800967025Z"}
{"log":"I0726 16:25:04.800792       1 controller.go:1279] volume pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a does not need any deletion secrets\n","stream":"stderr","time":"2023-07-26T16:25:04.801012807Z"}
{"log":"W0726 16:25:14.801335       1 controller.go:989] Retrying syncing volume \"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\", failure 514\n","stream":"stderr","time":"2023-07-26T16:25:14.801483546Z"}
{"log":"E0726 16:25:14.801598       1 controller.go:1007] error syncing volume \"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\": rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-07-26T16:25:14.802019594Z"}
{"log":"I0726 16:25:14.801625       1 event.go:285] Event(v1.ObjectReference{Kind:\"PersistentVolume\", Namespace:\"\", Name:\"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\", UID:\"ccc26272-7ea2-4ab9-beb0-e1445e8cea1f\", APIVersion:\"v1\", ResourceVersion:\"41843626\", FieldPath:\"\"}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-07-26T16:25:14.802030534Z"}

The logs stop after that. Five minutes later, the EFS filesystem is empty.

I also found a mount error in the logs some hours earlier for the corresponding Access Point:

{"log":"E0725 09:46:14.712594       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not mount \"fs-0188ca211fd8cc57a\" at \"/var/lib/csi/pv/fsap-0e4937bb9f159e1ec\": mount failed: exit status 32\n","stream":"stderr","time":"2023-07-26T09:46:14.712724126Z"}

AFAICS, "Could not mount ..." is the error message in the DeleteVolume function in controller.go when the mount fails during the "//Mount File System at it root and delete access point root directory" logic.
Is it possible that prior to this, not /pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a was deleted, but /? For example, can AccessPointRootDir be empty, or something similar?
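
To make the question concrete, here is a minimal illustration of the path handling I am worried about (hypothetical code, not the driver's implementation - it only shows the arithmetic):

// Hypothetical illustration only, not the driver's code: it shows how a delete path
// built as "mount target + access point root dir" collapses to the mount point
// (i.e. the root of the whole EFS filesystem) when the root dir is "" or "/".
package main

import (
	"fmt"
	"path"
)

func main() {
	target := "/var/lib/csi/pv/fsap-0e4937bb9f159e1ec" // EFS mounted at its root
	for _, rootDir := range []string{"/pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a", "", "/"} {
		deletePath := path.Clean(target + rootDir)
		fmt.Printf("AccessPointRootDir=%q -> would delete %q\n", rootDir, deletePath)
	}
	// For "" and "/" the delete path is the mount point itself, so removing it
	// recursively would wipe everything on the filesystem.
}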

16:30 UTC is outside of our core working hours and aws-efs-csi-driver is the only tool that operates with delete permissions on the EFS filesystem.

Environment

  • Kubernetes version (use kubectl version): Server Version: v1.23.17-eks-a5565ad
  • Driver version: v1.5.5

Please also attach debug logs to help us better diagnose

TBD

@k8s-ci-robot added the kind/bug label Aug 1, 2023
@MiGrandjean
Author

MiGrandjean commented Aug 3, 2023

I was able to reproduce the behaviour with debug logs enabled and have attached the results of the log collector as described here:

results.zip

What I did this time:

  1. I deployed aws-efs-csi-driver via the Helm Chart (2.4.3) with these additional values:
controller:
  deleteAccessPointRootDir: true
  logLevel: 5
node:
  logLevel: 5
  2. I enabled stunnel debug in the efs-plugin container on my single node:
sed -i '/stunnel_debug_enabled = false/s//stunnel_debug_enabled = true/g' /etc/amazon/efs/efs-utils.conf
sed -i '/logging_level = INFO/s//logging_level = DEBUG/g' /etc/amazon/efs/efs-utils.conf
  3. I added some dummy files to the EFS filesystem (using dd) at the top level:
[root@ip-10-1-129-238 ~]# ls -ltr /mnt/efs/
drwx------ 3 rngd input      6144 Aug  2 14:01 pvc-a8f802cb-2bf8-452b-a06d-0835781472eb
drwx------ 3 rngd input      6144 Aug  2 14:27 pvc-a83e4189-5b6d-43b1-855e-c57807220d02
-rw-r--r-- 1 root root  536870912 Aug  3 10:15 dummy_file
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file2
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file3
drwxr-xr-x 2 root root       6144 Aug  3 10:18 manual_dir
  4. I then deployed 10 instances of our application, each using one PVC. The EFS then looked like this:
[root@ip-10-1-129-238 ~]# ls -ltr /mnt/efs/
total 64
drwx------ 3 rngd input      6144 Aug  2 14:01 pvc-a8f802cb-2bf8-452b-a06d-0835781472eb
drwx------ 3 rngd input      6144 Aug  2 14:27 pvc-a83e4189-5b6d-43b1-855e-c57807220d02
-rw-r--r-- 1 root root  536870912 Aug  3 10:15 dummy_file
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file2
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file3
drwxr-xr-x 2 root root       6144 Aug  3 10:18 manual_dir
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-5496e45b-7e0c-4e1d-8cfe-22d2d91ec010
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-31b88ebb-aada-4fc5-9664-0528f188c12a
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-17b576c3-fb57-4227-bbd6-896a74831c3c
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-5584947b-06be-49af-9041-aa52a576d1fd
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-2add4d54-2f81-4c79-8b74-04572d7956e6
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-69ab4ca4-65b4-4902-9b97-9f7a9a3f3e3a
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-9a5abe17-bf4c-4407-aee3-36c168f52269
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-061e2c50-1561-4be5-9932-6267fc84760e
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-1293730a-a206-426a-b396-8619b43c1c58
  5. After the deployment was successful, I uninstalled all 10 instances. After a few minutes, the EFS was empty. Even the manually added files were gone:
[root@ip-10-1-129-238 ~]# ls -lah /mnt/efs/
total 4.0K
drwxr-xr-x 2 root root 6.0K Aug  3 12:25 .
drwxr-xr-x 3 root root   17 Aug  3 10:13 ..

This is our Grafana dashboard with EFS Storage Bytes:

(screenshot: Grafana dashboard showing EFS storage bytes)

(please note: The timestamps above are UTC, the ones in the Grafana screenshot are UTC+2)

  6. We collect container logs with Promtail/Loki, and I can see these logs coming from the csi-provisioner container.

What I can see in the Loki logs is that it starts to delete the PVC folders and then fails after 10s with volume deletion failed: rpc error: code = DeadlineExceeded

{"log":"I0803 12:24:35.819681       1 controller.go:1471] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": started\n","stream":"stderr","time":"2023-08-03T12:24:35.819838196Z"}
{"log":"I0803 12:24:36.533382       1 controller.go:1471] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": started\n","stream":"stderr","time":"2023-08-03T12:24:36.533543967Z"}
{"log":"I0803 12:24:36.826410       1 controller.go:1471] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": started\n","stream":"stderr","time":"2023-08-03T12:24:36.833902331Z"}
{"log":"E0803 12:24:45.820805       1 controller.go:1481] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:45.820941694Z"}
{"log":"E0803 12:24:46.534689       1 controller.go:1481] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:46.53484165Z"}
{"log":"I0803 12:24:46.821598       1 controller.go:1471] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": started\n","stream":"stderr","time":"2023-08-03T12:24:46.821876582Z"}
{"log":"E0803 12:24:46.834916       1 controller.go:1481] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:46.834982286Z"}
{"log":"I0803 12:24:47.535599       1 controller.go:1471] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": started\n","stream":"stderr","time":"2023-08-03T12:24:47.535778358Z"}
{"log":"I0803 12:24:47.835858       1 controller.go:1471] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": started\n","stream":"stderr","time":"2023-08-03T12:24:47.836008165Z"}
{"log":"E0803 12:24:54.879658       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-0f4f885c773bf17d5::fsap-0c4ed3cd16c584676: Failed to delete access point: fsap-0c4ed3cd16c584676, error: RequestCanceled: request context canceled\n","stream":"stderr","time":"2023-08-03T12:24:54.880908443Z"}
{"log":"E0803 12:24:56.240292       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-0f4f885c773bf17d5::fsap-0aff425573c27b118: Failed to delete access point: fsap-0aff425573c27b118, error: RequestCanceled: request context canceled\n","stream":"stderr","time":"2023-08-03T12:24:56.240412084Z"}
{"log":"E0803 12:24:56.220653       1 controller.go:1481] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": volume deletion failed: rpc error: code = Internal desc = Could not unmount \"/var/lib/csi/pv/fsap-0aff425573c27b118\": unmount failed: exit status 32\n","stream":"stderr","time":"2023-08-03T12:24:56.220689937Z"}
{"log":"E0803 12:24:56.822007       1 controller.go:1481] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:56.822143818Z"}
{"log":"E0803 12:24:57.535838       1 controller.go:1481] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:57.536657442Z"}
  7. Another observation: although the EFS filesystem is empty, AWS still shows the Access Points:

(screenshot: AWS console still listing the EFS Access Points)


FTR: I was only able to reproduce this with --delete-access-point-root-dir=true, not with --delete-access-point-root-dir=false, so I suspect it has something to do with this setting.

@RyanStan
Contributor

RyanStan commented Aug 4, 2023

Thanks for bringing this to our attention. We'll investigate this immediately.

@RyanStan
Contributor

RyanStan commented Aug 8, 2023

Using the latest Helm Chart release with our v1.5.5 image, I haven't been able to recreate this issue.

Could you try reproducing the issue with application Pods that don't run your workload? For example, an Amazon Linux image with no arguments. This will help us figure out whether it's the workload that is deleting the files.

If we're unable to reproduce the issue, then I recommend opening a technical support case with AWS. That will allow us to have a private channel to investigate further.

@makomatic
Contributor

> Could you try reproducing the issue with application Pods that don't run your workload? For example, an Amazon Linux image with no arguments.

I know that I won't convince you by just telling you - but I assure you that it's not the workload. We've been running this for years for many, many customers in very big production environments. This has never happened before, and I am damn sure that the application does not delete its own root ;-)

We decided to stop using this feature and to clean up by hand - which is rather inconvenient, but - to be honest with you - we plan to change the storage provider anyway, so we will not reproduce it again in our environment.
We also can't open another case with AWS (we had one when it happened) because we currently have no active problem.

Anyway, thanks for checking it out.

@mskanth972
Contributor

Thank you for reporting this issue to us. Our team has diligently tried to recreate the scenario as you described, but we have not been successful in replicating the problem in our own environment. Additionally, after reviewing the code, it appears that the deletion process should not result in the removal of all contents. Given that we cannot reproduce the issue, our ability to provide further assistance is restricted. Therefore, we will be closing this issue at this time.

@mskanth972 closed this as not planned Dec 5, 2023
@EinavDanielDX

EinavDanielDX commented Sep 11, 2024

We are having a similar issue in our testing cluster. It's most frustrating because it's not consistent.
In our environment there are constant E2E tests creating and deleting namespaces and workloads.
Every once in a while, the deletion of a namespace and its resources (always graceful) results in the deletion of all of the files in EFS, even though the access points are not deleted.

In this screenshot you can see it happened 3 times in the last week; the data is restored manually each time.
(screenshot: EFS storage metrics showing three data-loss events within a week)

@617m4rc

617m4rc commented Nov 22, 2024

We've recently hit this issue in one of our production clusters with EFS CSI driver version v2.0.5. The reproduction steps are exactly as described above. I am surprised that such a massive issue is being ignored. From my point of view, an immediate fix is imperative.

(screenshot: EFS storage metrics showing the data loss)

@mskanth972
Contributor

deletePath := target + accessPoint.AccessPointRootDir

  • If the AccessPointRootDir is / (or an empty string), the resulting deletePath could point to the root of the entire mounted file system (/var/lib/kubelet/pods/<pod-id>/volumes/kubernetes.io~csi/pvc-<id>/mount).

  • In these cases, the operation os.RemoveAll(deletePath) would try to remove everything under the mount directory, which would include all the data within the mounted EFS volume.
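
For illustration, here is a minimal sketch of the kind of guard that would prevent this failure mode (hypothetical, not the driver's current code, and not necessarily what the eventual fix does):

// Hypothetical guard, not the driver's current code: refuse to RemoveAll when the
// computed deletePath resolves back to the mount target itself, which is what
// happens when AccessPointRootDir is "" or "/".
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func deleteAccessPointRootDir(target, accessPointRootDir string) error {
	deletePath := filepath.Clean(target + accessPointRootDir)
	if deletePath == filepath.Clean(target) {
		// Without this check, os.RemoveAll(deletePath) would delete everything
		// under the mounted EFS filesystem.
		return fmt.Errorf("refusing to delete %q: AccessPointRootDir %q resolves to the filesystem root", deletePath, accessPointRootDir)
	}
	return os.RemoveAll(deletePath)
}

func main() {
	// Example with a hypothetical mount target and an empty root dir: the guard
	// rejects the request instead of wiping the mount.
	fmt.Println(deleteAccessPointRootDir("/var/lib/csi/pv/fsap-0c4ed3cd16c584676", ""))
}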

@617m4rc, to confirm this: is the base path specified in your Storage Class empty?

@daro1337

In my case the behavior was the same, and the most common errors from the application were "Stale file handle" and "File not found" (data was being randomly removed from EFS). Some time ago I reported this error to AWS support, and we have been working together on solving the problem. I managed to prepare a k8s manifest that could replicate this error every time. After testing the latest version (v2.1.2), I can confirm that #1532 solves this problem.
