
--delete-access-point-root-dir=true might delete wrong dir #1085

Open
MiGrandjean opened this issue Aug 1, 2023 · 9 comments
Labels
kind/bug

Comments

@MiGrandjean

/kind bug

What happened?

We deployed aws-efs-csi-driver with --delete-access-point-root-dir=true. At some point all contents of the EFS filesystem got deleted. No /pvc-.... folders were left. The monitoring looks like this:

(screenshot: EFS storage bytes metric)

Note the drop of storage bytes at 16:30.

What you expected to happen?

The driver should only delete the access point root directory belonging to the deleted volume, not all EFS Access Points / root directories.

How to reproduce it (as minimally and precisely as possible)?

I'm still trying to reproduce this behavior. Any hints about what might have caused this are appreciated.

Anything else we need to know?:

According to the logs, aws-efs-csi-driver tried to delete an existing PVC pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a minutes before the incident, but failed at first:

{"log":"I0726 16:25:04.800727       1 controller.go:1471] delete \"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\": started\n","stream":"stderr","time":"2023-07-26T16:25:04.800967025Z"}
{"log":"I0726 16:25:04.800792       1 controller.go:1279] volume pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a does not need any deletion secrets\n","stream":"stderr","time":"2023-07-26T16:25:04.801012807Z"}
{"log":"W0726 16:25:14.801335       1 controller.go:989] Retrying syncing volume \"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\", failure 514\n","stream":"stderr","time":"2023-07-26T16:25:14.801483546Z"}
{"log":"E0726 16:25:14.801598       1 controller.go:1007] error syncing volume \"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\": rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-07-26T16:25:14.802019594Z"}
{"log":"I0726 16:25:14.801625       1 event.go:285] Event(v1.ObjectReference{Kind:\"PersistentVolume\", Namespace:\"\", Name:\"pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a\", UID:\"ccc26272-7ea2-4ab9-beb0-e1445e8cea1f\", APIVersion:\"v1\", ResourceVersion:\"41843626\", FieldPath:\"\"}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-07-26T16:25:14.802030534Z"}

The logs stop after that. Five minutes later, the EFS filesystem is empty.

I also found a mount error in the logs some hours earlier for the corresponding Access Point:

{"log":"E0725 09:46:14.712594       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not mount \"fs-0188ca211fd8cc57a\" at \"/var/lib/csi/pv/fsap-0e4937bb9f159e1ec\": mount failed: exit status 32\n","stream":"stderr","time":"2023-07-26T09:46:14.712724126Z"}

AFAICS, "Could not mount ..." is the error message in the DeleteVolume function in controller.go when the mount fails during the "//Mount File System at it root and delete access point root directory" logic.
Is it possible that prior to this, not /pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a was deleted, but /? For example, can AccessPointRootDir be empty, or something similar?
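
To make the question concrete, here is a minimal illustration of the path handling I am worried about (hypothetical code, not the driver's implementation - it only shows the arithmetic):

// Hypothetical illustration only, not the driver's code: it shows how a delete path
// built as "mount target + access point root dir" collapses to the mount point
// (i.e. the root of the whole EFS filesystem) when the root dir is "" or "/".
package main

import (
	"fmt"
	"path"
)

func main() {
	target := "/var/lib/csi/pv/fsap-0e4937bb9f159e1ec" // EFS mounted at its root
	for _, rootDir := range []string{"/pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a", "", "/"} {
		deletePath := path.Clean(target + rootDir)
		fmt.Printf("AccessPointRootDir=%q -> would delete %q\n", rootDir, deletePath)
	}
	// For "" and "/" the delete path is the mount point itself, so removing it
	// recursively would wipe everything on the filesystem.
}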

16:30 UTC is outside of our core working hours and aws-efs-csi-driver is the only tool that operates with delete permissions on the EFS filesystem.

Environment

  • Kubernetes version (use kubectl version): Server Version: v1.23.17-eks-a5565ad
  • Driver version: v1.5.5

Please also attach debug logs to help us better diagnose

TBD

@k8s-ci-robot added the kind/bug label Aug 1, 2023
@MiGrandjean
Author

MiGrandjean commented Aug 3, 2023

I was able to reproduce the behaviour with debug logs enabled and have attached the results of the log collector as described here:

results.zip

What I did this time:

  1. I deployed aws-efs-csi-driver via the Helm Chart (2.4.3) with these additional values:
controller:
  deleteAccessPointRootDir: true
  logLevel: 5
node:
  logLevel: 5
  2. I enabled stunnel debug in the efs-plugin container on my single node:
sed -i '/stunnel_debug_enabled = false/s//stunnel_debug_enabled = true/g' /etc/amazon/efs/efs-utils.conf
sed -i '/logging_level = INFO/s//logging_level = DEBUG/g' /etc/amazon/efs/efs-utils.conf
  3. I added some dummy files to the EFS filesystem (using dd) at the top level:
[root@ip-10-1-129-238 ~]# ls -ltr /mnt/efs/
drwx------ 3 rngd input      6144 Aug  2 14:01 pvc-a8f802cb-2bf8-452b-a06d-0835781472eb
drwx------ 3 rngd input      6144 Aug  2 14:27 pvc-a83e4189-5b6d-43b1-855e-c57807220d02
-rw-r--r-- 1 root root  536870912 Aug  3 10:15 dummy_file
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file2
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file3
drwxr-xr-x 2 root root       6144 Aug  3 10:18 manual_dir
  4. I then deployed 10 instances of our application, each using one PVC. The EFS then looked like this:
[root@ip-10-1-129-238 ~]# ls -ltr /mnt/efs/
total 64
drwx------ 3 rngd input      6144 Aug  2 14:01 pvc-a8f802cb-2bf8-452b-a06d-0835781472eb
drwx------ 3 rngd input      6144 Aug  2 14:27 pvc-a83e4189-5b6d-43b1-855e-c57807220d02
-rw-r--r-- 1 root root  536870912 Aug  3 10:15 dummy_file
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file2
-rw-r--r-- 1 root root  536870912 Aug  3 10:16 dummy_file3
drwxr-xr-x 2 root root       6144 Aug  3 10:18 manual_dir
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-5496e45b-7e0c-4e1d-8cfe-22d2d91ec010
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-31b88ebb-aada-4fc5-9664-0528f188c12a
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-17b576c3-fb57-4227-bbd6-896a74831c3c
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-5584947b-06be-49af-9041-aa52a576d1fd
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-2add4d54-2f81-4c79-8b74-04572d7956e6
drwx------ 3 rngd input      6144 Aug  3 11:45 pvc-69ab4ca4-65b4-4902-9b97-9f7a9a3f3e3a
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-9a5abe17-bf4c-4407-aee3-36c168f52269
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-061e2c50-1561-4be5-9932-6267fc84760e
drwx------ 3 rngd input      6144 Aug  3 11:46 pvc-1293730a-a206-426a-b396-8619b43c1c58
  5. After the deployment was successful, I uninstalled all 10 instances. After a few minutes, the EFS was empty. Even the manually added files were gone:
[root@ip-10-1-129-238 ~]# ls -lah /mnt/efs/
total 4.0K
drwxr-xr-x 2 root root 6.0K Aug  3 12:25 .
drwxr-xr-x 3 root root   17 Aug  3 10:13 ..

This is our Grafana dashboard with EFS Storage Bytes:

(screenshot: Grafana dashboard showing EFS storage bytes)

(please note: The timestamps above are UTC, the ones in the Grafana screenshot are UTC+2)

  6. We collect container logs with Promtail/Loki, and I can see these logs coming from the csi-provisioner container.

What I can see in the Loki logs is that it starts to delete the PVC folders and then fails after 10s with volume deletion failed: rpc error: code = DeadlineExceeded

{"log":"I0803 12:24:35.819681       1 controller.go:1471] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": started\n","stream":"stderr","time":"2023-08-03T12:24:35.819838196Z"}
{"log":"I0803 12:24:36.533382       1 controller.go:1471] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": started\n","stream":"stderr","time":"2023-08-03T12:24:36.533543967Z"}
{"log":"I0803 12:24:36.826410       1 controller.go:1471] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": started\n","stream":"stderr","time":"2023-08-03T12:24:36.833902331Z"}
{"log":"E0803 12:24:45.820805       1 controller.go:1481] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:45.820941694Z"}
{"log":"E0803 12:24:46.534689       1 controller.go:1481] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:46.53484165Z"}
{"log":"I0803 12:24:46.821598       1 controller.go:1471] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": started\n","stream":"stderr","time":"2023-08-03T12:24:46.821876582Z"}
{"log":"E0803 12:24:46.834916       1 controller.go:1481] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:46.834982286Z"}
{"log":"I0803 12:24:47.535599       1 controller.go:1471] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": started\n","stream":"stderr","time":"2023-08-03T12:24:47.535778358Z"}
{"log":"I0803 12:24:47.835858       1 controller.go:1471] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": started\n","stream":"stderr","time":"2023-08-03T12:24:47.836008165Z"}
{"log":"E0803 12:24:54.879658       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-0f4f885c773bf17d5::fsap-0c4ed3cd16c584676: Failed to delete access point: fsap-0c4ed3cd16c584676, error: RequestCanceled: request context canceled\n","stream":"stderr","time":"2023-08-03T12:24:54.880908443Z"}
{"log":"E0803 12:24:56.240292       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-0f4f885c773bf17d5::fsap-0aff425573c27b118: Failed to delete access point: fsap-0aff425573c27b118, error: RequestCanceled: request context canceled\n","stream":"stderr","time":"2023-08-03T12:24:56.240412084Z"}
{"log":"E0803 12:24:56.220653       1 controller.go:1481] delete \"pvc-b8f70134-aed2-4e47-bde4-0736c2b9854f\": volume deletion failed: rpc error: code = Internal desc = Could not unmount \"/var/lib/csi/pv/fsap-0aff425573c27b118\": unmount failed: exit status 32\n","stream":"stderr","time":"2023-08-03T12:24:56.220689937Z"}
{"log":"E0803 12:24:56.822007       1 controller.go:1481] delete \"pvc-5584947b-06be-49af-9041-aa52a576d1fd\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:56.822143818Z"}
{"log":"E0803 12:24:57.535838       1 controller.go:1481] delete \"pvc-2add4d54-2f81-4c79-8b74-04572d7956e6\": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n","stream":"stderr","time":"2023-08-03T12:24:57.536657442Z"}
  7. Another observation: although the EFS filesystem is empty, AWS still shows the Access Points:

(screenshot: AWS console still listing the EFS Access Points)


FTR: I was only able to reproduce this with --delete-access-point-root-dir=true, not with --delete-access-point-root-dir=false, so I suspect it has something to do with this setting.

@RyanStan
Contributor

RyanStan commented Aug 4, 2023

Thanks for bringing this to our attention. We'll investigate this immediately.

@RyanStan
Contributor

RyanStan commented Aug 8, 2023

Using the latest Helm Chart release with our v1.5.5 image, I haven't been able to recreate this issue.

Could you try reproducing the issue with application Pods that don't run your workload? For example, an Amazon Linux image with no arguments. This will help us figure out whether it's the workload that is deleting the files.

If we're unable to reproduce the issue, then I recommend opening a technical support case with AWS. That will allow us to have a private channel to investigate further.

@makomatic
Contributor

> Could you try reproducing the issue with application Pods that don't run your workload? For example, an Amazon Linux image with no arguments.

I know that I won't convince you by just telling you - but I assure you that it's not the workload. We've been running this for years for many, many customers in very big production environments. This has never happened before, and I am damn sure that the application does not delete its own root ;-)

We decided to stop using this feature and to clean up by hand - which is rather inconvenient, but - to be honest with you - we plan to change the storage provider anyway, so we will not reproduce it again in our environment.
We also can't open another case with AWS (we had one when it happened) because we currently have no active problem.

Anyway, thanks for checking it out.

@mskanth972
Contributor

Thank you for reporting this issue to us. Our team has diligently tried to recreate the scenario as you described, but we have not been successful in replicating the problem in our own environment. Additionally, after reviewing the code, it appears that the deletion process should not result in the removal of all contents. Given that we cannot reproduce the issue, our ability to provide further assistance is restricted. Therefore, we will be closing this issue at this time.

@mskanth972 closed this as not planned Dec 5, 2023
@EinavDanielDX

EinavDanielDX commented Sep 11, 2024

We are having a similar issue in our testing cluster. It's most frustrating because it's not consistent.
In our environment there are constant E2E tests creating and deleting namespaces and workloads.
Every once in a while, the deletion of a namespace and its resources (always graceful) results in the deletion of all of the files in EFS, even though the access points are not deleted.

In this screenshot you can see it happened 3 times in the last week; the data is restored manually each time.
(screenshot: EFS storage metrics showing three data-loss events within a week)

@617m4rc

617m4rc commented Nov 22, 2024

We've recently hit this issue in one of our production clusters with EFS CSI driver version v2.0.5. The reproduction steps are exactly as described above. I am surprised that such a massive issue is being ignored. From my point of view, an immediate fix is imperative.

(screenshot: EFS storage metrics showing the data loss)

@mskanth972
Contributor

deletePath := target + accessPoint.AccessPointRootDir

  • If the AccessPointRootDir is / (or an empty string), the resulting deletePath could point to the root of the entire mounted file system (/var/lib/kubelet/pods/<pod-id>/volumes/kubernetes.io~csi/pvc-<id>/mount).

  • In these cases, the operation os.RemoveAll(deletePath) would try to remove everything under the mount directory, which would include all the data within the mounted EFS volume.
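
For illustration, here is a minimal sketch of the kind of guard that would prevent this failure mode (hypothetical, not the driver's current code, and not necessarily what the eventual fix does):

// Hypothetical guard, not the driver's current code: refuse to RemoveAll when the
// computed deletePath resolves back to the mount target itself, which is what
// happens when AccessPointRootDir is "" or "/".
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func deleteAccessPointRootDir(target, accessPointRootDir string) error {
	deletePath := filepath.Clean(target + accessPointRootDir)
	if deletePath == filepath.Clean(target) {
		// Without this check, os.RemoveAll(deletePath) would delete everything
		// under the mounted EFS filesystem.
		return fmt.Errorf("refusing to delete %q: AccessPointRootDir %q resolves to the filesystem root", deletePath, accessPointRootDir)
	}
	return os.RemoveAll(deletePath)
}

func main() {
	// Example with a hypothetical mount target and an empty root dir: the guard
	// rejects the request instead of wiping the mount.
	fmt.Println(deleteAccessPointRootDir("/var/lib/csi/pv/fsap-0c4ed3cd16c584676", ""))
}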

@617m4rc, to confirm this: is the base path specified in your Storage Class empty?

@daro1337

In my case the behavior was the same, and the most common errors from the application were "Stale file handle" and "File not found" (data was being randomly removed from EFS). Some time ago I reported this error to AWS support, and we have been working together on solving the problem. I managed to prepare a k8s manifest that could replicate this error every time. After testing the latest version (v2.1.2), I can confirm that #1532 solves this problem.
