--delete-access-point-root-dir=true might delete wrong dir #1085
Comments
I was able to reproduce the behaviour with debug logs enabled and attached the results of the logcollector as described here:
What I did this time:
controller:
  deleteAccessPointRootDir: true
  logLevel: 5
node:
  logLevel: 5
This is our Grafana dashboard with EFS Storage Bytes: (please note: The timestamps above are UTC, the ones in the Grafana screenshot are UTC+2)
What I can see in the Loki logs is that it starts to delete the PVC folders and then fails after 10s with
FTR: I was only able to reproduce this with |
Thanks for bringing this to our attention. We'll investigate this immediately. |
Using the latest Helm Chart release with our v1.5.5 image, I haven't been able to recreate this issue. Could you try reproducing the issue with application Pods that don't run your workload? For example, an Amazon Linux image with no arguments. This will help us figure out whether it's the workload that is deleting the files. If we're unable to reproduce the issue, then I recommend opening a technical support case with AWS. That will allow us to have a private channel to investigate further. |
I know that I won't convince you just by saying so, but I assure you that it's not the workload. We've been running this for years for many, many customers in very big production environments. This never happened before, and I am damn sure that the application does not delete its root ;-) We decided not to use this feature anymore and to clean up by hand, which is kinda disturbing, but, to be honest with you, we plan to change the storage provider anyway and we will not reproduce it again in our environment. Anyway, thx for checking it out. |
Thank you for reporting this issue to us. Our team has diligently tried to recreate the scenario as you described, but we have not been successful in replicating the problem in our own environment. Additionally, after reviewing the code, it appears that the deletion process should not result in the removal of all contents. Given that we cannot reproduce the issue, our ability to provide further assistance is restricted. Therefore, we will be closing this issue at this time. |
@617m4rc, to confirm this: is the base path specified in your Storage Class empty? |
In my case, the behavior was the same, and the most common error from the application was |
/kind bug
What happened?
We deployed aws-efs-csi-driver with --delete-access-point-root-dir=true. At some point all contents of the EFS filesystem got deleted. No /pvc-.... folders were left. The monitoring looks like this:
Note the drop of storage bytes at 16:30.
What you expected to happen?
Don't delete all EFS AccessPoints/RootDirs.
How to reproduce it (as minimally and precisely as possible)?
I'm still trying to reproduce this behavior. Any hints what might have caused this are appreciated.
Anything else we need to know?:
According to the logs, aws-efs-csi-driver tried to delete an existing PVC pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a minutes before the incident, but failed at first:
The logs stop after that. Five minutes later, the EFS filesystem is empty.
I also found a mount error in the logs some hours earlier for the corresponding Access Point:
AFAICS "Could not mount ..." is the error message in the DeleteVolume function in controller.go if the mount fails during the "//Mount File System at it root and delete access point root directory"-logic.
Is it possible that prior to this, not /pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a got deleted, but / ? E.g. can AccessPointRootDir be empty, or sth similar?
16:30 UTC is outside of our core working hours and aws-efs-csi-driver is the only tool that operates with delete permissions on the EFS filesystem.
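To illustrate the question about an empty AccessPointRootDir: Go's filepath.Join drops empty elements and cleans the result, so joining a mount point with "" or "/" yields the mount point itself. I have not verified that the driver builds its delete path this way. The sketch below only shows why such a value would be dangerous if it did, and what a guard could look like. The helper name safeDeleteTarget and the mount point /mnt/efs are hypothetical and do not come from the driver.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// safeDeleteTarget is a hypothetical helper, NOT the driver's actual code.
// It joins the mount point with the access point root directory and refuses
// any result that is not strictly below the mount point.
func safeDeleteTarget(mountPoint, accessPointRootDir string) (string, error) {
	mount := filepath.Clean(mountPoint)
	// Join ignores empty elements, so "" (and "/") resolve to the mount point.
	target := filepath.Join(mount, accessPointRootDir)

	rel, err := filepath.Rel(mount, target)
	if err != nil {
		return "", err
	}
	// "." means target == mount point; a leading ".." means it escaped upwards.
	escaped := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
	if rel == "." || escaped {
		return "", errors.New("refusing to delete: target is not below the mount point")
	}
	return target, nil
}

func main() {
	// Assumed example inputs; only the PVC name is taken from the report.
	dirs := []string{"/pvc-957369e3-1128-4e25-a6d5-e6b1e7a6299a", "", "/", "../other"}
	for _, dir := range dirs {
		target, err := safeDeleteTarget("/mnt/efs", dir)
		fmt.Printf("rootDir=%q target=%q err=%v\n", dir, target, err)
	}
}
```

With a guard like this, only the first input would yield a path to delete. The empty string, "/" and "../other" are all rejected before anything like os.RemoveAll could be called on them.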
Environment
kubectl version: Server Version: v1.23.17-eks-a5565ad
Please also attach debug logs to help us better diagnose
TBD