CSI: internal plugin errors aren't exposed to operator #7424
Comments
The work done in #7549 for #6863 shows the places where we could hook in better error handling. In the CSI RPCs that can be retried, we only do so for timeout,
The intention of the
An example where we could be doing better without having to radically rework how we interface with the RPCs is #7931 (comment), where we can't communicate with the plugin because of file permissions on the socket.
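As a rough illustration of the socket-permissions case, the sketch below classifies a connection failure into a message an operator can act on. It uses only the Go standard library; `describeDialError`, `permissionDeniedError`, and the socket path are hypothetical names made up for this example and are not taken from the Nomad code base.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net"
	"os"
	"syscall"
)

// describeDialError is a hypothetical helper: it turns a failure to connect
// to a plugin's Unix socket into a message that tells the operator what to check.
func describeDialError(socketPath string, err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, fs.ErrPermission):
		// EACCES and EPERM both match fs.ErrPermission.
		return fmt.Sprintf("cannot connect to plugin socket %s: permission denied; check the ownership and mode of the socket file", socketPath)
	case errors.Is(err, fs.ErrNotExist):
		return fmt.Sprintf("plugin socket %s does not exist; is the plugin running?", socketPath)
	default:
		return fmt.Sprintf("cannot connect to plugin socket %s: %v", socketPath, err)
	}
}

// permissionDeniedError builds an error shaped like the one a Unix-socket
// dial returns on EACCES, so the example runs without a real socket.
func permissionDeniedError() error {
	return &net.OpError{
		Op:  "dial",
		Net: "unix",
		Err: os.NewSyscallError("connect", syscall.EACCES),
	}
}

func main() {
	// The path is an assumption chosen for illustration only.
	fmt.Println(describeDialError("/tmp/example/csi.sock", permissionDeniedError()))
}
```

Checking with `errors.Is` rather than matching on the error string keeps the classification working even when the error is wrapped in several layers.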
I spent the morning digging through the hooks we can get via gRPC and actually there's nothing else we can do here with the messages we're getting back. What we can do is to make sure that the server RPCs are wrapping messages we get back from the clients nicely so that the CLI user gets better feedback when it's available. And we can give some direction to check the allocation logs. Also note that since we opened this issue we implemented #7547, which threads some of these client-side messages up through the node events and that actually makes this kind of issue quite a bit better:
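A minimal sketch of what "wrapping messages nicely" could look like, assuming a server-side helper that adds the RPC name, the plugin ID, and a pointer to the allocation logs. `wrapClientError`, `ErrPluginInternal`, and the plugin ID are hypothetical and only illustrate the idea; this is not the change made in the linked PRs.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPluginInternal is a hypothetical sentinel standing in for an opaque
// error that a plugin reports without further detail.
var ErrPluginInternal = errors.New("plugin returned an internal error")

// wrapClientError adds the RPC name and plugin ID to an error that came back
// from a client, and points the operator at the plugin's allocation logs.
// The original error stays reachable through errors.Is / errors.As.
func wrapClientError(rpc, pluginID string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%s failed for plugin %q: %w (check the plugin allocation logs for details)",
		rpc, pluginID, err)
}

func main() {
	// "example-plugin" is a made-up plugin ID for illustration.
	err := wrapClientError("ControllerUnpublishVolume", "example-plugin", ErrPluginInternal)
	fmt.Println(err)
	fmt.Println(errors.Is(err, ErrPluginInternal))
}
```

Wrapping with `%w` keeps the underlying error available to callers that want to branch on it, while the CLI user sees which RPC failed and where to look next.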
Will be closed by #7984
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If a plugin has an internal error, we log a gRPC error code but that's the only information that we can get according to the CSI specification. An example I encountered was when I stopped a job but the EC2 IAM instance role did not have DetachVolume permissions; the job stopped but the EBS volume was still attached to the EC2 instance.
When ControllerUnpublishVolume was called, the client logs show the following:
But if we look at the controller plugin's alloc logs they show the real problem:
(logs redacted and line-broken for readability)
There are two things to fix here: