CSI: Migrating Allocation w/ CSI Volume to New Node Fails #8080
Hi @kainoaseto! I'm looking into this! Possibly related: #8232
After doing a little bit more testing, it looks like the other failure mode we can see here is that the …
@kainoaseto I'm investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.
Hi @tgross, thank you for looking into this! Ah, that's interesting; I believe I've run into that one at some point, with mixed results on reproducing it consistently. I'm interested to see how that one will be resolved. I'm guessing some kind of synchronous call, or waiting for all of the volumes to be unpublished before the drain completes, would need to happen.

Thank you for the workaround! I will test that as a possible solution in the meantime. One problem I might still run into is that the Nomad clients used for this cluster are spot instances (provisioned through spot.io/Spotinst). We get about a 5-minute drain time before the instance gets terminated, but I believe that forces system jobs to drain off as well, which is another way I've been able to reproduce this issue. I'll see if it's possible to modify the drain to ignore system jobs for spot to help facilitate this fix, but I'm not sure if that's possible.

I've tried running multiple controllers, but if one of them gets shut down (e.g. the node it was on is drained, so it starts up on a new node) I've run into this bug: #8034. It does look like there's a workaround of restarting the controllers if they get into a bad state.
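Nomad's CLI does expose this via the `-ignore-system` flag on `nomad node drain`; a rough sketch of both drain variants discussed in this thread, with a placeholder node ID:

```sh
# Drain that reproduces the issue: system jobs (including the CSI node
# plugin) are drained along with everything else, on a 3 minute deadline.
nomad node drain -enable -deadline 3m 4c1b7ab4   # placeholder node ID

# Variant that leaves system jobs running during the drain, staying inside
# the ~5 minute window before a spot instance is terminated.
nomad node drain -enable -ignore-system -deadline 5m 4c1b7ab4
```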
There's a status update on this issue here: #8232 (comment)
For the sake of our planning, I'm going to close this issue and we'll continue to track progress in #8232.
Nomad version
Nomad servers and clients are both running this version:
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)
Operating system and Environment details
Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64
EBS CSI Plugin:
Issue
What I've observed is that when a Nomad client is drained along with its system jobs (either the Nomad client is being shut down, or the node is marked ineligible and both system jobs and normal jobs are drained), the CSI volume gets into a bad state.
When the job that owns the CSI volume is being removed from the node, the `ControllerUnpublishVolume` RPC call is executed but is canceled via context. Because this call never completes, the EBS volume is still attached to the old node/instance. When the job is scheduled onto another node, the controller gets a `ControllerPublishVolume` call, but this fails with the `AlreadyExists` error. I looked into this error a little bit, and according to the CSI spec it should "Indicate that a volume corresponding to the specified volume_id has already been published at the node corresponding to the specified node_id but is incompatible with the specified volume_capability or readonly flag."

However, this is actually implemented incorrectly in the CSI EBS plugin. It seems like they should be returning the `FAILED_PRECONDITION` error instead, since under the CSI spec that more accurately reflects that the volume is still attached to a different node. I've filed an issue on the EBS CSI plugin to get their input, since that behavior seems to be out of spec. Here is also the line of code that shows why this error is being thrown (the volume is simply still attached to the other node).
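To make the distinction concrete, here is a rough Go sketch of how a controller plugin could map the two conditions onto gRPC status codes as the CSI spec describes them; this is purely illustrative (the helper, node IDs, and compatibility check are hypothetical), not the EBS plugin's actual code:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkPublish is a hypothetical helper illustrating the CSI error codes
// discussed above. attachedNode is the node the volume is currently attached
// to ("" if unattached); requestedNode is the node_id from the
// ControllerPublishVolume request; compatible reports whether the requested
// volume_capability/readonly flag matches the existing publication.
func checkPublish(attachedNode, requestedNode string, compatible bool) error {
	switch {
	case attachedNode == "":
		return nil // not attached anywhere, publish can proceed
	case attachedNode != requestedNode:
		// CSI spec: "volume published to another node" maps to
		// FAILED_PRECONDITION, which is what this issue argues should be
		// returned here.
		return status.Errorf(codes.FailedPrecondition,
			"volume already attached to node %s", attachedNode)
	case !compatible:
		// CSI spec: ALREADY_EXISTS is reserved for a volume already published
		// to *this* node but with an incompatible capability or readonly flag.
		return status.Error(codes.AlreadyExists,
			"volume already published to this node with incompatible capability")
	default:
		return nil // already published here and compatible: idempotent success
	}
}

func main() {
	// Volume attached to i-aaa while the new allocation lands on i-bbb.
	fmt.Println(checkPublish("i-aaa", "i-bbb", true))
}
```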
Anyway, once that was cleared up, looking back at the CSI controller logs I was able to see that Nomad seems to be calling `ControllerUnpublishVolume` but also `ControllerPublishVolume` before the unpublish is able to complete and free the volume. This results in a sometimes endless loop in which the `ControllerUnpublishVolume` tries to complete before the `ControllerPublishVolume` executes and seems to cancel it. In some cases this never resolves; in others it resolves after a few tries. Here is a small snippet of logs to show this:

Just to reiterate, this issue happens only when the volume is still attached to another node (EC2 instance) and can't be removed by the controller because the request is being cancelled. It will sometimes complete when the volume is detached successfully before the `ControllerPublishVolume` call goes through, but from first observations this seems somewhat random.

Reproduction steps
The easiest way I've found to reproduce this issue somewhat consistently is to schedule a job that uses a CSI volume and then do a non-forced, 3-minute drain (including system jobs) on the node. When the job that needs the CSI volume is rescheduled to a new node in the same AZ, the issue can be observed.
Register an EBS volume in the same region/AZ. I followed the steps at https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes, with `us-west-2b-a` as my volume id/name.
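A volume registration along the lines of that tutorial might look roughly like the following sketch; the `external_id` value, the `aws-ebs0` plugin ID, and the `volume.hcl` file name are assumptions drawn from the tutorial rather than from this report:

```hcl
# volume.hcl — registered with: nomad volume register volume.hcl
type            = "csi"
id              = "us-west-2b-a"           # volume id/name used in this report
name            = "us-west-2b-a"
external_id     = "vol-0123456789abcdef0"  # placeholder EBS volume ID
plugin_id       = "aws-ebs0"               # plugin ID from the tutorial
access_mode     = "single-node-writer"
attachment_mode = "file-system"
```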
Register the mysql job:
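A minimal sketch of such a job, modeled on the Learn tutorial's mysql example and assuming the `us-west-2b-a` volume registered above (the datacenter, image, resources, and password are placeholders):

```hcl
job "mysql-server" {
  datacenters = ["dc1"]           # placeholder datacenter
  type        = "service"

  group "mysql-server" {
    # Claim the CSI volume registered above.
    volume "mysql" {
      type      = "csi"
      source    = "us-west-2b-a"  # volume id/name from this report
      read_only = false
    }

    task "mysql-server" {
      driver = "docker"

      # Mount the claimed volume into the container.
      volume_mount {
        volume      = "mysql"
        destination = "/var/lib/mysql"
        read_only   = false
      }

      config {
        image = "mysql:5.7"       # placeholder image
      }

      env {
        MYSQL_ROOT_PASSWORD = "password"  # placeholder secret
      }

      resources {
        cpu    = 500
        memory = 1024
      }
    }
  }
}
```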
Drain the node as described above and watch the CSI controller logs: `ControllerPublishVolume` and `ControllerUnpublishVolume` are being called and stepping on each other. It might take a few tries, as each attempt can have different results from what I've found. Sometimes it will self-resolve if the unpublish is called early enough, or if the volume is successfully detached while the node is draining.

Nomad Task Event Log