Volume is not released during node drain, causing job to fail forever #8232
Hi @mishel170! A couple questions so that I can pin down the behavior:
Possibly related: #8080
Yes, it seems to be similar.
They run as system jobs, and it happens when the controller runs on the same node. Thanks for helping.
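For reference, a CSI controller plugin registered as a Nomad system job looks roughly like the sketch below; the plugin ID, Docker image, and arguments are illustrative assumptions, not configuration taken from this thread.

```hcl
# Minimal sketch of a CSI controller plugin run as a system job.
# Plugin ID, image tag, and args are illustrative assumptions.
job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]
  type        = "system"

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:v0.10.1"
        args = [
          "controller",
          "--endpoint=unix:///csi/csi.sock",
          "--logtostderr",
        ]
      }

      # Registers this task with Nomad as the controller plugin for "aws-ebs".
      csi_plugin {
        id        = "aws-ebs"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```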
@mishel170 I'm still investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.
Thanks @tgross. Unfortunately, running the controller as a system job didn't help. Looking forward to an update.
It looks like there are a couple of different problems in the node drain scenario:
What I'm attempting to do to fix this:
I wanted to give a little status update on where this is. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:
I've tested this out and the results are... better? Not as good as I'd like yet. As explained in #8232 (comment), the first replacement allocation that gets placed during a node drain fails and we can't do much about that at the moment. That error looks like this:
When this happens, we're still in the process of detaching the volume from the other node. The next allocation should then be able to place, but instead we're getting an error at first, and it only recovers after several attempts. From the logs I suspect we're seeing an error because the kernel hasn't picked up that the storage provider attached the device yet (or the EC2 API hasn't made the device available yet, in this case). Because there's only a single claim on the volume, we're going through the whole controller detach cycle with each attempt. I'll be investigating that on Monday to see if we can improve the situation at all. In the example below, the drained node is
Allocations
Logs from the node plugin for the first failed allocation vs. the successful run for the allocation:
There are no other errors at the client, and the controller logs all seem fine.
Controller logs a634b3c5, initial failed attempt:
Successful attempt from the controller:
Controller logs 7c4c4267, successful attempt:
I've also opened #8609 to cover the issue that we're not going to be able to fix in the 0.12.x timeframe.
Ok, one last item that isn't going to make it into the 0.12.2 release but can be worked around: if you have HA controllers, you need to make sure they don't land on the same client during node drains (#8628); see the sketch below. Otherwise, testing is showing that the work that's landed for 0.12.2 should close this out. That will be released shortly.
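One way to keep HA controller instances off the same client is a distinct_hosts constraint on the controller plugin job. This is a minimal sketch under the assumption of a service job with two controller instances; it is not the exact configuration discussed in #8628.

```hcl
# Sketch: keep HA controller plugin instances on different client nodes.
# Job/group names, image, and plugin details are illustrative assumptions.
job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]
  type        = "service"

  group "controller" {
    count = 2

    # Never place two controller allocations on the same client node.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:v0.10.1"
        args  = ["controller", "--endpoint=unix:///csi/csi.sock", "--logtostderr"]
      }

      csi_plugin {
        id        = "aws-ebs"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}
```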
Hey,
Using Nomad version 0.11.3 with an AWS EBS volume attached to the agents: while trying to drain the node, the volume sometimes (in fact, quite often) cannot be released from the EC2 instance, causing the following error during rescheduling:
failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: volume "vol-0e5f3d9ecd9a" is already published at node "i-0cf3069676ehgy5i2" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists
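For context, a job claims a registered CSI (EBS) volume roughly as in the following sketch; the volume name, Nomad volume ID, job/group names, and mount destination are assumptions for illustration, not taken from the original report.

```hcl
# Sketch of a task group claiming a registered CSI volume (Nomad 0.11-style syntax).
# Names and paths here are illustrative assumptions.
job "app" {
  datacenters = ["dc1"]

  group "app" {
    # Claim the CSI volume previously registered with Nomad.
    volume "ebs" {
      type      = "csi"
      source    = "ebs-vol"   # Nomad volume ID used at registration time (assumption)
      read_only = false
    }

    task "server" {
      driver = "docker"

      config {
        image = "busybox:1.32"
        args  = ["sleep", "infinity"]
      }

      # Mount the claimed volume into the task.
      volume_mount {
        volume      = "ebs"
        destination = "/data"
      }
    }
  }
}
```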
Operating system and Environment details
Amazon Linux 2 AMI 2.0.20190313 x86_64 HVM gp2
Best regards,
Mishel Liberman