Fail to detach ceph csi volume from a down node and migrate to another #13450
Comments
Hi @enaftali2! This issue has been fixed in #13301 and will ship in Nomad 1.3.2. Essentially the problem is that there's no way for the server to send a node unpublish command to the node plugin that's running on a down node without violating the CSI spec. We've decided to break strict compliance in order to make non-graceful shutdown work. In the meantime, you can avoid this condition by draining a node before shutting it down.
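For reference, draining before shutdown can be done from the node itself; a minimal sketch, assuming the Nomad CLI on that node can reach the local agent:

```sh
# Mark this node as draining and wait for its allocations to be
# migrated elsewhere before powering the machine off.
# -self targets the local node.
nomad node drain -enable -self
shutdown -h now
```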
Hi @tgross, thanks, you were very helpful. Since I saw the fix was merged to master, I built a new binary from master with the fix I needed; the issue is fixed and the cluster is behaving as expected.
That timeout is governed by the client heartbeat timeout, which isn't currently configurable. You can also force your jobs to stop immediately by setting `stop_after_client_disconnect` on the task group.
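A minimal sketch of how that looks in a job file — the job/group names and the timeout value here are placeholders, not taken from the original issue:

```hcl
job "mysql-server" {
  group "mysql-server" {
    # If the client running this allocation stops heartbeating, stop the
    # allocation after this interval so it can be rescheduled elsewhere
    # instead of waiting for the full lost-node handling.
    stop_after_client_disconnect = "1m"

    # ... rest of the group (volume, task, etc.) unchanged ...
  }
}
```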
Use a non-single-node
Nomad version
Nomad v1.3.0
Operating system and Environment details
Ubuntu 18.04.6 LTS
Issue
Hi
We are testing Ceph storage with the Nomad CSI volume plugin. For the POC I've created 3 VMs on GCP running a Ceph cluster and a Nomad cluster, with both client and server roles on all 3 VMs. CSI plugin deployment, volume creation, and volume attachment all work very well. I'm running a MySQL job, and when I restart the node running it I can see the job migrate to another node with the volume and its data.
FYI - to set up the CSI plugin, the MySQL job, and the volumes I used the guide in the Ceph documentation: https://docs.ceph.com/en/latest/rbd/rbd-nomad/
The issue starts when I run
shutdown -h now
on the node running the MySQL job. After about 10 minutes the allocation is marked as Lost and a new allocation tries to start but stays stuck in Pending status forever. As you can see in the logs below, Nomad fails to detach the volume from the node that is currently down.
I also want to mention that even though Ceph lost one node in this test as well, it still seems to be working and accessible, and there are no errors in the CSI plugin.
Reproduction steps
Shut down the machine running a job that uses a volume created from Ceph via the CSI plugin.
Expected Result
The job with the external volume migrates to another node with the same volume attached.
Actual Result
The job tries to migrate but gets stuck in Pending.
CSI Job files
ceph-csi-plugin-controller.nomad
ceph-csi-plugin-nodes.nomad
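The attached plugin jobs are not reproduced above. For readers following along, here is a heavily trimmed sketch of what a ceph-csi node plugin job typically looks like — the image version, plugin id, datacenter, and arguments are assumptions based on the linked Ceph guide, not the reporter's actual file:

```hcl
job "ceph-csi-plugin-nodes" {
  datacenters = ["dc1"]
  # Run one node plugin instance on every Nomad client.
  type = "system"

  group "nodes" {
    task "ceph-node" {
      driver = "docker"

      config {
        image      = "quay.io/cephcsi/cephcsi:v3.6.2"
        privileged = true
        args = [
          "--type=rbd",
          "--drivername=rbd.csi.ceph.com",
          "--nodeserver=true",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}"
        ]
      }

      # Registers this task with Nomad as the CSI node plugin "ceph-csi".
      csi_plugin {
        id        = "ceph-csi"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}
```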
Volume file
ceph-volume.hcl
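The attached volume file is likewise not shown. A minimal sketch of a Nomad CSI volume specification for a ceph-csi RBD volume follows; the cluster ID, pool, size, and credentials are placeholders, not the reporter's values:

```hcl
id        = "mysql"
name      = "mysql"
type      = "csi"
plugin_id = "ceph-csi"

capacity_min = "10GiB"
capacity_max = "10GiB"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

# Ceph credentials the plugin uses to create and attach the RBD image.
secrets {
  userID  = "admin"
  userKey = "<ceph auth key for client.admin>"
}

# ceph-csi RBD parameters: which cluster and pool to create the image in.
parameters {
  clusterID     = "<ceph cluster fsid>"
  pool          = "nomad"
  imageFeatures = "layering"
}

mount_options {
  fs_type = "ext4"
}
```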
Mysql job file
mysql.nomad
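And a sketch of the parts of the MySQL job that claim the volume — job name, image, password, and mount path are illustrative, not the reporter's actual job:

```hcl
job "mysql-server" {
  datacenters = ["dc1"]

  group "mysql-server" {
    # Claim the registered CSI volume for this group's allocations.
    volume "ceph-mysql" {
      type            = "csi"
      source          = "mysql" # id of the registered volume
      read_only       = false
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "mysql-server" {
      driver = "docker"

      config {
        image = "mysql:8.0"
      }

      env {
        MYSQL_ROOT_PASSWORD = "password" # placeholder
      }

      # Mount the claimed volume where MySQL keeps its data.
      volume_mount {
        volume      = "ceph-mysql"
        destination = "/var/lib/mysql"
        read_only   = false
      }
    }
  }
}
```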
Nomad logs