Nomad version
Nomad v1.3.1 (2b054e3)
Operating system and Environment details
Ubuntu 20.04.4 LTS on AWS EC2
Issue
When an instance hosting a Docker alloc that uses a CSI EBS single-node-writer volume is terminated, the alloc gets stuck in the pending state when it's re-scheduled. Of note: when the instance is drained instead, the alloc is re-scheduled successfully, so the issue could be around the alloc being in a lost state or the volume claim not being released.
I'd been experiencing the issue described in #10927, so this may be related to #11932.
$ nomad alloc status -verbose f2f979c5
ID = f2f979c5-1bb4-b4c0-7dcc-e56b95b8b240
Eval ID = cac78636-7512-f0b5-d106-5d9b179a3825
Name = postgres.postgresql[0]
Node ID = 85f13716-0bb9-8e3a-fa6c-c356ae46c881
Node Name = i-0854e29d9017468d5
Job ID = postgres
Job Version = 0
Client Status = pending
Client Description = No tasks have started
Desired Status = run
Desired Description = <none>
Created = 2022-06-17T10:37:07-07:00
Modified = 2022-06-17T10:42:07-07:00
Evaluated Nodes = 2
Filtered Nodes = 0
Exhausted Nodes = 0
Allocation Time = 1.557752ms
Failures = 0
Allocation Addresses (mode = "bridge")
Label Dynamic Address
*connect-proxy-postgresql yes 172.31.31.11:20154 -> 20154
Task "connect-proxy-postgresql" (prestart sidecar) is "pending"
Task Resources
CPU Memory Disk Addresses
250 MHz 128 MiB 300 MiB
Task Events:
Started At = N/A
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2022-06-17T10:37:07-07:00 Received Task received by client
Task "postgresql" is "pending"
Task Resources
CPU Memory Disk Addresses
500 MHz 1.0 GiB 300 MiB
CSI Volumes:
Name ID Plugin Provider Schedulable Read Only Mount Options
pg_data pg_data aws-ebs ebs.csi.aws.com true false fs_type: ext4
Task Events:
Started At = N/A
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2022-06-17T10:37:07-07:00 Received Task received by client
Placement Metrics
Node binpack job-anti-affinity node-affinity node-reschedule-penalty final score
85f13716-0bb9-8e3a-fa6c-c356ae46c881 0.825 0 0 0 0.825
1e10e7c0-c1e0-fed4-941c-a18e5b1b7110 0.696 0 0 0 0.696
Reproduction steps
1. Launch an alloc with an EBS single-node-writer volume (see the job sketch under "Job file" below).
2. Terminate the instance on which it's running.
3. Watch it not proceed past the pending state.
Expected Result
Alloc is re-scheduled successfully on another client node.
Actual Result
Alloc is stuck in pending on another client node.
Job file (if appropriate)
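No job file was attached. As a point of reference only, here is a minimal sketch of the kind of volume registration and job spec that exercises this path, based on the pg_data volume shown in the alloc status above; the external EBS volume ID, Docker image, and mount path are placeholders (not from the original report), and the Connect sidecar from the alloc status is omitted for brevity.
# pg_data.volume.hcl -- registered with: nomad volume register pg_data.volume.hcl
id          = "pg_data"
name        = "pg_data"
type        = "csi"
plugin_id   = "aws-ebs"
external_id = "vol-0123456789abcdef0"  # placeholder EBS volume ID
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}
mount_options {
  fs_type = "ext4"
}
# postgres.nomad.hcl -- group-level CSI volume claim plus the task mount
job "postgres" {
  group "postgresql" {
    volume "pg_data" {
      type            = "csi"
      source          = "pg_data"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }
    task "postgresql" {
      driver = "docker"
      config {
        image = "postgres:14"  # placeholder image
      }
      volume_mount {
        volume      = "pg_data"
        destination = "/var/lib/postgresql/data"  # placeholder mount path
      }
    }
  }
}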
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Jun 17 16:33:41 ip-172-31-20-18 nomad[1537]: 2022-06-17T16:33:41.957Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=pg_data
Jun 17 16:33:41 ip-172-31-20-18 nomad[1537]: error=
Jun 17 16:33:41 ip-172-31-20-18 nomad[1537]: | 1 error occurred:
Jun 17 16:33:41 ip-172-31-20-18 nomad[1537]: | * missing external node ID: Unknown node: 66694e7d-0ccc-6fef-a699-1da135600b41
Jun 17 16:33:41 ip-172-31-20-18 nomad[1537]: |
Jun 17 16:33:41 ip-172-31-20-18 nomad[1537]:
Jun 17 16:54:10 ip-172-31-20-18 nomad[1537]: 2022-06-17T16:54:10.264Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Hi @nicolasscott! The node plugin on the original client must be running in order to free the claim. When a node terminates, Nomad can't know whether the node is simply temporarily disconnected or will never return; releasing the claim anyway risks data corruption for volumes that can be written to from multiple hosts (e.g. an NFS volume). If you're intentionally terminating the node, we strongly recommend draining it first.
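For reference, a drain before terminating the instance would look something like the following (node ID is a placeholder); it reschedules the allocations and gives the node plugin a chance to unpublish the volume before the instance goes away:
$ nomad node drain -enable -yes <node-id>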
However, we've decided to change the behavior so that we bypass the node unpublish step because the other major CSI implementation does it that way. So that's in #13301 and will ship in Nomad 1.3.2. I'm going to close this issue as a duplicate of #13264. Thanks!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.