CSI: allocrunner w/ volumes fails to restore in csi_hook after client restart #11477

BlizzTom · 2021-11-09T00:09:00Z

Nomad version

Nomad v1.1.6 (b83d623fb5ff475d5e40df21e9e7a61834071078)

Issue is also present in 1.1.2 to 1.2.0 Beta.

Operating system and Environment details

Linux <hostname> 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

When the Nomad client is restarted without draining the node while an allocation has a CSI volume mounted, Nomad fails to restore the allocation but leaves the task process running.

Reproduction steps

Use a CSI plugin to mount a volume to a task
Restart the nomad process without draining the node

Expected Result

Allocation is restored

Actual Result

The allocation is marked failed, but the process remains running and the volume remains mounted.

Nomad Client logs (if appropriate)

2021-11-08T20:51:21.310Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
2021-11-08T20:51:21.310Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
2021-11-08T20:51:21.310Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
2021-11-08T20:51:21.338Z [WARN]  client.server_mgr: no servers available
2021-11-08T20:51:21.352Z [INFO]  client: started client: node_id=cb1b133f-724b-31b5-a4a2-226dcb11811e
2021-11-08T20:51:21.353Z [INFO]  client.gc: marking allocation for GC: alloc_id=40aa9f7a-fc98-f038-7e33-2778a00cf3b9
2021-11-08T20:51:21.355Z [WARN]  client.server_mgr: no servers available
2021-11-08T20:51:21.355Z [ERROR] client.alloc_runner: prerun failed: alloc_id=af3e5bb3-d229-1a3a-083d-f47304e30cf8 error="pre-run hook "csi_hook" failed: claim volumes: no servers"
2021-11-08T20:51:21.356Z [INFO]  agent.joiner: starting retry join: servers=nomad.service.cloud-insight.dmz.discovery.blizzard.net
2021-11-08T20:51:21.357Z [WARN]  client.server_mgr: no servers available
2021/11/08 20:51:21.359986 [INFO] (runner) creating new runner (dry: false, once: false)
2021/11/08 20:51:21.360571 [INFO] (runner) creating watcher
2021/11/08 20:51:21.360776 [INFO] (runner) starting
2021-11-08T20:51:21.362Z [INFO]  client.gc: marking allocation for GC: alloc_id=af3e5bb3-d229-1a3a-083d-f47304e30cf8
2021-11-08T20:51:21.385Z [INFO]  agent.joiner: retry join completed: initial_servers=1 agent_mode=client
2021-11-08T20:51:21.637Z [INFO]  client: node registration complete

Possibly related to #10833

Specifically, it appears that the csi_hook prerun requires the retry join to have completed before it can make the CSIVolume.Claim RPC call. However, the goroutines for the retry join and for restoring allocations race with each other, so the claim can run while the client still has no servers.


tgross · 2021-11-09T13:56:03Z

Hi @BlizzTom! This does seem to be related to #10833, but I don't think I expected to see that in the case where the client has simply restarted and not been marked lost.

tgross · 2022-02-03T17:54:12Z

Following up on this because #10833 has been closed out: on further review, it's pretty clear we should be handling the case where the servers are disconnected more safely. The changes in #11892 will partially help here, but we'll also need the upcoming work on disconnected client handling anyway. I'll be looking into this as part of other plugin work going on over the next few weeks.

tgross · 2022-02-23T21:58:40Z

Will be fixed by #12113, expected to ship in Nomad 1.3.0

github-actions · 2022-10-11T02:43:04Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

BlizzTom added the type/bug label Nov 9, 2021

tgross added the theme/storage label Nov 9, 2021

tgross changed the title ~~Nomad CSI Zombies Allocations on restart~~ CSI: allocrunner fails to restore after client restart Feb 3, 2022

tgross changed the title ~~CSI: allocrunner fails to restore after client restart~~ CSI: allocrunner w/ volumes fails to restore in csi_hook after client restart Feb 3, 2022

tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Feb 3, 2022

tgross self-assigned this Feb 17, 2022

tgross mentioned this issue Feb 23, 2022

CSI: retry claims from client #12113

Merged

tgross added this to the 1.3.0 milestone Feb 23, 2022

tgross mentioned this issue Feb 24, 2022

Jobs using CSI volume do not recover from the client failure without human intervention #12118

Closed

tgross closed this as completed in #12113 Feb 24, 2022

zizon mentioned this issue Mar 16, 2022

CSI Plugin Task Should restore first #12265

Closed

github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022

tgross added this to Nomad - Community Issues Triage Jun 24, 2024

tgross moved this to Done in Nomad - Community Issues Triage Jun 24, 2024
