csi: make volume GC in job deregister safely async #7632

tgross · 2020-04-05T14:49:17Z

Partial fix for #7629
Includes commits from #7628

The Job.Deregister call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the Job.Deregister. This allows nomad job stop to return immediately. In order to make this work, I've changed the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. I'm smuggling the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.

This doesn't entirely fix #7629 because the first GC attempt for volumes with a controller with fail and be re-queued. I've decreased the client-side controller timeout to reduce the wait time as a stop-gap, but we'll want to revisit it.

This leaves the E2E tests with nomad stop -purge flaky, depending on timing. For now I've changed them so as not to purge until we've corrected the problem more permanently.

tgross · 2020-04-05T19:06:46Z

Passing e2e:

▶ go test -v . -suite=CSI
=== RUN   TestE2E
=== RUN   TestE2E/Affinity
    TestE2E/Affinity: framework.go:181: skipping suite 'Affinity': only running suite "CSI"
=== RUN   TestE2E/clientstate
    TestE2E/clientstate: framework.go:181: skipping suite 'clientstate': only running suite "CSI"
=== RUN   TestE2E/Connect
    TestE2E/Connect: framework.go:181: skipping suite 'Connect': only running suite "CSI"
=== RUN   TestE2E/ConnectACLs
    TestE2E/ConnectACLs: framework.go:181: skipping suite 'ConnectACLs': only running suite "CSI"
=== RUN   TestE2E/Consul
    TestE2E/Consul: framework.go:181: skipping suite 'Consul': only running suite "CSI"
=== RUN   TestE2E/Consul_Template
    TestE2E/Consul_Template: framework.go:181: skipping suite 'Consul Template': only running suite "CSI"
=== RUN   TestE2E/CSI
=== RUN   TestE2E/CSI/*csi.CSIVolumesTest
=== RUN   TestE2E/CSI/*csi.CSIVolumesTest/TestEBSVolumeClaim
=== RUN   TestE2E/CSI/*csi.CSIVolumesTest/TestEFSVolumeClaim
=== RUN   TestE2E/Deployment
    TestE2E/Deployment: framework.go:181: skipping suite 'Deployment': only running suite "CSI"
=== RUN   TestE2E/simple
    TestE2E/simple: framework.go:181: skipping suite 'simple': only running suite "CSI"
=== RUN   TestE2E/Host_Volumes
    TestE2E/Host_Volumes: framework.go:181: skipping suite 'Host Volumes': only running suite "CSI"
=== RUN   TestE2E/Metrics
    TestE2E/Metrics: framework.go:181: skipping suite 'Metrics': only running suite "CSI"
=== RUN   TestE2E/Nomad_exec
    TestE2E/Nomad_exec: framework.go:181: skipping suite 'Nomad exec': only running suite "CSI"
=== RUN   TestE2E/Spread
    TestE2E/Spread: framework.go:181: skipping suite 'Spread': only running suite "CSI"
=== RUN   TestE2E/SystemScheduler
    TestE2E/SystemScheduler: framework.go:181: skipping suite 'SystemScheduler': only running suite "CSI"
=== RUN   TestE2E/TaskEvents
    TestE2E/TaskEvents: framework.go:181: skipping suite 'TaskEvents': only running suite "CSI"
--- PASS: TestE2E (119.37s)
    --- SKIP: TestE2E/Affinity (0.00s)
    --- SKIP: TestE2E/clientstate (0.00s)
    --- SKIP: TestE2E/Connect (0.00s)
    --- SKIP: TestE2E/ConnectACLs (0.00s)
    --- SKIP: TestE2E/Consul (0.00s)
    --- SKIP: TestE2E/Consul_Template (0.00s)
    --- PASS: TestE2E/CSI (119.36s)
        --- PASS: TestE2E/CSI/*csi.CSIVolumesTest (119.28s)
            --- PASS: TestE2E/CSI/*csi.CSIVolumesTest/TestEBSVolumeClaim (60.69s)
            --- PASS: TestE2E/CSI/*csi.CSIVolumesTest/TestEFSVolumeClaim (58.40s)
    --- SKIP: TestE2E/Deployment (0.00s)
    --- SKIP: TestE2E/simple (0.00s)
    --- SKIP: TestE2E/Host_Volumes (0.00s)
    --- SKIP: TestE2E/Metrics (0.00s)
    --- SKIP: TestE2E/Nomad_exec (0.00s)
    --- SKIP: TestE2E/Spread (0.00s)
    --- SKIP: TestE2E/SystemScheduler (0.00s)
    --- SKIP: TestE2E/TaskEvents (0.00s)
PASS
ok      github.com/hashicorp/nomad/e2e  119.484s

langmartin

This makes sense, and looks good to me

The CSI plugins uses the external volume ID for all operations, but the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the mount paths. Pass the External ID as an arg in the RPC call so that the unpublish workflows have it without calling back to the server to find the external ID. The controller CSI plugins need the CSI node ID (or in other words, the storage provider's view of node ID like the EC2 instance ID), not the Nomad node ID, to determine how to detach the external volume.

The `Job.Deregister` call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the `Job.Deregister`. This allows `nomad job stop` to return immediately. In order to make this work, this changeset changes the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. We smuggle the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.

github-actions · 2023-01-10T02:18:00Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross force-pushed the csi_avoid_blocking_deregister branch 2 times, most recently from bf8e1ec to b4ae84b Compare April 5, 2020 19:04

tgross force-pushed the csi_avoid_blocking_deregister branch from b4ae84b to d569d6a Compare April 5, 2020 19:26

tgross mentioned this pull request Apr 5, 2020

csi: controller plugin timeouts #7629

Closed

tgross added the theme/storage label Apr 5, 2020

tgross mentioned this pull request Apr 5, 2020

csi: unpublish workflow ID mismatches #7626

Closed

tgross marked this pull request as ready for review April 5, 2020 19:27

tgross requested a review from langmartin April 5, 2020 19:27

tgross added this to the 0.11.0 milestone Apr 5, 2020

langmartin approved these changes Apr 6, 2020

View reviewed changes

tgross added 3 commits April 6, 2020 09:40

e2e/csi: add waiting for alloc stop

97f1ccd

tgross force-pushed the csi_avoid_blocking_deregister branch from d569d6a to 97f1ccd Compare April 6, 2020 13:41

tgross merged commit f858d4e into master Apr 6, 2020

tgross deleted the csi_avoid_blocking_deregister branch April 6, 2020 14:15

tgross mentioned this pull request Apr 6, 2020

csi: unpublish workflow ID mismatches #7628

Closed

github-actions bot locked as resolved and limited conversation to collaborators Jan 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

csi: make volume GC in job deregister safely async #7632

csi: make volume GC in job deregister safely async #7632

tgross commented Apr 5, 2020 •

edited

Loading

tgross commented Apr 5, 2020

langmartin left a comment

github-actions bot commented Jan 10, 2023

csi: make volume GC in job deregister safely async #7632

csi: make volume GC in job deregister safely async #7632

Conversation

tgross commented Apr 5, 2020 • edited Loading

tgross commented Apr 5, 2020

langmartin left a comment

Choose a reason for hiding this comment

github-actions bot commented Jan 10, 2023

tgross commented Apr 5, 2020 •

edited

Loading