Use lock per VolumeId during NodeUnstageVolume operation to avoid conflicts during volume unmount #2811

chethanv28 · 2024-03-01T19:52:03Z

What this PR does / why we need it:
This PR adds a per-VolumeId lock to the NodeUnstageVolume operation.
We have observed a few cases where the first NodeUnstageVolume call takes a long time and is still in progress when k8s, assuming the first call has timed out, issues a second NodeUnstageVolume call. The second call succeeds because the target mount point is not found. A DetachVolume is therefore invoked while the first NodeUnstageVolume is still in progress, which in turn corrupts the volume. To avoid this, we can hold a lock per VolumeID for the duration of the NodeUnstageVolume operation.

A similar locking mechanism is applied to the NodePublish, NodeUnPublish, and NodeStage operations as well.
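
For readers who want a picture of the idea, here is a minimal sketch of a per-volume, non-blocking lock. It is not the code in this PR: the `VolumeLocks` type, `NewVolumeLocks`, and the `Release` method are assumptions made for illustration, and only the `TryAcquire(volumeID)` call shape is taken from the diff quoted later in this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// VolumeLocks records which volume IDs currently have an operation in flight.
// Hypothetical type for illustration; the PR's real implementation may differ.
type VolumeLocks struct {
	mu    sync.Mutex
	inUse map[string]struct{}
}

// NewVolumeLocks returns an empty lock set (hypothetical constructor).
func NewVolumeLocks() *VolumeLocks {
	return &VolumeLocks{inUse: make(map[string]struct{})}
}

// TryAcquire marks volumeID as busy and returns true, or returns false
// immediately if another operation already holds it. It never blocks.
func (vl *VolumeLocks) TryAcquire(volumeID string) bool {
	vl.mu.Lock()
	defer vl.mu.Unlock()
	if _, held := vl.inUse[volumeID]; held {
		return false
	}
	vl.inUse[volumeID] = struct{}{}
	return true
}

// Release marks volumeID as free again (assumed counterpart to TryAcquire).
func (vl *VolumeLocks) Release(volumeID string) {
	vl.mu.Lock()
	defer vl.mu.Unlock()
	delete(vl.inUse, volumeID)
}

func main() {
	locks := NewVolumeLocks()
	fmt.Println(locks.TryAcquire("vol-1")) // true: first call takes the lock
	fmt.Println(locks.TryAcquire("vol-1")) // false: a retry is rejected while the first call runs
	fmt.Println(locks.TryAcquire("vol-2")) // true: other volumes are unaffected
	locks.Release("vol-1")
	fmt.Println(locks.TryAcquire("vol-1")) // true: free again after release
}
```

Failing fast instead of blocking means a retried call gets an error back rather than queueing behind the slow one, so it cannot report success while the first unmount is still running.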

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Testing done:
Running e2e pipeline

Release note:

Use lock per VolumeId during NodeUnstageVolume operation

svcbot-qecnsdp · 2024-03-01T23:03:00Z

Started vanilla Block pipeline... Build Number: 2555

pkg/csi/service/common/volume_lock.go

svcbot-qecnsdp · 2024-03-02T04:50:27Z

Block vanilla build status: FAILURE 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 
------------------------------

Ran 1 of 816 Specs in 400.726 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 815 Skipped
PASS

Ginkgo ran 1 suite in 7m45.155169578s
Test Suite Passed
--
Ran 14 of 816 Specs in 12384.724 seconds
FAIL! -- 7 Passed | 7 Failed | 0 Pending | 802 Skipped
--- FAIL: TestE2E (12384.82s)
FAIL

Ginkgo ran 1 suite in 3h26m40.870854721s

Test Suite Failed

svcbot-qecnsdp · 2024-03-04T20:08:53Z

Started Vanilla block pre-checkin pipeline... Build Number: 2682

pkg/csi/service/node.go

deepakkinni · 2024-03-15T15:40:09Z

/approve

lipingxue · 2024-03-15T15:47:42Z

@chethanv28 Change looks good to me. Are you able to repro the issue locally and test it with your fix to make sure the issue we observed in SR is fixed?

svcbot-qecnsdp · 2024-03-19T15:26:43Z

Started Vanilla block pre-checkin pipeline... Build Number: 2700

svcbot-qecnsdp · 2024-03-19T16:06:58Z

Build ID: 2700
Block vanilla build status: SUCCESS 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 
------------------------------

Ran 1 of 816 Specs in 1889.692 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 815 Skipped
PASS

Ginkgo ran 1 suite in 32m27.542882256s
Test Suite Passed

pkg/common/cns-lib/node/volume_operation_lock.go

pkg/csi/service/node.go

divyenpatel · 2024-03-19T17:20:46Z

pkg/csi/service/node.go

+	// NodeUnstageVolume operation.
+	if acquired := driver.volumeLocks.TryAcquire(volumeID); !acquired {
+		return nil, logger.LogNewErrorCodef(log, codes.Aborted,
+			"An operation with the given Volume ID %s already exists", volumeID)


Prefix message with method name: NodeUnstageVolume

pkg/csi/service/node.go

divyenpatel · 2024-03-19T17:23:20Z

some more minor comments. Change looks good to me.

divyenpatel

/approve

k8s-ci-robot · 2024-03-19T20:48:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chethanv28, deepakkinni, divyenpatel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [chethanv28,deepakkinni,divyenpatel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

divyenpatel · 2024-03-19T20:49:03Z

/ok-to-test

divyenpatel · 2024-03-19T20:58:42Z

/lgtm

tgelter · 2024-04-10T15:42:18Z

Hello all, my team submitted the case with VMware support which led to this PR (thank you!). For our tracking purposes related to the bug this addresses, would you mind letting me know what release version should include this change?

tgelter · 2024-06-26T16:32:09Z

> Hello all, my team submitted the case with VMware support which led to this PR (thank you!). For our tracking purposes related to the bug this addresses, would you mind letting me know what release version should include this change?

I see it was included in v3.3.0, thanks!

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 1, 2024

k8s-ci-robot requested review from deepakkinni and divyenpatel March 1, 2024 19:52

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 1, 2024

chethanv28 force-pushed the add-locks-node-ds branch 3 times, most recently from b881bee to 882bf66 Compare March 1, 2024 20:31

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 1, 2024

deepakkinni reviewed Mar 2, 2024

pkg/csi/service/common/volume_lock.go Outdated

chethanv28 force-pushed the add-locks-node-ds branch from 882bf66 to f5d6fdf Compare March 4, 2024 18:43

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 4, 2024

chethanv28 force-pushed the add-locks-node-ds branch from f5d6fdf to 28c0612 Compare March 4, 2024 19:01

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 4, 2024

chethanv28 force-pushed the add-locks-node-ds branch from 28c0612 to 2b0d801 Compare March 4, 2024 20:14

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 4, 2024

deepakkinni reviewed Mar 4, 2024

pkg/csi/service/node.go Outdated

chethanv28 force-pushed the add-locks-node-ds branch from 2b0d801 to edc50e9 Compare March 4, 2024 22:59

chethanv28 force-pushed the add-locks-node-ds branch from edc50e9 to 902088e Compare March 19, 2024 00:30

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 19, 2024

divyenpatel reviewed Mar 19, 2024

pkg/common/cns-lib/node/volume_operation_lock.go

divyenpatel reviewed Mar 19, 2024

pkg/csi/service/node.go Outdated

divyenpatel reviewed Mar 19, 2024

pkg/csi/service/node.go Outdated

divyenpatel reviewed Mar 19, 2024

pkg/csi/service/node.go Outdated

Use lock per VolumeId during node Publish,UnPublish,Stage & Unstage operation

10ffe42

chethanv28 force-pushed the add-locks-node-ds branch from 902088e to 10ffe42 Compare March 19, 2024 18:53

divyenpatel approved these changes Mar 19, 2024

k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 19, 2024

k8s-ci-robot assigned divyenpatel Mar 19, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2024

divyenpatel merged commit 4b5a218 into kubernetes-sigs:master Mar 19, 2024
7 of 12 checks passed

dfajmon mentioned this pull request Sep 11, 2024

STOR-2014: Rebase to upstream v3.3.1 for OCP 4.18 openshift/vmware-vsphere-csi-driver#128

Merged
