Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use lock per VolumeId during NodeUnstageVolume operation to avoid conflicts during volume unmount #2811

Merged
merged 1 commit into from
Mar 19, 2024

Conversation

chethanv28
Copy link
Collaborator

@chethanv28 chethanv28 commented Mar 1, 2024

What this PR does / why we need it:
PR is adding usage of locking mechanism per VolumeId during NodeUnstageVolume operation.
There have few observations where the 1st NodeUnstageVolume call takes more time and is till going on. Meanwhile, k8s will issue a 2nd NodeUnstageVolume call assuming the 1st NodeUnstageVolume has timed out. The 2nd call succeeds as the target Mountpoint is not found. Therefore, a DetachVolume will be invoked while the 1st NodeUnstageVolume is still in-progress and in-turn corrupts the volume. To avoid the above issue, we can keep a lock per VolumeID during the NodeUnstageVolume operation.

A similar locking mechanism is applied to NodePublish, NodeUnPublish, NodeStage operations as well

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Testing done:
Running e2e pipeline

Release note:

Use lock per VolumeId during NodeUnstageVolume operation

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 1, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 1, 2024
@chethanv28 chethanv28 force-pushed the add-locks-node-ds branch 3 times, most recently from b881bee to 882bf66 Compare March 1, 2024 20:31
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 1, 2024
@svcbot-qecnsdp
Copy link

Started vanilla Block pipeline... Build Number: 2555

@svcbot-qecnsdp
Copy link

Block vanilla build status: FAILURE 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 
------------------------------

Ran 1 of 816 Specs in 400.726 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 815 Skipped
PASS

Ginkgo ran 1 suite in 7m45.155169578s
Test Suite Passed
--
Ran 14 of 816 Specs in 12384.724 seconds
FAIL! -- 7 Passed | 7 Failed | 0 Pending | 802 Skipped
--- FAIL: TestE2E (12384.82s)
FAIL

Ginkgo ran 1 suite in 3h26m40.870854721s

Test Suite Failed

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 4, 2024
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 4, 2024
@svcbot-qecnsdp
Copy link

Started Vanilla block pre-checkin pipeline... Build Number: 2682

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 4, 2024
pkg/csi/service/node.go Outdated Show resolved Hide resolved
@deepakkinni
Copy link
Contributor

/approve

@lipingxue
Copy link
Contributor

@chethanv28 Change looks good to me. Are you able to repro the issue locally and test it with your fix to make sure the issue we observed in SR is fixed?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 19, 2024
@svcbot-qecnsdp
Copy link

Started Vanilla block pre-checkin pipeline... Build Number: 2700

@svcbot-qecnsdp
Copy link

Build ID: 2700
Block vanilla build status: SUCCESS 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 
------------------------------

Ran 1 of 816 Specs in 1889.692 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 815 Skipped
PASS

Ginkgo ran 1 suite in 32m27.542882256s
Test Suite Passed

// NodeUnstageVolume operation.
if acquired := driver.volumeLocks.TryAcquire(volumeID); !acquired {
return nil, logger.LogNewErrorCodef(log, codes.Aborted,
"An operation with the given Volume ID %s already exists", volumeID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix message with method name: NodeUnstageVolume

@divyenpatel
Copy link
Member

some more minor comments. Change looks good to me.

Copy link
Member

@divyenpatel divyenpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chethanv28, deepakkinni, divyenpatel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chethanv28,deepakkinni,divyenpatel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@divyenpatel
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 19, 2024
@divyenpatel
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2024
@divyenpatel divyenpatel merged commit 4b5a218 into kubernetes-sigs:master Mar 19, 2024
7 of 12 checks passed
@tgelter
Copy link

tgelter commented Apr 10, 2024

Hello all, my team submitted the case with VMware support which led to this PR (thank you!). For our tracking purposes related to the bug this addresses, would you mind letting me know what release version should include this change?

@tgelter
Copy link

tgelter commented Jun 26, 2024

Hello all, my team submitted the case with VMware support which led to this PR (thank you!). For our tracking purposes related to the bug this addresses, would you mind letting me know what release version should include this change?

I see it was included in v3.3.0, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants