Handle case of recovery from resize failures #187

Merged
merged 5 commits into from
Jan 21, 2022

Conversation

gnufied
Contributor

@gnufied gnufied commented Jan 13, 2022

Handle the case of recovery from resize failure.

xref kubernetes/enhancements#1790

This is mostly the same implementation as the in-tree controller code (https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L1782), with minor changes for node-only expansion handling as described in the KEP.

Manually tested the following scenarios (with the hostpath driver):

  • If controller expansion failed, it allows recovery to a size > pvc.Status.Cap
  • If node expansion is pending, it does not try expansion to the new size until the previous expansion has finished
  • Tested with consecutive changes that can cause resize success and failure
  • Tested with node expansion failing (control plane succeeded) and users reducing the size (no recovery is possible)
  • Tested with node expansion failing when no controller expansion is available (recovery is possible)
Add support to allow users to recover from expansion failures
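
The scenarios tested in the description can be read as a small decision rule. The following is only a rough sketch of that rule, not the code in this PR: `resizeState`, `canRecoverTo`, and the `controllerExpanded` flag are hypothetical names, sizes are plain byte counts rather than Kubernetes quantities, and requiring a size above the current status capacity in the node-only case is an assumption (the description only says recovery is possible there).

```go
package main

import "fmt"

// resizeState is a hypothetical stand-in for the resize state recorded on the PVC.
type resizeState int

const (
	controllerExpansionFailed resizeState = iota
	nodeExpansionPending
	nodeExpansionFailed
)

// canRecoverTo reports whether a reduced request of newSize bytes would be
// acted on. controllerExpanded is true when the control-plane part of the
// previous expansion already succeeded.
func canRecoverTo(state resizeState, controllerExpanded bool, statusCap, newSize int64) bool {
	switch state {
	case nodeExpansionPending:
		// The previous expansion must finish before a new size is tried.
		return false
	case nodeExpansionFailed:
		if controllerExpanded {
			// The control plane already grew the volume; no recovery is possible.
			return false
		}
	}
	// Assumption: every recoverable case needs a size above the current capacity.
	return newSize > statusCap
}

func main() {
	const gib = int64(1) << 30
	fmt.Println(canRecoverTo(controllerExpansionFailed, false, 1*gib, 2*gib))
	fmt.Println(canRecoverTo(nodeExpansionFailed, true, 1*gib, 2*gib))
}
```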

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Jan 13, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 13, 2022
@gnufied
Contributor Author

gnufied commented Jan 13, 2022

/assign @jsafrane

@gnufied
Contributor Author

gnufied commented Jan 13, 2022

/assign @jingxu97 @xing-yang

fix bug with switch..case statement
@gnufied gnufied force-pushed the fix-recover-resize branch from 4cdc2b7 to 9f08c69 Compare January 14, 2022 19:15
pkg/controller/expand_and_recover.go Outdated Show resolved Hide resolved
pkg/controller/expand_and_recover.go Show resolved Hide resolved
pkg/controller/expand_and_recover.go Outdated Show resolved Hide resolved
pkg/controller/expand_and_recover.go Show resolved Hide resolved
pkg/controller/resize_status.go Outdated Show resolved Hide resolved
pkg/controller/resize_status.go Show resolved Hide resolved
pkg/controller/expand_and_recover_test.go Outdated Show resolved Hide resolved
pkg/controller/expand_and_recover_test.go Outdated Show resolved Hide resolved
pkg/controller/expand_and_recover_test.go Show resolved Hide resolved
@jsafrane
Contributor

I tested it with AWS EBS (it allows one expansion of a volume in 6 hours, so it can throw nice errors).
/lgtm
/hold
(for squash)

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jan 20, 2022
// checks if pv can be expanded
func (ctrl *resizeController) pvCanBeExpanded(pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) bool {
if !ctrl.resizer.CanSupport(pv, pvc) {
klog.V(4).Infof("Resizer %q doesn't support PV %q", ctrl.name, pv.Name)

Maybe this could be a warning, since you are calling resize but the resizer does not support it?

Contributor Author

It can get called on all sorts of PVs, such as in-tree PVs and PVCs, which are not resizable by external-resizer. I think Info is suitable since most people will be running both the in-tree expand_controller and this controller side by side for now.

_, _, err, _ := ctrl.expandAndRecover(pvc, pv)
return err
} else {
if !ctrl.pvNeedResize(pvc, pv) {

It seems pvNeedResize could be called before checking whether the feature is enabled?

Is pvNeedResize more like "the PV cannot be resized because a condition is not met"? Is that considered an error?

Contributor Author

@gnufied gnufied Jan 20, 2022

This is basically the old flow, which checks whether the PV is resizable (such as whether it is bound, bound to the PVC that is being resized, and whether it was already controller-expanded with node expansion pending). It is not a hard error if the PV cannot be expanded at this moment. Anyway, this is old code and has always worked okay.

Oh, I confused pvCanBeExpanded and pvNeedResize, thinking they were the same function.

Contributor Author

Yeah, sorry about the similar names. I am hoping to delete pvNeedResize when this feature goes beta.
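
For readers following this thread, the branching under discussion can be sketched roughly as follows, based only on the diff excerpts shown in this conversation. `syncPath` and its boolean parameters are hypothetical stand-ins: the real code consults the `RecoverVolumeExpansionFailure` feature gate and calls `ctrl.pvNeedResize(pvc, pv)`, and the returned strings merely name the branch that would run.

```go
package main

import "fmt"

// syncPath names the branch that would handle a claim.
// recoveryEnabled stands in for the feature-gate check and
// pvNeedResize for the result of the old-flow precondition check.
func syncPath(recoveryEnabled, pvNeedResize bool) string {
	if recoveryEnabled {
		// New flow: pvNeedResize is not consulted at all.
		return "expandAndRecover"
	}
	if !pvNeedResize {
		// Old flow: an unmet precondition is not an error, just a no-op.
		return "nothing to do"
	}
	return "resizePVC"
}

func main() {
	fmt.Println(syncPath(true, false))
	fmt.Println(syncPath(false, false))
	fmt.Println(syncPath(false, true))
}
```

The point of the sketch is that pvNeedResize only matters on the old path, which is why it can be deleted once the feature gate is always on.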


return ctrl.resizePVC(pvc, pv)
if utilfeature.DefaultFeatureGate.Enabled(features.RecoverVolumeExpansionFailure) {
_, _, err, _ := ctrl.expandAndRecover(pvc, pv)

Why return pv and pvc without using those objects?

Contributor Author

for now, mainly for tests.

if err != nil {
return nil, err
return updatedPVC, err
}
return updatedPVC, nil
Contributor

Since updatedPVC is always returned now, we can remove the "if err != nil" code block and just have "return updatedPVC, err" as this will cover the case when err is nil as well.

Contributor Author

Fixed.

@@ -545,19 +552,19 @@ func (ctrl *resizeController) markPVCResizeFinished(
return nil
}

func (ctrl *resizeController) patchClaim(oldPVC, newPVC *v1.PersistentVolumeClaim) (*v1.PersistentVolumeClaim, error) {
patchBytes, err := util.GetPVCPatchData(oldPVC, newPVC)
func (ctrl *resizeController) patchClaim(oldPVC, newPVC *v1.PersistentVolumeClaim, addResourceVersionCheck bool) (*v1.PersistentVolumeClaim, error) {

could you add some comments about this option addResourceVersionCheck?

Contributor Author

Fixed.

@gnufied gnufied force-pushed the fix-recover-resize branch from 15fcebc to 7ca824f Compare January 20, 2022 22:53
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2022
@gnufied
Contributor Author

gnufied commented Jan 20, 2022

@jingxu97 @xing-yang addressed your comments. can you PTAL?

@xing-yang
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2022
newPVC := pvc.DeepCopy()
newPVC.Status.ResizeStatus = &expansionFailedOnController

updatedPVC, err := ctrl.patchClaim(pvc, newPVC, false /* addResourceVersionCheck */)

What is the reason there is no need to check the resource version here?

Contributor Author

The general idea I had was if expansion failed we record the error even if we get a potential conflict. Since we are only patching resize status we should be okay.

What is the result of a conflict? Could it overwrite fields that were updated elsewhere?
how about we also check addResourceVersionCheck here? Any potential issue?

Contributor Author

@gnufied gnufied Jan 21, 2022

The conflict you get when addResourceVersionCheck is set has nothing to do with the content of the patch; it is about whether the patch is being applied against the latest version of the object. That is, if two updates were performed on the PVC concurrently and they touched different fields, the second patch will still fail if it carries addResourceVersionCheck (even though it changes an entirely different field of the PVC).

how about we also check addResourceVersionCheck here? Any potential issue?

We can, but if our version of the PVC was somehow older, the entire resizing operation has to be restarted before ExpansionFailedOnController can be set for resizeStatus. It does not affect the design, but skipping addResourceVersionCheck is an optimization: we set this field when expansion fails on the controller even if our version of the PVC was slightly older.

Contributor Author

Added as a comment in the code.
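
To make the trade-off in this thread concrete, here is a minimal sketch of how such a flag could change the patch body. It is not the PR's `patchClaim` or `util.GetPVCPatchData`: `buildStatusPatch` is a hypothetical helper, the status string is only a placeholder taken from the discussion, and the sketch assumes a JSON merge-style patch in which including `metadata.resourceVersion` makes the API server reject the patch when the stored object has moved on.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildStatusPatch returns a patch body that sets status.resizeStatus.
// When addResourceVersionCheck is true, the caller's resourceVersion is
// included so the server can refuse the patch if the object has changed.
func buildStatusPatch(resizeStatus, resourceVersion string, addResourceVersionCheck bool) ([]byte, error) {
	patch := map[string]interface{}{
		"status": map[string]interface{}{"resizeStatus": resizeStatus},
	}
	if addResourceVersionCheck {
		patch["metadata"] = map[string]interface{}{"resourceVersion": resourceVersion}
	}
	return json.Marshal(patch)
}

func main() {
	for _, check := range []bool{true, false} {
		body, err := buildStatusPatch("ExpansionFailedOnController", "42", check)
		if err != nil {
			panic(err)
		}
		fmt.Printf("addResourceVersionCheck=%v: %s\n", check, body)
	}
}
```

Without the check, the patch touches only the resize status, so a slightly stale local copy still gets the failure recorded; with the check, the same stale copy would produce a conflict and force a retry of the whole operation.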

@gnufied gnufied force-pushed the fix-recover-resize branch from 7ca824f to 8c552c3 Compare January 21, 2022 00:50
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 21, 2022
@jingxu97

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 21, 2022
@gnufied
Contributor Author

gnufied commented Jan 21, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9f5f2c2 into kubernetes-csi:master Jan 21, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants