-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle case of recovery from resize failures #187
Handle case of recovery from resize failures #187
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @jsafrane |
/assign @jingxu97 @xing-yang |
2aa9ed0
to
4cdc2b7
Compare
fix bug with switch..case statement
4cdc2b7
to
9f08c69
Compare
I tested it with AWS EBS (it allows one expansion of a volume in 6 hours, so it can throw nice errors). |
// checks if pv can be expanded | ||
func (ctrl *resizeController) pvCanBeExpanded(pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) bool { | ||
if !ctrl.resizer.CanSupport(pv, pvc) { | ||
klog.V(4).Infof("Resizer %q doesn't support PV %q", ctrl.name, pv.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it can be a warning because you are calling resize, but resizer does not support it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can get called on all sort of PVs. such as intree pv,pvc etc, which are not resizable by external-resizer. I think Info is suitable since most people will be running both intree expand_controller and this controller side-by-side for now.
_, _, err, _ := ctrl.expandAndRecover(pvc, pv) | ||
return err | ||
} else { | ||
if !ctrl.pvNeedResize(pvc, pv) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seem pvNeedResize can be called before checking feature is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is pvNeedResize more like pv cannot be resized due to condition is not met? Does it consider an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is basically old flow - which checks if PV is resizable (such as, whether it is bound and bound to PVC which is being resized and whether it was already controller expanded and node expansion is pending). It is not a hard error if PV can not expanded at this moment. But anyways - this is old code and has kinda always worked okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I confused about pvCanBeExpanded and pvNeedResize, thinking they are the same function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I am sorry about similar names. I am hoping to delete pvNeedResize
when this feature goes beta.
|
||
return ctrl.resizePVC(pvc, pv) | ||
if utilfeature.DefaultFeatureGate.Enabled(features.RecoverVolumeExpansionFailure) { | ||
_, _, err, _ := ctrl.expandAndRecover(pvc, pv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why return pv, pvc, but without using use objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now, mainly for tests.
pkg/controller/controller.go
Outdated
if err != nil { | ||
return nil, err | ||
return updatedPVC, err | ||
} | ||
return updatedPVC, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since updatedPVC
is always returned now, we can remove the "if err != nil" code block and just have "return updatedPVC, err" as this will cover the case when err is nil as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -545,19 +552,19 @@ func (ctrl *resizeController) markPVCResizeFinished( | |||
return nil | |||
} | |||
|
|||
func (ctrl *resizeController) patchClaim(oldPVC, newPVC *v1.PersistentVolumeClaim) (*v1.PersistentVolumeClaim, error) { | |||
patchBytes, err := util.GetPVCPatchData(oldPVC, newPVC) | |||
func (ctrl *resizeController) patchClaim(oldPVC, newPVC *v1.PersistentVolumeClaim, addResourceVersionCheck bool) (*v1.PersistentVolumeClaim, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add some comments about this option addResourceVersionCheck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
15fcebc
to
7ca824f
Compare
@jingxu97 @xing-yang addressed your comments. can you PTAL? |
/lgtm |
newPVC := pvc.DeepCopy() | ||
newPVC.Status.ResizeStatus = &expansionFailedOnController | ||
|
||
updatedPVC, err := ctrl.patchClaim(pvc, newPVC, false /* addResourceVersionCheck */) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason no need to check resource version here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general idea I had was if expansion failed we record the error even if we get a potential conflict. Since we are only patching resize status we should be okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the result of conflict? It might overwrite some fields which are updated by other places?
how about we also check addResourceVersionCheck here? Any potential issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conflict when addResourceVersionCheck
is present has nothing to do with content of the patch, but is about whether patch we are applying is with latest version of object or not. i.e - if two updates were performed on PVC concurrently and they updated different fields, then if addResourceVersionCheck
is present with second patch - you will still get error (even if second update changes entirely different field of PVC).
how about we also check addResourceVersionCheck here? Any potential issue?
We can, but if somehow our version of PVC was older then, the entire resizing operation has to be restarted before ExpansionFailedOnController
can be set for resizeStatus
. As such it does not affect the design but skipping addResourceVersionCheck
is an optimization - so as we set this field if expansion fails on controller and even if PVC our version of PVC was slightly older somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added as a comment in the code.
7ca824f
to
8c552c3
Compare
/lgtm |
/hold cancel |
Handle the case of recovery from resize failure.
xref kubernetes/enhancements#1790
This is mostly same implementation from intree controller code - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L1782 with minor changes because of node-only expansion handling as described in KEP.
Manually tested following scenarios (with hostpath driver):