-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP 284: Add PRR for volume expansion feature #3195
KEP 284: Add PRR for volume expansion feature #3195
Conversation
cc @msau42 @jsafrane @xing-yang /sig storage |
/assign @deads2k |
@@ -0,0 +1,7 @@ | |||
kep-number: 284 | |||
alpha: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we have PRR for alpha and beta? If not, I think we can remove those entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
|
||
###### What specific metrics should inform a rollback? | ||
|
||
The `volume_mount` operation failure metric - `storage_operation_errors_total{operation_name=volume_mount}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about controller and CSI failures? Do we have failure metrics that's more fine-grained to just the expand operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage_operation_errors_total{operation_name=expand_volume} and storage_operation_errors_total{operation_name=volume_fs_resize}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we do not any specific metrics from external-resizer that are more fine grained. We will have to add those I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some text about adding new metric from external-resizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come kubernetes-csi/external-resizer#67 does not suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I did originally document csi_sidecar_operations_seconds
- but was wondering if we still need intree equivalent but I have come to the conclusion that, we don't. It does mean that, we need to document what intree metrics for these operations are and what are the CSI equivalents. So I have just noted that point and removed the need for new metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also document those metrics in this section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
- [x] Events | ||
- Event Reason: External resizer is resizing volume pvc-a71483ed-a5bc-48fa-9151-ca41e7e6634e | ||
- [x API .status | ||
- Condition name: "Resizing" or "FileSystemResizePending" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a condition that specifies if a resize finished successfully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All resizing related conditions are removed when resizing is finished, but there is no specific condition, to indicate successful completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conditions here seem to indicate a resizing operation has started but not completed yet. Is there another event that indicates resize has completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes there is VolumeResizeSuccessful
event on PVC which is emitted when resizing is finished successfully. I will document it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably remove this condition because it doesn't indicate if it was successful or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which specific condition or did you mean event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant remove the "Pending" conditions, because they cannot determine if resize was successful or not. Or are you saying they can at least determine that controlle expand was successful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove the Pending conditions, but since it is part of beta API - I thought we have to go through some sort of deprecation phases. Or can we remove it right away in this release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing is - documentation. We have documented FileSystemResizePending
as an example of - pending condition when may be a driver supports only offline expansion. All of that documentation has been around for more than 2 years now and needs to be updated if we remove the condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed for successful completion condition.
- [Optional] Aggregation method: percentile | ||
- Components exposing the metric: kube-controller-manager | ||
- controller expansion operation errors: | ||
- Metric name: storage_operation_errors_total{operation_name=expand_volume} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage_operation_duration_seconds now also includes return status: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/metrics.go#L49
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh thanks. updated this description.
|
||
- Expansion can be permanently stuck: | ||
- Detection: Check conditions on `pvc.status` | ||
- Mitigations: If expansion is stuck permanently because of issues in backend and can not be recovered then, it requires manual intervention. Steps to recover from expansion failures are documented in - https://kubernetes.io/docs/concepts/storage/persistent-volumes/#recovering-from-failure-when-expanding-volumes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the recover from resize feature eliminate these manual steps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that in some cases, if no recovery is possible - say when volume was expanded in controller but failing on the node, then recover from resize failure feature will not help. So admins may still have to take some action if that happens.
- Will enabling / disabling the feature require downtime of the control | ||
plane? | ||
- Will enabling / disabling the feature require downtime or reprovisioning | ||
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you answer this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
||
###### What specific metrics should inform a rollback? | ||
|
||
The `volume_mount` operation failure metric - `storage_operation_errors_total{operation_name=volume_mount}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage_operation_errors_total{operation_name=expand_volume} and storage_operation_errors_total{operation_name=volume_fs_resize}?
@xing-yang @msau42 addressed those comments PTAL. |
PRR looks good, lgtm is up to the sig /approve |
@msau42 addressed your comments. PTAL. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, gnufied, msau42 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
1367f77
to
133ec96
Compare
133ec96
to
e74df66
Compare
/lgtm |
This adds PRR for xref#284.. Volume expansion feature has been in beta since v1.11 and has sister feature gates which cover other aspects of expansion namely:
Currently I am using - this single KEP to migrate and move all the above feature-gates to GA (rather than doing them individually). At this point in v1.23 they are all very related and same. It made sense to add new feature gates when we were adding features incrementally but after 8 or 9 releases, I am not sure it makes sense.