Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP 284: Add PRR for volume expansion feature #3195

Merged
merged 5 commits into from
Feb 2, 2022

Conversation

gnufied
Copy link
Member

@gnufied gnufied commented Jan 27, 2022

This adds PRR for xref#284.. Volume expansion feature has been in beta since v1.11 and has sister feature gates which cover other aspects of expansion namely:

  1. ExpandPersistentVolumes - general and parent expansion feature.
  2. ExpandInUsePersistentVolumes - online volume expansion. This is the default expansion mode.
  3. ExpandCSIVolumes - Enable expansion of CSI volumes. Also in beta.

Currently I am using - this single KEP to migrate and move all the above feature-gates to GA (rather than doing them individually). At this point in v1.23 they are all very related and same. It made sense to add new feature gates when we were adding features incrementally but after 8 or 9 releases, I am not sure it makes sense.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 27, 2022
@k8s-ci-robot k8s-ci-robot requested a review from msau42 January 27, 2022 20:09
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jan 27, 2022
@k8s-ci-robot k8s-ci-robot requested a review from saad-ali January 27, 2022 20:09
@gnufied
Copy link
Member Author

gnufied commented Jan 27, 2022

cc @msau42 @jsafrane @xing-yang

/sig storage

@gnufied
Copy link
Member Author

gnufied commented Jan 27, 2022

/assign @deads2k

@@ -0,0 +1,7 @@
kep-number: 284
alpha:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we have PRR for alpha and beta? If not, I think we can remove those entries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

keps/sig-storage/284-enable-volume-expansion/README.md Outdated Show resolved Hide resolved

###### What specific metrics should inform a rollback?

The `volume_mount` operation failure metric - `storage_operation_errors_total{operation_name=volume_mount}`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about controller and CSI failures? Do we have failure metrics that's more fine-grained to just the expand operation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage_operation_errors_total{operation_name=expand_volume} and storage_operation_errors_total{operation_name=volume_fs_resize}?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we do not any specific metrics from external-resizer that are more fine grained. We will have to add those I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some text about adding new metric from external-resizer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come kubernetes-csi/external-resizer#67 does not suffice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I did originally document csi_sidecar_operations_seconds - but was wondering if we still need intree equivalent but I have come to the conclusion that, we don't. It does mean that, we need to document what intree metrics for these operations are and what are the CSI equivalents. So I have just noted that point and removed the need for new metrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also document those metrics in this section?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

- [x] Events
- Event Reason: External resizer is resizing volume pvc-a71483ed-a5bc-48fa-9151-ca41e7e6634e
- [x API .status
- Condition name: "Resizing" or "FileSystemResizePending"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a condition that specifies if a resize finished successfully?

Copy link
Member Author

@gnufied gnufied Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All resizing related conditions are removed when resizing is finished, but there is no specific condition, to indicate successful completion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conditions here seem to indicate a resizing operation has started but not completed yet. Is there another event that indicates resize has completed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there is VolumeResizeSuccessful event on PVC which is emitted when resizing is finished successfully. I will document it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably remove this condition because it doesn't indicate if it was successful or not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which specific condition or did you mean event?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant remove the "Pending" conditions, because they cannot determine if resize was successful or not. Or are you saying they can at least determine that controlle expand was successful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove the Pending conditions, but since it is part of beta API - I thought we have to go through some sort of deprecation phases. Or can we remove it right away in this release?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other thing is - documentation. We have documented FileSystemResizePending as an example of - pending condition when may be a driver supports only offline expansion. All of that documentation has been around for more than 2 years now and needs to be updated if we remove the condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed for successful completion condition.

- [Optional] Aggregation method: percentile
- Components exposing the metric: kube-controller-manager
- controller expansion operation errors:
- Metric name: storage_operation_errors_total{operation_name=expand_volume}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage_operation_duration_seconds now also includes return status: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/metrics.go#L49

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh thanks. updated this description.


- Expansion can be permanently stuck:
- Detection: Check conditions on `pvc.status`
- Mitigations: If expansion is stuck permanently because of issues in backend and can not be recovered then, it requires manual intervention. Steps to recover from expansion failures are documented in - https://kubernetes.io/docs/concepts/storage/persistent-volumes/#recovering-from-failure-when-expanding-volumes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the recover from resize feature eliminate these manual steps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in some cases, if no recovery is possible - say when volume was expanded in controller but failing on the node, then recover from resize failure feature will not help. So admins may still have to take some action if that happens.

- Will enabling / disabling the feature require downtime of the control
plane?
- Will enabling / disabling the feature require downtime or reprovisioning
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you answer this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


###### What specific metrics should inform a rollback?

The `volume_mount` operation failure metric - `storage_operation_errors_total{operation_name=volume_mount}`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage_operation_errors_total{operation_name=expand_volume} and storage_operation_errors_total{operation_name=volume_fs_resize}?

keps/sig-storage/284-enable-volume-expansion/README.md Outdated Show resolved Hide resolved
keps/sig-storage/284-enable-volume-expansion/README.md Outdated Show resolved Hide resolved
@gnufied
Copy link
Member Author

gnufied commented Jan 31, 2022

@xing-yang @msau42 addressed those comments PTAL.

@deads2k
Copy link
Contributor

deads2k commented Jan 31, 2022

PRR looks good, lgtm is up to the sig

/approve

@gnufied
Copy link
Member Author

gnufied commented Feb 1, 2022

@msau42 addressed your comments. PTAL.

@msau42
Copy link
Member

msau42 commented Feb 1, 2022

/approve
just some minor nits

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, gnufied, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2022
@gnufied gnufied force-pushed the update-volume-expansion-kep branch from 1367f77 to 133ec96 Compare February 2, 2022 02:23
@gnufied gnufied force-pushed the update-volume-expansion-kep branch from 133ec96 to e74df66 Compare February 2, 2022 02:29
@msau42
Copy link
Member

msau42 commented Feb 2, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
@k8s-ci-robot k8s-ci-robot merged commit f805470 into kubernetes:master Feb 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants