KEP-4650: StatefulSet Support for Updating Volume Claim Template

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Kubernetes currently does not support modifying the volumeClaimTemplates of a StatefulSet. This enhancement proposes to support such modifications and to automatically patch the associated PersistentVolumeClaim objects where applicable. Currently, PVC spec.resources.requests.storage, spec.volumeAttributesClassName, metadata.labels, and metadata.annotations can be patched. All updates to PersistentVolumeClaim objects can be coordinated with Pod updates to honor any dependencies between them.

Motivation

Currently there is very little that users can do to update the volumes of an existing StatefulSet. They can only expand the volumes, or modify them with a VolumeAttributesClass, by updating individual PersistentVolumeClaim objects as an ad-hoc operation. When the StatefulSet scales up, the new PVC(s) are created with the old configuration, which again requires manual intervention. This causes many headaches in a continuously evolving environment.

Goals

  • Allow users to update some fields of volumeClaimTemplates of a StatefulSet.
  • Automatically patch the associated PersistentVolumeClaim objects, without interrupting the running Pods.
  • Support updating PersistentVolumeClaim objects with OnDelete strategy.
  • Coordinate updates to Pod and PersistentVolumeClaim objects.
  • Provide accurate status and error messages to users when the update fails.

Non-Goals

  • Support automatic re-creation of PersistentVolumeClaim objects. We will never delete a PVC automatically.
  • Validate the updated volumeClaimTemplates the way a direct PVC patch is validated.
  • Update ephemeral volumes.
  • Patch PVCs that differ from the template, e.g. pre-existing PVCs adopted by the StatefulSet.

Proposal

  1. Change the API server to allow specific updates to the volumeClaimTemplates of a StatefulSet:

    • labels
    • annotations
    • resources.requests.storage
    • volumeAttributesClassName
  2. Modify StatefulSet controller to add PVC reconciliation logic.

  3. Collect the status of managed PVCs, and show them in the StatefulSet status.

Kubernetes API Changes

Changes to StatefulSet spec:

Introduce a new field in the StatefulSet spec, volumeClaimUpdatePolicy, to specify how to coordinate the update of PVCs and Pods (see the sketch after this list). Possible values are:

  • OnDelete: the default value; only update the PVC when the old PVC is deleted.
  • InPlace: patch the PVC in place if possible. This also includes the OnDelete behavior.
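
The following is a minimal Go sketch of how the new field could be expressed in the StatefulSet API types. The field name volumeClaimUpdatePolicy and its two values come from this proposal; the Go type and constant names are illustrative only.

```go
// Sketch only: volumeClaimUpdatePolicy and its values are from this KEP;
// the Go identifiers below are illustrative, not the final API.
package apps

// PersistentVolumeClaimUpdatePolicyType tells the StatefulSet controller how
// updates to volumeClaimTemplates are propagated to the owned PVCs.
type PersistentVolumeClaimUpdatePolicyType string

const (
	// OnDelete: a PVC picks up the new template only after the old PVC is
	// deleted (the default).
	OnDeleteVolumeClaimUpdatePolicy PersistentVolumeClaimUpdatePolicyType = "OnDelete"
	// InPlace: mutable fields of existing PVCs are patched in place when
	// possible; the OnDelete behavior still applies otherwise.
	InPlaceVolumeClaimUpdatePolicy PersistentVolumeClaimUpdatePolicyType = "InPlace"
)

// StatefulSetSpec excerpt showing only the proposed field.
type StatefulSetSpec struct {
	// VolumeClaimUpdatePolicy controls how PVC and Pod updates are coordinated.
	// +optional
	VolumeClaimUpdatePolicy PersistentVolumeClaimUpdatePolicyType `json:"volumeClaimUpdatePolicy,omitempty"`
}
```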

Changes to StatefulSet status:

Additionally collect the status of managed PVCs, and show them in the StatefulSet status.

For each volume claim template (see the sketch after this list):

  • compatible: the number of PVCs that are compatible with the template. These replicas will not be blocked on Pod recreation.
  • updating: the number of PVCs that are being updated in-place (e.g. expansion in progress).
  • overSized: the number of PVCs that are larger than the template.
  • totalCapacity: the sum of status.capacity of all the PVCs.
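
A hedged Go sketch of how the per-template counters above could be represented in the status; the struct and field names are illustrative, only the four quantities are taken from the list above.

```go
// Sketch only: the counters mirror the list above; the Go identifiers are
// illustrative, not the final API.
package apps

import "k8s.io/apimachinery/pkg/api/resource"

// VolumeClaimTemplateStatus summarizes the PVCs stamped out from one entry of
// volumeClaimTemplates.
type VolumeClaimTemplateStatus struct {
	// TemplateName identifies the volume claim template this entry refers to.
	TemplateName string `json:"templateName"`
	// Compatible counts PVCs that are compatible with the template; these
	// replicas will not be blocked on Pod re-creation.
	Compatible int32 `json:"compatible"`
	// Updating counts PVCs being updated in place (e.g. expansion in progress).
	Updating int32 `json:"updating"`
	// OverSized counts PVCs that are larger than the template requests.
	OverSized int32 `json:"overSized"`
	// TotalCapacity is the sum of status.capacity.storage of all the PVCs.
	TotalCapacity resource.Quantity `json:"totalCapacity"`
}
```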

Some existing fields in the status are also updated to reflect the status of the PVCs:

  • readyReplicas: in addition to Pods, also consider the PVC status. A PVC is not ready if:
    • volumeClaimUpdatePolicy is InPlace and the PVC is being updated;
  • availableReplicas: the total number of replicas whose Pod and PVCs have all been ready for at least minReadySeconds
  • currentRevision, updateRevision, currentReplicas, and updatedReplicas are updated to reflect the status of the PVCs.

With these changes, users can still use kubectl rollout status to monitor the update process, both for automated patching and for PVCs that need manual intervention.

Updated Reconciliation Logic

How to update PVCs:

  1. If volumeClaimUpdatePolicy is InPlace, and the volumeClaimTemplates and the actual PVC differ only in mutable fields (currently spec.resources.requests.storage, spec.volumeAttributesClassName, metadata.labels, and metadata.annotations), patch the PVC to the extent possible (see the sketch after this list).

    • spec.resources.requests.storage is patched to max(template spec, PVC status).
      • Never decrease the storage request below the PVC's current status. Note that decreasing the size in the PVC spec can help recover from a failed expansion if the RecoverVolumeExpansionFailure feature gate is enabled.
    • spec.volumeAttributesClassName is patched to the template value.
    • metadata.labels and metadata.annotations are patched with Server Side Apply.
  2. If it is not possible to make the PVC compatible, do nothing. However, when re-creating a Pod whose corresponding PVC is being deleted, wait for the deletion to complete, then create a new PVC together with the new Pod (already implemented).

  3. Use either the current or the updated revision of the volumeClaimTemplates to create/update the PVC, just like the Pod template.
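
A minimal Go sketch of the max(template spec, PVC status) rule for the storage request in step 1; desiredStorage is a hypothetical helper name.

```go
// Sketch only: desiredStorage is a hypothetical helper illustrating the
// "max(template spec, PVC status)" rule from step 1 above.
package statefulset

import corev1 "k8s.io/api/core/v1"

// desiredStorage returns the storage request to patch onto the PVC: the
// template's request, but never less than the capacity the volume already
// provides. A lower template value can still help recover a failed expansion
// when the RecoverVolumeExpansionFailure feature gate is enabled, because the
// floor is the PVC status, not its spec.
func desiredStorage(template, pvc *corev1.PersistentVolumeClaim) corev1.ResourceList {
	want := template.Spec.Resources.Requests[corev1.ResourceStorage]
	if have, ok := pvc.Status.Capacity[corev1.ResourceStorage]; ok && have.Cmp(want) > 0 {
		want = have
	}
	return corev1.ResourceList{corev1.ResourceStorage: want}
}
```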

When to update PVCs:

  1. Before advancing status.updatedReplicas to the next replica, check that the PVCs of the next replica are compatible with the new volumeClaimTemplates. If they are not, and we are not going to patch them automatically, wait for the user to delete or update the old PVCs manually.

  2. When doing a rolling update, a replica is considered ready if the Pod is ready and none of its volumes are being updated in place (see the sketch after this list). Wait for a replica to be ready for at least minReadySeconds before proceeding to the next replica.

  3. Whenever we check for a Pod update, also check for PVC updates, e.g.:

    • If spec.updateStrategy.type is RollingUpdate, update the PVCs in order from the largest ordinal to the smallest.
    • If spec.updateStrategy.type is OnDelete, only update the PVC when the Pod is deleted.
  4. When patching the PVC, if we are also re-creating the Pod, update the PVC after the old Pod is deleted, together with creating the new Pod. Otherwise, if the Pod is unchanged, only update the PVC.
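
A hedged Go sketch of the per-replica readiness gate from step 2; replicaReady, isPodReady, and isPVCUpdating are hypothetical helper names, and the in-progress checks are simplified.

```go
// Sketch only: hypothetical helpers illustrating "a replica is ready when its
// Pod is ready and none of its volumes are being updated in place".
package statefulset

import corev1 "k8s.io/api/core/v1"

func replicaReady(pod *corev1.Pod, claims []*corev1.PersistentVolumeClaim) bool {
	if !isPodReady(pod) {
		return false
	}
	for _, pvc := range claims {
		if isPVCUpdating(pvc) {
			return false // an expansion or VAC change is still in progress
		}
	}
	return true
}

func isPodReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// isPVCUpdating is a simplified in-place-update check: the bound capacity has
// not yet reached the requested size, or the observed VolumeAttributesClass
// differs from the requested one.
func isPVCUpdating(pvc *corev1.PersistentVolumeClaim) bool {
	want := pvc.Spec.Resources.Requests[corev1.ResourceStorage]
	if have := pvc.Status.Capacity[corev1.ResourceStorage]; have.Cmp(want) < 0 {
		return true
	}
	if vac := pvc.Spec.VolumeAttributesClassName; vac != nil {
		cur := pvc.Status.CurrentVolumeAttributesClassName
		return cur == nil || *cur != *vac
	}
	return false
}
```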

Failure cases: don't leave too many PVCs being updated in place at the same time. We expect to update the PVCs in order.

  • If the PVC update fails, we should block the update process. If the Pod is also deleted (by the controller or manually), don't block the creation of the new Pod. We should retry and report events for this. The events and status should look like those emitted when Pod creation fails.

  • While waiting for the PVC to reach a compatible state, we should update the status, just like what we do when waiting for a Pod to become ready. We should block the update process if the PVC never becomes compatible.

  • If the volumeClaimTemplates is updated again while the previous rollout is blocked, then, similar to Pods, the user may need to deal with the blocking PVCs manually (update or delete them).

What PVC is compatible

A PVC is compatible with the template if (see the sketch after this list):

  • All the immutable fields match exactly; and
  • metadata.labels and metadata.annotations of the PVC are a superset of the template's; and
  • status.capacity.storage of the PVC is greater than or equal to spec.resources.requests.storage of the template; and
  • status.currentVolumeAttributesClassName of the PVC is equal to spec.volumeAttributesClassName of the template.
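
A hedged Go sketch of this compatibility rule; isCompatible and isSuperset are hypothetical helpers, and the normalization of the mutable fields is simplified.

```go
// Sketch only: hypothetical helpers mirroring the compatibility rule above.
package statefulset

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

func isCompatible(template, pvc *corev1.PersistentVolumeClaim) bool {
	// Immutable fields must match exactly: compare the specs with the mutable
	// fields (and binder-owned fields) normalized away.
	t, p := template.Spec.DeepCopy(), pvc.Spec.DeepCopy()
	t.Resources.Requests, p.Resources.Requests = nil, nil
	t.VolumeAttributesClassName, p.VolumeAttributesClassName = nil, nil
	p.VolumeName = "" // filled in by the volume binder, not by the template
	if !equality.Semantic.DeepEqual(t, p) {
		return false
	}
	// Labels and annotations of the PVC must be a superset of the template's.
	if !isSuperset(pvc.Labels, template.Labels) || !isSuperset(pvc.Annotations, template.Annotations) {
		return false
	}
	// The bound capacity must cover the requested storage.
	want := template.Spec.Resources.Requests[corev1.ResourceStorage]
	if have := pvc.Status.Capacity[corev1.ResourceStorage]; have.Cmp(want) < 0 {
		return false
	}
	// The observed VolumeAttributesClass must equal the requested one.
	if vac := template.Spec.VolumeAttributesClassName; vac != nil {
		cur := pvc.Status.CurrentVolumeAttributesClassName
		return cur != nil && *cur == *vac
	}
	return true
}

func isSuperset(super, sub map[string]string) bool {
	for k, v := range sub {
		if super[k] != v {
			return false
		}
	}
	return true
}
```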

User Stories (Optional)

Story 1: Batch Expand Volumes

We run a CI/CD system and want end-to-end automation. To expand the volumes managed by a StatefulSet, we can use the same pipeline we already use to update the Pods. All the existing test, review, approval, and rollback processes can be reused.

Story 2: Shrinking the PV by Re-creating the PVC

After running our app for a while, we optimize the data layout and reduce the required storage size. Now we want to shrink the PVs to save cost. We cannot afford any downtime, so we don't want to delete and recreate the StatefulSet. We also don't have the infrastructure to migrate between two StatefulSets. Our app can automatically rebuild the data in the new storage from other replicas. So we update the volumeClaimTemplates of the StatefulSet, delete the PVC and Pod of one replica, let the controller re-create them, then monitor the rebuild process. Once the rebuild completes successfully, we proceed to the next replica.

Story 3: Asymmetric Replicas

The storage requirements of different replicas are not identical, so we still want to update each PVC manually and separately. We may also update the volumeClaimTemplates for new replicas, but we don't want the controller to interfere with the existing replicas.

Notes/Constraints/Caveats (Optional)

When designing the InPlace update strategy, we update PVCs the same way we re-create Pods, i.e. we update a PVC whenever we would re-create the Pod, and we wait for the PVC to become compatible whenever we would wait for the Pod to become available.

The StatefulSet controller should also keep the current and updated revisions of the volumeClaimTemplates, so that a StatefulSet can still re-create Pods and PVCs that are yet to be updated.

Risks and Mitigations

TODO: Recover from a failed in-place update (insufficient storage, etc.). What else is needed in addition to reverting the StatefulSet spec?

Design Details

We can use Server Side Apply to patch the PVCs, so that we do not interfere with the user's manual changes, e.g. to metadata.labels and metadata.annotations. A sketch of such an apply call follows.
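
A hedged client-go sketch of applying the template's labels and annotations with Server Side Apply; the function and field-manager names are illustrative.

```go
// Sketch only: illustrates a Server Side Apply patch that owns just the keys
// the template declares; the function and field-manager names are illustrative.
package statefulset

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// applyTemplateMetadata applies the template's labels and annotations onto an
// existing PVC. Keys set by users under other field managers are not touched
// or removed.
func applyTemplateMetadata(ctx context.Context, client kubernetes.Interface,
	namespace, name string, labels, annotations map[string]string) error {

	claim := corev1ac.PersistentVolumeClaim(name, namespace).
		WithLabels(labels).
		WithAnnotations(annotations)

	_, err := client.CoreV1().PersistentVolumeClaims(namespace).Apply(ctx, claim,
		metav1.ApplyOptions{FieldManager: "statefulset-controller", Force: true})
	return err
}
```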

A new invariant is established for PVCs: if a Pod carries the revision A label, each of its PVCs either does not exist yet or has been updated to revision A.

Test Plan

[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • <package>: <date> - <test coverage>
Integration tests
  • :
e2e tests
  • :

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: StatefulSetUpdateVolumeClaimTemplate
    • Components depending on the feature gate:
      • kube-apiserver
      • kube-controller-manager
Does enabling the feature change any default behavior?

Updates to the StatefulSet volumeClaimTemplates will be accepted by the API server, whereas they were previously rejected.

Otherwise, no. If volumeClaimUpdatePolicy is OnDelete (the default value), the behavior of the StatefulSet controller is almost the same as before.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Since the volumeClaimTemplates can already differ from the actual PVCs today, disabling this feature gate should not leave any inconsistent state.

If the volumeClaimTemplates is updated, then the feature is disabled and the StatefulSet is rolled back, the volumeClaimTemplates will be kept at the latest version and its history will be lost.

What happens if we reenable the feature if it was previously rolled back?

If volumeClaimUpdatePolicy is already set to InPlace, re-enabling the feature will kick off the update process immediately.

Are there any tests for feature enablement/disablement?

Unit tests will be added for the StatefulSet controller with and without the feature gate, with volumeClaimUpdatePolicy set to InPlace and OnDelete respectively.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
  • PATCH StatefulSet
    • kubectl or other user agents
  • PATCH PersistentVolumeClaim
    • 1 per updated PVC in the StatefulSet (number of updated claim templates × replicas)
    • StatefulSet controller (in KCM)
    • triggered by the StatefulSet spec update
  • PATCH StatefulSet status
    • 1-2 per updated PVC in the StatefulSet (number of updated claim templates × replicas)
    • StatefulSet controller (in KCM)
    • triggered by the StatefulSet spec update and PVC status update
Will enabling / using this feature result in introducing new API types?

No

Will enabling / using this feature result in any new calls to the cloud provider?

Not directly. The cloud provider may be called when the PVCs are updated.

Will enabling / using this feature result in increasing size or count of the existing API objects?

StatefulSet:

  • spec: 2 new enum fields, ~10B
  • status: 4 new integer fields, ~10B
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

The StatefulSet controller logic becomes more complex, so it will use more CPU. TODO: measure the actual increase.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Extensively validate the updated volumeClaimTemplates

KEP-0661 proposes that we do extensive validation on the updated volumeClaimTemplates, e.g. prevent decreasing the storage size, or prevent expansion if the storage class does not support it. However, this has several drawbacks:

  • If we disallow decreasing, editing becomes a one-way road. If a user edits the template and then finds it was a mistake, there is no way back; the StatefulSet will be broken forever. If this happens, updates to the Pods will also be blocked. This is not acceptable.
  • To mitigate the above issue, we would want to prevent the user from going down this one-way road by mistake. We would be forced to do many more validations in the API server, which is very complex and fragile (please see KEP-0661). For example: check the storage class allowVolumeExpansion, check each PVC's storage class and size, basically duplicating all the validations we already do for PVCs. And even if we do all the validations, there are still race conditions and asynchronous failures that are impossible to catch. I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
  • Validation would mean disallowing rollback of the storage size. If we allow it later, the change can surprise users, if it is not outright a breaking change.
  • The validation conflicts with the RecoverVolumeExpansionFailure feature, although that feature is still alpha.
  • volumeClaimTemplates is also used when creating new PVCs, so even if the existing PVCs cannot be updated, a user may still want the change to affect new PVCs.
  • It violates the high-level design. The template describes a desired final state, rather than an immediate instruction, and a lot of things can happen externally after we update the template. For example, an IaaS platform might kubectl apply one updated StatefulSet plus one new StorageClass to the cluster to trigger the expansion of PVs. We don't want to reject it just because the StorageClass is applied after the StatefulSet.

Support for updating arbitrary fields in volumeClaimTemplates

There are no technical limitations. We just want to be careful and keep the changes small so that we can move faster. This is only an extra validation in the API server; we may remove it later if we find it is not needed.

Patch PVC size regardless of the immutable fields

We propose to patch the PVC as a whole, so the patch can only succeed if the immutable fields match.

If only expansion were supported, patching regardless of the immutable fields could be a logical choice. But this KEP also integrates with VolumeAttributesClass (VAC), and VAC is closely coupled with the storage class. Only patching the VAC when the storage class matches is the logical choice, and we prefer to follow the same operation model for all mutable fields.

Support for automatically skipping unmanaged PVCs

Introduce a new field in StatefulSet spec.updateStrategy.rollingUpdate: volumeClaimSyncStrategy. If it is set to Async, we skip patching the PVCs that are not managed by the StatefulSet (e.g. the StorageClass does not match).

The rules to determine which PVCs are managed are a little tricky. We would have to check each field and decide what to do for each one, which couples us deeply with the PVC implementation.

And still, we want to keep the changes small.

Reconcile all PVCs regardless of Pod revision labels

Like Pods, we only update the PVCs of replicas whose Pod revision label is not the update revision.

We would need to unmarshal all revisions used by Pods to determine the desired PVC spec. Even if we did so, we don't want to send an apply request for each PVC at every reconcile iteration. We also don't want to replicate the SSA merging/extraction and validation logic, which can be complex and CPU-intensive.

Treat all incompatible PVCs as unavailable replicas

Currently, incompatible PVCs only block the rolling update, not scaling up or down. Only the update revision is used for the check.

We would need to unmarshal all revisions used by Pods to determine compatibility. Even if we did so, old StatefulSets do not have claim info in their history. If we just used the latest version, all replicas might suddenly become unavailable and all operations would be blocked.

Infrastructure Needed (Optional)