- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Extensively validate the updated `volumeClaimTemplates`
- Support for updating arbitrary fields in `volumeClaimTemplates`
- Patch PVC size regardless of the immutable fields
- Support for automatically skip not managed PVCs
- Reconcile all PVCs regardless of Pod revision labels
- Treat all incompatible PVCs as unavailable replicas
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Kubernetes currently does not support modification of the `volumeClaimTemplates` of a StatefulSet.

This enhancement proposes to support modifications to the `volumeClaimTemplates`, automatically patching the associated PersistentVolumeClaim objects where applicable. Currently, PVC `spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations` can be patched. All updates to PersistentVolumeClaim objects can be coordinated with Pod updates to honor any dependencies between them.
Currently there is very little that users can do to update the volumes of their existing StatefulSet deployments. They can only expand the volumes, or modify them with a VolumeAttributesClass, by updating individual PersistentVolumeClaim objects as an ad-hoc operation. When the StatefulSet scales up, the new PVC(s) will be created with the old configuration, which again needs manual intervention. This brings many headaches in a continuously evolving environment.
- Allow users to update some fields of `volumeClaimTemplates` of a StatefulSet.
- Automatically patch the associated PersistentVolumeClaim objects, without interrupting the running Pods.
- Support updating PersistentVolumeClaim objects with the `OnDelete` strategy.
- Coordinate updates to Pod and PersistentVolumeClaim objects.
- Provide accurate status and error messages to users when the update fails.

- Support automatic re-creation of PersistentVolumeClaim objects. We will never delete a PVC automatically.
- Validate the updated `volumeClaimTemplates` the way a PVC patch does.
- Update ephemeral volumes.
- Patch PVCs that differ from the template, e.g. pre-existing PVCs adopted by the StatefulSet.
- Change the API server to allow specific updates to `volumeClaimTemplates` of a StatefulSet:
  - `labels`
  - `annotations`
  - `resources.requests.storage`
  - `volumeAttributesClassName`
- Modify the StatefulSet controller to add PVC reconciliation logic.
- Collect the status of managed PVCs, and show them in the StatefulSet status.
Changes to StatefulSet `spec`:

Introduce a new field in the StatefulSet `spec`: `volumeClaimUpdatePolicy`, which specifies how to coordinate the update of PVCs and Pods. Possible values are:

- `OnDelete`: the default value; only update the PVC when the old PVC is deleted.
- `InPlace`: patch the PVC in place if possible. Also includes the `OnDelete` behavior.
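A rough sketch of how the new field might be expressed in the Go API types; the type and constant names below are illustrative assumptions, not the final API:

```go
// VolumeClaimUpdatePolicyType describes how PVC updates are coordinated
// with Pod updates (sketch; exact naming is illustrative).
type VolumeClaimUpdatePolicyType string

const (
	// OnDeleteVolumeClaimUpdatePolicy applies the updated template only
	// when a new PVC is created after the old one has been deleted.
	OnDeleteVolumeClaimUpdatePolicy VolumeClaimUpdatePolicyType = "OnDelete"
	// InPlaceVolumeClaimUpdatePolicy additionally patches existing PVCs
	// in place when only mutable fields differ from the template.
	InPlaceVolumeClaimUpdatePolicy VolumeClaimUpdatePolicyType = "InPlace"
)

type StatefulSetSpec struct {
	// ...existing fields...

	// volumeClaimUpdatePolicy specifies how updates to volumeClaimTemplates
	// are propagated to existing PVCs. Defaults to OnDelete.
	// +optional
	VolumeClaimUpdatePolicy VolumeClaimUpdatePolicyType `json:"volumeClaimUpdatePolicy,omitempty"`
}
```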
Changes to StatefulSet `status`:

Additionally collect the status of managed PVCs, and show them in the StatefulSet status. For each PVC template:

- compatible: the number of PVCs that are compatible with the template. These replicas will not be blocked on Pod re-creation.
- updating: the number of PVCs that are being updated in place (e.g. expansion in progress).
- overSized: the number of PVCs that are larger than the template.
- totalCapacity: the sum of `status.capacity` of all the PVCs.
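For illustration only, the per-template counters could be surfaced in a status entry along these lines; the struct and field names are assumptions, not the final API:

```go
import "k8s.io/apimachinery/pkg/api/resource"

// VolumeClaimTemplateStatus summarizes the PVCs created from one entry of
// volumeClaimTemplates (sketch; names are illustrative assumptions).
type VolumeClaimTemplateStatus struct {
	// templateName is the name of the corresponding volumeClaimTemplates entry.
	TemplateName string `json:"templateName"`
	// compatibleReplicas counts PVCs that are compatible with the template.
	CompatibleReplicas int32 `json:"compatibleReplicas"`
	// updatingReplicas counts PVCs currently being updated in place.
	UpdatingReplicas int32 `json:"updatingReplicas"`
	// overSizedReplicas counts PVCs that are larger than the template.
	OverSizedReplicas int32 `json:"overSizedReplicas"`
	// totalCapacity is the sum of status.capacity over all PVCs.
	TotalCapacity resource.Quantity `json:"totalCapacity"`
}
```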
Some fields in the `status` are also updated to reflect the status of the PVCs:

- readyReplicas: in addition to Pods, also consider the PVC status. A PVC is not ready if `volumeClaimUpdatePolicy` is `InPlace` and the PVC is still being updated.
- availableReplicas: the total number of replicas for which both the Pod and the PVCs have been ready for at least `minReadySeconds`.
- currentRevision, updateRevision, currentReplicas, updatedReplicas are updated to also reflect the status of the PVCs.

With these changes, users can still use `kubectl rollout status` to monitor the update process, both for automated patching and for PVCs that need manual intervention.
How to update PVCs:

- If `volumeClaimUpdatePolicy` is `InPlace`, and `volumeClaimTemplates` and the actual PVC only differ in mutable fields (currently `spec.resources.requests.storage`, `spec.volumeAttributesClassName`, `metadata.labels`, and `metadata.annotations`), patch the PVC to the extent possible:
  - `spec.resources.requests.storage` is patched to max(template spec, PVC status). Do not decrease the storage size below its current status. Note that decreasing the size in the PVC spec can help recover from a failed expansion if the `RecoverVolumeExpansionFailure` feature gate is enabled.
  - `spec.volumeAttributesClassName` is patched to the template value.
  - `metadata.labels` and `metadata.annotations` are patched with server-side apply.
- If it is not possible to make the PVC compatible, do nothing. But when re-creating a Pod whose corresponding PVC is being deleted, wait for the deletion and then create a new PVC together with the new Pod (already implemented).
- Use either the current or the updated revision of the `volumeClaimTemplates` to create/update the PVC, just like the Pod template.
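The size clamping rule above can be summarized by a tiny helper; this is a hedged sketch, not controller code, and the function name is hypothetical:

```go
import "k8s.io/apimachinery/pkg/api/resource"

// desiredStorageRequest returns the storage request to patch onto a PVC:
// the template's request, but never below the capacity already reported in
// the PVC status, so an in-place patch never tries to shrink the volume.
func desiredStorageRequest(templateRequest, currentCapacity resource.Quantity) resource.Quantity {
	if currentCapacity.Cmp(templateRequest) > 0 {
		return currentCapacity
	}
	return templateRequest
}
```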
When to update PVCs:

- Before advancing `status.updatedReplicas` to the next replica, check that the PVCs of the next replica are compatible with the new `volumeClaimTemplates`. If not, and if we are not going to patch them automatically, wait for the user to delete/update the old PVC manually.
- When doing a rolling update, a replica is considered ready if the Pod is ready and none of its volumes are being updated in place. Wait for a replica to be ready for at least `minReadySeconds` before proceeding to the next replica.
- Whenever we check for a Pod update, also check for PVC updates, e.g.:
  - If `spec.updateStrategy.type` is `RollingUpdate`, update the PVCs in order from the largest ordinal to the smallest (see the sketch after this list).
  - If `spec.updateStrategy.type` is `OnDelete`, only update the PVC when the Pod is deleted.
- When patching the PVC, if we also re-create the Pod, update the PVC after the old Pod is deleted, together with creating the new Pod. Otherwise, if the Pod is not changed, only update the PVC.
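The ordinal walk mirrors how Pod rolling updates already proceed; a minimal sketch, where the function name and the `pvcNeedsUpdate` callback are hypothetical:

```go
// nextReplicaToUpdate walks ordinals from the largest down to 0 and returns
// the first replica whose PVCs still need to be reconciled against the
// updated template, mirroring the Pod rolling-update order.
func nextReplicaToUpdate(replicas int32, pvcNeedsUpdate func(ordinal int32) bool) (int32, bool) {
	for ordinal := replicas - 1; ordinal >= 0; ordinal-- {
		if pvcNeedsUpdate(ordinal) {
			return ordinal, true
		}
	}
	return 0, false
}
```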
Failure cases: don't leave too many PVCs being updated in place. We expect to update the PVCs in order.

- If the PVC update fails, we should block the update process. If the Pod is also deleted (by the controller or manually), don't block the creation of the new Pod. We should retry and report events for this. The events and status should look like those emitted when Pod creation fails.
- While waiting for the PVC to reach a compatible state, we should update the status, just like what we do when waiting for a Pod to be ready. We should block the update process if the PVC never becomes compatible.
- If the `volumeClaimTemplates` is updated again while the previous rollout is blocked, then, similar to Pods, the user may need to manually deal with the blocking PVCs (update or delete them).
A PVC is compatible with the template if:

- all the immutable fields match exactly; and
- `metadata.labels` and `metadata.annotations` of the PVC are a superset of the template's; and
- `status.capacity.storage` of the PVC is greater than or equal to `spec.resources.requests.storage` of the template; and
- `status.currentVolumeAttributesClassName` of the PVC is equal to `spec.volumeAttributesClassName` of the template.
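For illustration, the check might look roughly like the following sketch; `immutableFieldsMatch` is a hypothetical helper standing in for the field-by-field comparison of immutable spec fields:

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

// isSupersetOf reports whether every key/value pair in sub is present in super.
func isSupersetOf(super, sub map[string]string) bool {
	for k, v := range sub {
		if super[k] != v {
			return false
		}
	}
	return true
}

// isCompatible reports whether an existing PVC already satisfies the updated
// template, per the rules listed above (sketch; immutableFieldsMatch is a
// hypothetical helper comparing the immutable spec fields).
func isCompatible(pvc, tmpl *v1.PersistentVolumeClaim) bool {
	if !immutableFieldsMatch(&pvc.Spec, &tmpl.Spec) {
		return false
	}
	if !isSupersetOf(pvc.Labels, tmpl.Labels) || !isSupersetOf(pvc.Annotations, tmpl.Annotations) {
		return false
	}
	capacity := pvc.Status.Capacity[v1.ResourceStorage]
	request := tmpl.Spec.Resources.Requests[v1.ResourceStorage]
	if capacity.Cmp(request) < 0 {
		return false
	}
	// The VolumeAttributesClass must already be applied, not merely requested.
	return ptr.Deref(pvc.Status.CurrentVolumeAttributesClassName, "") ==
		ptr.Deref(tmpl.Spec.VolumeAttributesClassName, "")
}
```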
We're running a CI/CD system and end-to-end automation is desired. To expand the volumes managed by a StatefulSet, we can just use the same pipeline that we already use to update the Pods. All the test, review, approval, and rollback processes can be reused.
After running our app for a while, we optimize the data layout and reduce the required storage size. Now we want to shrink the PVs to save cost. We cannot afford any downtime, so we don't want to delete and re-create the StatefulSet. We also don't have the infrastructure to migrate between two StatefulSets. Our app can automatically rebuild the data in the new storage from other replicas. So we update the `volumeClaimTemplates` of the StatefulSet, delete the PVC and Pod of one replica, let the controller re-create them, then monitor the rebuild process. Once the rebuild completes successfully, we proceed to the next replica.
The storage requirements of different replicas are not identical, so we still want to update each PVC manually and separately. Possibly we also update the `volumeClaimTemplates` for new replicas, but we don't want the controller to interfere with the existing replicas.
When designing the `InPlace` update strategy, we update the PVC similarly to how we re-create the Pod: we update the PVC whenever we would re-create the Pod, and we wait for the PVC to be compatible whenever we would wait for the Pod to be available. The StatefulSet controller should also keep the current and updated revisions of the `volumeClaimTemplates`, so that a StatefulSet can still re-create Pods and PVCs that are yet to be updated.
TODO: Recover from a failed in-place update (insufficient storage, etc.). What else is needed in addition to reverting the StatefulSet spec?
We can use Server Side Apply to patch the PVCs, so that we will not interfere with the user's manual changes, e.g. to `metadata.labels` and `metadata.annotations`.
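A hedged sketch of what such a patch could look like with the typed client's apply support; the helper signature and the field manager name are assumptions for illustration:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1apply "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// applyTemplateMetadata server-side-applies only the labels and annotations
// that originate from the volumeClaimTemplates, under a dedicated field
// manager, so keys added manually by users are left untouched.
func applyTemplateMetadata(ctx context.Context, c kubernetes.Interface, ns, name string,
	labels, annotations map[string]string) error {
	cfg := corev1apply.PersistentVolumeClaim(name, ns).
		WithLabels(labels).
		WithAnnotations(annotations)
	_, err := c.CoreV1().PersistentVolumeClaims(ns).Apply(ctx, cfg, metav1.ApplyOptions{
		FieldManager: "statefulset-controller", // illustrative manager name
		Force:        true,
	})
	return err
}
```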
New invariants established about PVCs: if a Pod has the revision A label, all its PVCs either do not exist yet, or are updated to revision A.
[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
<package>
:<date>
-<test coverage>
- :
- :
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: StatefulSetUpdateVolumeClaimTemplate
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-controller-manager
The update to StatefulSet `volumeClaimTemplates` will be accepted by the API server, whereas it was previously rejected. Otherwise, no.
If `volumeClaimUpdatePolicy` is `OnDelete` (the default value), the behavior of the StatefulSet controller is almost the same as before.
Yes. Since the `volumeClaimTemplates` can already differ from the actual PVCs today, disabling this feature gate should not leave any inconsistent state. If the `volumeClaimTemplates` is updated, then the feature is disabled and the StatefulSet is rolled back, the `volumeClaimTemplates` will be kept at the latest version, and its history will be lost.
If the `volumeClaimUpdatePolicy` is already set to `InPlace`, re-enabling the feature will kick off the update process immediately.
We will add unit tests for the StatefulSet controller with and without the feature gate, with `volumeClaimUpdatePolicy` set to `InPlace` and `OnDelete` respectively.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
- PATCH StatefulSet
  - kubectl or other user agents
- PATCH PersistentVolumeClaim
  - 1 per updated PVC in the StatefulSet (number of updated claim templates * replicas)
  - StatefulSet controller (in KCM)
  - triggered by the StatefulSet spec update
- PATCH StatefulSet status
  - 1-2 per updated PVC in the StatefulSet (number of updated claim templates * replicas)
  - StatefulSet controller (in KCM)
  - triggered by the StatefulSet spec update and PVC status update
No
Not directly. The cloud provider may be called when the PVCs are updated.
StatefulSet:
- `spec`: 2 new enum fields, ~10B
- `status`: 4 new integer fields, ~10B
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
The logic of the StatefulSet controller is more complex, so more CPU will be used. TODO: measure the actual increase.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
KEP-0661 proposes that we should do extensive validation on the updated `volumeClaimTemplates`, e.g., preventing decreasing the storage size, or preventing expansion if the storage class does not support it.

However, this has several drawbacks:

- If we disallow decreasing, we make editing a one-way road. If a user edits it and then finds it was a mistake, there is no way back. The StatefulSet will be broken forever. If this happens, the updates to Pods will also be blocked. This is not acceptable.
- To mitigate the above issue, we would want to prevent the user from going down this one-way road by mistake. We are then forced to do far more validation in the API server, which is very complex and fragile (please see KEP-0661). For example: check the storage class's allowVolumeExpansion, check each PVC's storage class and size, basically duplicating all the validations we have done for PVCs. And even if we do all the validations, there are still race conditions and async failures that are impossible for us to catch. I see this as a major drawback of KEP-0661 that I want to avoid in this KEP.
- Validation means we would have to disable rollback of the storage size. If we enable it later, it can surprise users, if it is not considered a breaking change.
- The validation conflicts with the RecoverVolumeExpansionFailure feature, although that feature is still alpha.
- `volumeClaimTemplates` is also used when creating new PVCs, so even if the existing PVCs cannot be updated, a user may still want to affect new PVCs.
- It violates the high-level design. The template describes a desired final state, rather than an immediate instruction. A lot of things can happen externally after we update the template. For example, I have an IaaS platform which tries to kubectl apply one updated StatefulSet + one new StorageClass to the cluster to trigger the expansion of PVs. We don't want to reject the StatefulSet just because the StorageClass is applied after it.
No technical limitations. Just that we want to be careful and keep the changes small, so that we can move faster. This is just an extra validation in APIServer. We may remove it later if we find it is not needed.
We propose to patch the PVC as a whole, so it can only succeed if the immutable fields match.

If only expansion were supported, patching regardless of the immutable fields could be a logical choice. But this KEP also integrates with VAC, and VAC is closely coupled with the storage class. Only patching VAC if the storage class matches is a very logical choice, and we'd better follow the same operation model for all mutable fields.
Introduce a new field in StatefulSet `spec.updateStrategy.rollingUpdate`: `volumeClaimSyncStrategy`. If it is set to `Async`, then we skip patching the PVCs that are not managed by the StatefulSet (e.g. the StorageClass does not match).

The rules to determine which PVCs are managed are a little bit tricky. We have to check each field, and determine what to do for each field. This makes us deeply coupled with the PVC implementation. And still, we want to keep the changes small.
Like Pods, we only update the PVCs if the Pod revision label is not the update revision.

We would need to unmarshal all revisions used by Pods to determine the desired PVC spec. Even if we did so, we don't want to send an apply request for each PVC at each reconcile iteration. We also don't want to replicate the SSA merging/extraction and validation logic, which can be complex and CPU-intensive.
Currently, incompatible PVCs only block the rolling update, not scaling up or down. Only the update revision is used for checking.

We would need to unmarshal all revisions used by Pods to determine compatibility. Even if we did so, old StatefulSets do not have claim info in their history. If we just used the latest version, then all replicas might suddenly become unavailable, and all operations would be blocked.