KEP-1287: InPlacePodVerticalScaling BETA update #4704

Merged · 23 commits · Oct 10, 2024
146 changes: 89 additions & 57 deletions keps/sig-node/1287-in-place-update-pod-resources/README.md
@@ -676,6 +676,26 @@ Other components:
* check how the change of meaning of resource requests influence other
Kubernetes components.

### Instrumentation

The following new metric will be added to track total resize requests, counted at the pod level. In
other words, a single pod update that changes multiple containers and/or resources counts as a single
resize request.

`kubelet_container_resize_requests_total` - Total number of resize requests observed by the Kubelet.

Label: `state` - Counts resize request state transitions. This closely tracks the [Resize status](#resize-status) state transitions, omitting `InProgress`. Possible values:
- `proposed` - Initial request state.
- `infeasible` - Resize request cannot be completed.
- `deferred` - Resize request cannot currently be completed, but will be retried.
- `completed` - Resize operation completed successfully (`spec.Resources == status.Allocated == status.Resources`).
- `canceled` - Pod was terminated before the resize completed, or a new resize request was started.

In steady state, `proposed` should equal `infeasible + completed + canceled`.

The metric is recorded as a counter instead of a gauge to ensure that usage can be tracked over
time, irrespective of scrape interval.
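
For illustration, a minimal sketch of how such a counter could be defined and incremented. This uses the plain `prometheus/client_golang` API rather than the Kubelet's component-base metrics wrappers, and the `recordResizeState` helper is hypothetical; only the metric and label names come from the definition above.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// containerResizeRequestsTotal mirrors the kubelet_container_resize_requests_total
// counter described above, labeled by the resize request state.
var containerResizeRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kubelet_container_resize_requests_total",
		Help: "Total number of resize requests observed by the Kubelet.",
	},
	[]string{"state"},
)

func init() {
	prometheus.MustRegister(containerResizeRequestsTotal)
}

// recordResizeState is a hypothetical helper: it would be called once per pod-level
// state transition (proposed, deferred, infeasible, completed, canceled).
func recordResizeState(state string) {
	containerResizeRequestsTotal.WithLabelValues(state).Inc()
}
```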

### Future Enhancements

1. Kubelet (or Scheduler) evicts lower priority Pods from Node to make room for
@@ -855,8 +875,8 @@ TODO: Identify more cases
- ContainerStatus API changes are done. Tests are ready but not enforced.

#### Beta
- VPA alpha integration of feature completed and any bugs addressed,
- E2E tests covering Resize Policy, LimitRanger, and ResourceQuota are added,
- VPA alpha integration of feature completed and any bugs addressed.
- E2E tests covering Resize Policy, LimitRanger, and ResourceQuota are added.
- Negative tests are identified and added.
- A "/resize" subresource is defined and implemented.
- Pod-scoped resources are handled if that KEP is past alpha
@@ -865,7 +885,7 @@ TODO: Identify more cases

#### Stable
- VPA integration of feature moved to beta,
- User feedback (ideally from atleast two distinct users) is green,
- User feedback (ideally from at least two distinct users) is green,
- No major bugs reported for three months.
- Pod-scoped resources are handled if that KEP is past alpha

@@ -962,20 +982,18 @@ _This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: InPlacePodVerticalScaling
- Feature gate name: `InPlacePodVerticalScaling`
- Components depending on the feature gate: kubelet, kube-apiserver, kube-scheduler
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
plane? No.
- Will enabling / disabling the feature require downtime or reprovisioning
of a node? No.

* **Does enabling the feature change any default behavior?** No
* **Does enabling the feature change any default behavior?**

- Kubelet sets several pod status fields: `AllocatedResources`, `Resources`

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?** Yes

- The feature should not be disabled on a running node (create a new node instead).
Contributor: Change to "No"?

Suggested change:
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?** Yes
- The feature should not be disabled on a running node (create a new node instead).
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?** No
- The feature should not be disabled on a running node (create a new node instead).

Contributor: BTW, what does "disable on a running node" mean? Can this feature be turned on/off on a single node?

Member (Author): TBH, I don't remember why I added this. I don't think it should be a problem to disable it, although there are some potential race conditions around it.

Member (@vinaykul, Sep 21, 2024): This is a PRR ask. The feature gate allows you to enable/disable it at will; the kubelet gets the feature flag from config or a CLI parameter. There was one bug around hash calculation that would restart containers, even those not using the feature, but that has been fixed. Other races might exist and will hopefully be discovered as it bakes in alpha/beta pre-prod deployments.

Contributor: PRR reviewer here: I think I'm following the discussion but want to be sure... Can this be made to work without creating new nodes? If so, I think we're all set from a PRR perspective. I don't think we can answer this with "no" and consider the feature production ready.

Member (Author): OK, I removed the statement that the node needs to be recreated or drained. However, if there are pending or in-progress resizes, it's possible pods could be left in an unknown state, where depending on exactly when the restart happened they could be finished resizing, not resized, or partially resized, and the pod API will not accurately reflect the state.

Contributor: Sounds like there are enough gotchas here that we should retain some of this in the answer. I asked the PRR leads about this situation, and one piece of feedback was that we shouldn't assume clusters have disposable nodes, so prescribing that nodes should be drained here would be appropriate, but prescribing that new nodes be created would not be.


* **What happens if we reenable the feature if it was previously rolled back?**
- API will once again permit modification of Resources for 'cpu' and 'memory'.
- Actual resources applied will be reflected in the Pod's ContainerStatuses (see the sketch below).
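
As a sketch of how a client could observe the actually-applied resources after re-enablement, using client-go; the namespace `default` and pod name `test-pod` are hypothetical, and error handling is trimmed for brevity:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location.
	config, _ := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	client, _ := kubernetes.NewForConfig(config)

	pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "test-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cs := range pod.Status.ContainerStatuses {
		// With InPlacePodVerticalScaling enabled, the Kubelet reports the resources
		// actually applied to the running container in these status fields.
		fmt.Printf("%s: allocated=%v actual=%v\n", cs.Name, cs.AllocatedResources, cs.Resources)
	}
}
```
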
@@ -990,69 +1008,84 @@ _This section must be completed when targeting alpha to a release._
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

- Failure scenarios are already covered by the version skew strategy.

* **What specific metrics should inform a rollback?**

- Scheduler indicators:
- `scheduler_pending_pods`
- `scheduler_pod_scheduling_attempts`
- `scheduler_pod_scheduling_duration_seconds`
- `scheduler_unschedulable_pods`
- Kubelet indicators:
- `kubelet_pod_worker_duration_seconds`
- `kubelet_runtime_operations_errors_total{operation_type=update_container}`


* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.

Testing plan (a sketch of the resize request used in steps 3, 6, and 9 follows the list):

1. Create test pod
2. Upgrade API server
3. Attempt resize of test pod
- Expected outcome: resize is rejected (see version skew section for details)
4. Create upgraded node
5. Create second test pod, scheduled to upgraded node
6. Attempt resize of second test pod
- Expected outcome: resize successful
7. Delete upgraded node
8. Restart API server with feature disabled
- Ensure original test pod is still running
9. Attempt resize of original test pod
- Expected outcome: request rejected by apiserver
10. Restart API server with feature enabled
- Verify original test pod is still running
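
A sketch of the resize request issued in steps 3, 6, and 9, using client-go. The namespace, pod name, container name, and resource value are hypothetical; on clusters where the `/resize` subresource from the beta criteria is implemented, the same patch would target that subresource instead.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, _ := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	client, _ := kubernetes.NewForConfig(config)

	// Strategic merge patch bumping the CPU request of container "app" in place.
	// The containers list is merged by container name.
	patch := []byte(`{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"200m"}}}]}}`)

	_, err := client.CoreV1().Pods("default").Patch(
		context.TODO(), "test-pod", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		// In steps 3 and 9 the API server is expected to reject this patch;
		// in step 6 it should succeed.
		panic(err)
	}
}
```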

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.

No.

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.

Metric: `kubelet_container_resize_requests_total` (see [Instrumentation](#instrumentation))

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
- [x] Metrics
- Metric name: `kubelet_container_resize_requests_total`
- Components exposing the metric: kubelet
- Metric name: `runtime_operations_duration_seconds{operation_type=container_update}`
- Components exposing the metric: kubelet
- Metric name: `runtime_operations_errors_total{operation_type=container_update}`
- Components exposing the metric: kubelet

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
At a high level, this usually will be in the form of "high percentile of SLI
per day <= X". It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code

- Using `kubelet_container_resize_requests_total`, `completed + infeasible + canceled` request count
should approach `proposed` request count in steady state.
- Resource update operations should complete quickly (`runtime_operations_duration_seconds{operation_type=container_update} < X` for 99% of requests)
- Resource update error rate should be low (`runtime_operations_errors_total{operation_type=container_update}/runtime_operations_total{operation_type=container_update}`)

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).

- Kubelet admission rejections: https://github.com/kubernetes/kubernetes/issues/125375
- Resize operation duration (time from the Kubelet seeing the request to actuating the changes): this would require persisting more state about when the resize was first observed.

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.

For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:

Compatible container runtime (see [CRI changes](#cri-changes)).

Comment: Perhaps also addition of a RuntimeHandlerFeature for resize? As commented here.

Contributor: that's more for the underlying OCI runtimes to report features they have available. We could maybe add features to the runtime status object, though I think the runtimes have been compatible for a while, unless the design is changing.

Contributor: there's now RuntimeFeature, which we could use if we wanted to.


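Returning to the container runtime dependency above: a hypothetical sketch of the CRI call the Kubelet issues when actuating an in-place resize, assuming the `k8s.io/cri-api` v1 types; the container ID and resource values are illustrative only.

```go
package cri

import (
	"context"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// updateContainerResources issues the CRI UpdateContainerResources RPC that a
// resize-compatible runtime must implement.
func updateContainerResources(ctx context.Context, conn *grpc.ClientConn, containerID string) error {
	client := runtimeapi.NewRuntimeServiceClient(conn)
	_, err := client.UpdateContainerResources(ctx, &runtimeapi.UpdateContainerResourcesRequest{
		ContainerId: containerID,
		Linux: &runtimeapi.LinuxContainerResources{
			CpuQuota:           20000,  // 200m CPU with a 100ms period
			CpuPeriod:          100000,
			MemoryLimitInBytes: 256 * 1024 * 1024,
		},
	})
	return err
}
```
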
### Scalability

@@ -1126,20 +1159,19 @@ _This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

- If the API is unavailable prior to the resize request being made, the request will not go through.
- If the API is unavailable before the Kubelet observes the resize, the request will remain pending until the Kubelet sees it.
- If the API is unavailable after the Kubelet observes the resize, then the pod status may not
accurately reflect the running pod state. The Kubelet tracks the resource state internally.

* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.

- TBD

* **What steps should be taken if SLOs are not being met to determine the problem?**

- Investigate Kubelet and/or container runtime logs.

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

7 changes: 4 additions & 3 deletions keps/sig-node/1287-in-place-update-pod-resources/kep.yaml
@@ -5,6 +5,7 @@ authors:
- "@bskiba"
- "@schylek"
- "@vinaykul"
- "@tallclair"
owning-sig: sig-node
participating-sigs:
- sig-autoscaling
@@ -30,14 +31,14 @@ approvers:
see-also:
replaces:

stage: "alpha"
stage: "beta"

latest-milestone: "v1.30"
latest-milestone: "v1.31"

milestone:
alpha: "v1.27"
beta: "v1.31"
stable: "v1.32"
stable: "TBD"

feature-gates:
- name: InPlacePodVerticalScaling