Improve some formatting
chelseychen committed May 19, 2020
1 parent 7ffe0bb commit 26fb915
Showing 2 changed files with 56 additions and 20 deletions.
71 changes: 53 additions & 18 deletions keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
@@ -462,13 +462,22 @@ _This section must be completed when targeting alpha to a release._
- [x] Other
- Describe the mechanism:

(1) The API itself can be enabled / disabled at kube-apiserver level using the `--runtime-config` flag;
(2) For the use of the API, we have a fallback mechanism instead of a feature gate. That is, we simply fall back to the old Event libraries if the API is disabled.
(1) The API itself can be enabled / disabled at kube-apiserver level
using the `--runtime-config` flag;

(2) For the use of the API, we have a fallback mechanism instead of
a feature gate. That is, we simply fall back to the old Event libraries
if the API is disabled.

Currently this fallback is implemented purely in scheduler but we're
planning to move it into the library itself.
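
The fallback in (2) can be pictured with a minimal sketch. This is not the scheduler's actual code: the function name, the constants, and the idea of passing in the set of served group/versions are assumptions made purely for illustration.

```python
# Hypothetical illustration of the fallback decision; not the scheduler's code.
NEW_EVENTS_API = "events.k8s.io/v1"  # assumed group/version string for the new API
LEGACY_EVENTS_API = "v1"             # core group, i.e. the old corev1 Events


def choose_event_api(served_group_versions):
    """Return the group/version a component should record Events through.

    `served_group_versions` stands in for whatever API discovery reports
    that the apiserver currently serves.
    """
    if NEW_EVENTS_API in served_group_versions:
        return NEW_EVENTS_API
    # The new API was disabled via --runtime-config: use the old Event libraries.
    return LEGACY_EVENTS_API
```

A component would run a check like this once at startup and construct the matching recorder, which is why no feature gate is needed.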

- Will enabling / disabling the feature require downtime of the control
plane?

No.
(1) Yes, enabling the API requires restarting the apiserver.

(2) No, enabling the use of the API doesn't require that.

- Will enabling / disabling the feature require downtime or reprovisioning
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
@@ -479,15 +488,23 @@ _This section must be completed when targeting alpha to a release._
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.

While the graduation of the API itself doesn't change default behavior, migration of individual components does, as the events will be reported differently.
While the graduation of the API itself doesn't change default behavior,
migration of individual components does, as the events will be reported
differently.

* **Can the feature be disabled once it has been enabled (i.e. can we rollback
the enablement)?**
Also set `rollback-supported` to `true` or `false` in `kep.yaml`.
Describe the consequences on existing workloads (e.g. if this is runtime
feature, can it break the existing applications?).

Yes. If the new Event API is disabled, components will fall back to the original one.
Yes. If the new Event API is disabled, components will fall back to the
original one (the new events are roundtrippable with the old `corev1.Events`).

If individual components don't implement the fallback, rollback of
client-library use may not be possible: they only fall back to the old API
if the new API is disabled, so if there is a bug in the client library,
there is no way to fall back as of now.

* **What happens if we reenable the feature if it was previously rolled back?**

@@ -499,7 +516,10 @@ _This section must be completed when targeting alpha to a release._
with and without the feature are necessary. At the very least, think about
conversion tests if API types are being modified.

Manual tests will be performed to ensure things work when either enabling or disabling the new Event API.
Manual tests will be performed to ensure things work when either enabling
or disabling the new Event API.

More information in [Test Plan](#test-plan) section.

### Rollout, Upgrade and Rollback Planning

@@ -509,7 +529,8 @@ _This section must be completed when targeting beta graduation to a release._
Try to be as paranoid as possible - e.g. what if some components will restart
in the middle of rollout?

A rollout could fail if some components restart in the middle of the rollout. Then those components will continue using the old Event API.
A rollout could fail if some components restart in the middle of the rollout.
Then those components will continue using the old Event API.

* **What specific metrics should inform a rollback?**

@@ -535,14 +556,14 @@ _This section must be completed when targeting beta graduation to a release._
checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.

Whether workloads are using the API can be determined by looking at the apiserver_requests_total metric.
Whether workloads are using the API can be determined
by looking at the apiserver_requests_total metric.

* **What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?**
- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [x] Metrics
- Metric name: apiserver_requests_total
- Components exposing the metric: kube-apiserver
- [ ] Other (treat as last resort)
- Details:

@@ -555,7 +576,8 @@ _This section must be completed when targeting beta graduation to a release._
job creation time) for cron job <= 10%
- 99.9% of /health requests per day finish with 200 code

Events have always been "best-effort". We're sticking to that with the new API too, so no SLO will be introduced.
Events have always been "best-effort".
We're sticking to that with the new API too, so no SLO will be introduced.

* **Are there any missing metrics that would be useful to have to improve
observability of this feature?**
@@ -578,7 +600,7 @@ _This section must be completed when targeting beta graduation to a release._
For each of these, fill in the following, thinking both about running user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):

There aren't any dependencies for this feature.
N/A


### Scalability
@@ -594,11 +616,18 @@ previous answers based on experience in the field._
* **Will enabling / using this feature result in any new API calls?**
Describe them, providing:

In the new EventRecorder, every 30 minutes a "heartbeat" call will be performed to update Event status and prevent garbage collection in etcd. The heartbeat is sent only for events that keep recurring; if an event has not recurred for 6 minutes, it will be garbage-collected.
In the new EventRecorder, every 30 minutes a "heartbeat" call will be performed
to update Event status and prevent garbage collection in etcd. The heartbeat
is sent only for events that keep recurring; if an event has not recurred
for 6 minutes, it will be garbage-collected.
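
The two intervals can be read as a simple rule. The sketch below is one illustrative interpretation of the sentence above, not the EventRecorder's implementation; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Intervals taken from the text above; the names are hypothetical.
HEARTBEAT_INTERVAL = timedelta(minutes=30)
FINISH_AFTER = timedelta(minutes=6)


def series_is_active(last_observed: datetime, now: datetime) -> bool:
    """An event still counts as recurring if it was seen within the last 6 minutes."""
    return now - last_observed <= FINISH_AFTER


def heartbeat_due(last_heartbeat: datetime, last_observed: datetime, now: datetime) -> bool:
    """A heartbeat is sent only for active events, at most once per 30 minutes."""
    return series_is_active(last_observed, now) and now - last_heartbeat >= HEARTBEAT_INTERVAL
```

Under this reading, an event that stops recurring produces no further heartbeat calls, so the extra API traffic scales with the number of continuously recurring events.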

* **Will enabling / using this feature result in introducing new API types?**

Yes, a new API type "eventsv1.Event" is being introduced. The migration of Event API will cause creation of new types of Event objects. The number of Event objects depends on cluster state, which theoretically won't be too large due to deduplication logic and reasonable-cardinality of objects in the system.
Yes, a new API type "eventsv1.Event" is being introduced.
The migration of Event API will cause creation of new types of Event objects.
The number of Event objects depends on cluster state, which theoretically
won't be too large due to deduplication logic and reasonable-cardinality
of objects in the system.

* **Will enabling / using this feature result in any new calls to cloud
provider?**
@@ -609,7 +638,12 @@ previous answers based on experience in the field._
of the existing API objects?**
Describe them providing:

The difference in size of the Event object comes from the new Action and Related fields. We can safely estimate the increase to be smaller than 30%. We'll also emit an additional Event per Pod creation, as currently Events for that are being deduplicated. There are currently at least 6 Events emitted when a Pod is started, so the impact of this change can be bounded by 20%. This means that in the worst case the overall increase in Event data can be bounded by 56%.
The difference in size of the Event object comes from the new Action and Related
fields. We can safely estimate the increase to be smaller than 30%. We'll
also emit an additional Event per Pod creation, as currently Events for that
are being deduplicated. There are currently at least 6 Events emitted when
a Pod is started, so the impact of this change can be bounded by 20%. This means
that in the worst case the overall increase in Event data can be bounded by 56%.
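
The 56% figure follows from multiplying the two worst-case factors: the per-Event size increase (at most 30%) and the Event count increase (one extra Event on top of at least six, rounded up to 20%):

$$
\frac{1}{6} \approx 16.7\% \le 20\%, \qquad (1 + 0.30)\,(1 + 0.20) = 1.56
$$

That is, at most a 56% increase over the current volume of Event data.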

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs][]?**
@@ -619,7 +653,8 @@ previous answers based on experience in the field._
* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**

The potential increase of Event size might cause a non-negligible storage increase in etcd.
The potential increase of Event size might cause a non-negligible storage
increase in etcd.

### Troubleshooting

@@ -14,13 +14,14 @@ approvers:
- "@wojtekt"
- "@brancz"
creation-date: 2019-01-31
last-updated: 2020-05-13
last-updated: 2020-05-19
status: implementable
see-also:
replaces:
stage: stable
latest-milestone: "v1.19"
milestone:
stable: "v1.19"
rollback-supported: true
disable-supported: true
metrics:
- apiserver_requests_total
