
RFC: Staging component changes in Knative #2639

Closed
mattmoor opened this issue Dec 5, 2018 · 9 comments
Labels
area/test-and-release, kind/doc, kind/feature, kind/process, P1

@mattmoor (Member) commented Dec 5, 2018

We've now had multiple instances where we wanted to make some sort of change to Knative but couldn't simply "rip the bandaid off", because doing so would make upgrades disruptive. We need to become more disciplined about how we handle such changes.

The goal of this proposal is to keep the install/upgrade process as close to `kubectl apply -f foo.yaml` as possible, by requiring that we stage the rollout of changes in a manner that doesn't break deployed services during the upgrade.

The general proposal in the abstract is:

  1. In 0.a, add a new component or feature.
  2. In 0.b, enable that feature by default [1].
  3. In 0.c, require that the feature be enabled, deprecating the things it replaces.
  4. In 0.d, remove the deprecated old things (maybe just code).

Where: `a <= b <= c < d`

You might be thinking: "That's a lot of releases!", but I expect that for most rollouts `a = b = c = d - 1`.

[1] - If we want rollback safety, then in some cases we may want this to be `a < b`.

Some real examples

Milestones here are purely illustrative

Example: kbuffer

  1. In 0.3, introduce the kbuffer as a replacement for the activator.
  2. In 0.3, have the Route controller rewrite ClusterIngress resources from the activator service to the kbuffer service.
  3. In 0.4, exclude the activator components from the release (and either `--prune` should clean up, or a Job should; a sketch of such a Job follows).
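For the Job route in step 3, here is a minimal sketch of what a cleanup Job might look like; the Job name, image, and ServiceAccount are illustrative assumptions, and the ServiceAccount would need RBAC permission to delete the activator's resources:

```sh
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-activator          # hypothetical name
  namespace: knative-serving
spec:
  template:
    spec:
      serviceAccountName: controller   # assumed to be allowed to delete these
      restartPolicy: Never
      containers:
      - name: cleanup
        image: bitnami/kubectl         # any image with kubectl on its PATH works
        command:
        - kubectl
        - delete
        - deployment,service
        - activator
        - --namespace=knative-serving
        - --ignore-not-found
EOF
```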

Example: Reversing the flow of metrics

  1. In 0.3, introduce the metrics endpoint in queue-proxy.
  2. In 0.3, have the Revision controller rewrite Deployment resources to use the new sidecar.
  3. In 0.4, start to take advantage of the fact that all Revisions have metrics endpoints.

Example: CRD versioning

  1. In 0.10, introduce the v1beta1 format (served, but not the storage version, for rollback safety).
  2. In 0.11, make v1beta1 the storage format (existing objects need a forced update to migrate).
  3. In 0.12, stop serving v1alpha1.
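A minimal sketch of the 0.11 step using the multi-version support in `apiextensions.k8s.io/v1beta1` (the real Knative CRDs carry much more schema; only the `served`/`storage` toggles matter here):

```sh
cat <<EOF | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: services.serving.knative.dev
spec:
  group: serving.knative.dev
  names:
    kind: Service
    plural: services
  scope: Namespaced
  versions:
  - name: v1alpha1
    served: true     # 0.12 flips this to false
    storage: false   # 0.11: no longer the storage version
  - name: v1beta1
    served: true
    storage: true    # 0.11: new writes are persisted as v1beta1
EOF
```

Flipping `storage` only changes how new writes are persisted; objects already stored as v1alpha1 stay that way until rewritten, which is the forced update noted in step 2.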

The `knative-ingressgateway` is another example where we should employ this staged migration.

@knative-prow-robot added the area/test-and-release, kind/doc, kind/feature, and kind/process labels on Dec 5, 2018
@mattmoor (Member, Author) commented Dec 5, 2018

cc @vaikas-google @evankanderson

@mattmoor (Member, Author) commented Dec 5, 2018

cc @dprotaso

Perhaps this influences your thinking around how we'd roll out CRD sub-resources?

@mattmoor (Member, Author) commented Dec 5, 2018

cc @tcnghia

Can you think through how we rework the ingress gateway change so that it's safe?

@evankanderson (Member) commented

Can we change the upgrade tool to be `kubectl apply --prune --filename foo.yaml` rather than `kubectl apply --filename ...`? (Where `--prune` will clean up other resources in the namespace with the given label.)

@jonjohnsonjr (Contributor) commented

If we add a Job for releases that do part of an upgrade, we'll need to write the inverse of that Job to do the downgrade. These seem mutually exclusive, unless we write the Job in a way that knows if it's an upgrade or a downgrade (which seems hard).

With this approach, it seems like we may need different yamls for fresh install vs upgrade vs downgrade.

@mattmoor (Member, Author) commented Dec 9, 2018

@evankanderson Yeah, I had this in one of the examples:

> In 0.4, exclude the activator components from the release (and either `--prune` should clean up, or a Job should)

@jonjohnsonjr Ack. Rollback safety in general requires us to introduce a component the release before we enable it by default, which is why I included this caveat:

> [1] - If we want rollback safety, then in some cases we may want this to be `a < b`.

Looking at the duality of upgrade/downgrade, this makes sense: you want everyone off the new component before you remove it. If an upgrade adds it, then its downgrade would remove it. If the same upgrade also moves folks onto it, then the downgrade will race to move things off before it is deleted. This is the same reason we must stage component removals.

@mattmoor (Member, Author) commented Dec 9, 2018

Note that for us to start using `--prune`, we will have to label everything in a release before we start removing anything. From the `kubectl apply` help text:

> `--prune=false`: Automatically delete resource objects, including the uninitialized ones, that do not appear in the configs and are created by either apply or create --save-config. Should be used with either -l or --all.
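Concretely, assuming a hypothetical `serving.knative.dev/release` label stamped on every object in the manifest, the upgrade command might look like:

```sh
# Every object in serving.yaml would carry the (hypothetical) label:
#   metadata:
#     labels:
#       serving.knative.dev/release: "0.4"
#
# Selecting on label *presence* means the command doesn't change per release:
kubectl apply --prune -l serving.knative.dev/release -f serving.yaml
```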

@evankanderson (Member) commented

`--prune` should also work with the `--all` flag to completely pave over the namespace. This only works if we have a single YAML (or set of YAMLs) that completely defines the namespace contents, with no objects in other namespaces.
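In that mode no label is needed, at the cost of pruning anything in scope that the YAML omits (again assuming a hypothetical serving.yaml that fully defines the install):

```sh
kubectl apply --prune --all -f serving.yaml
```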

@mattmoor (Member, Author) commented

So this process is touched on here, and will be further codified in a template that @eallred-google is working on.

Beyond that, I think the serving operator will help roll out changes with non-trivial update mechanics without the multi-cycle staging, so we will evolve this over time.
